Authors:Sixun Dong, Wei Fan, Teresa Wu, Yanjie Fu
Abstract:
Time series forecasting traditionally relies on unimodal numerical inputs, which often struggle to capture high-level semantic patterns due to their dense and unstructured nature. While recent approaches have explored representing time series as text using large language models (LLMs), these methods remain limited by the discrete nature of token sequences and lack the perceptual intuition humans typically apply, such as interpreting visual patterns. In this paper, we propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives. Rather than using natural language or real-world images, we construct both modalities directly from numerical sequences. We then align these views in a shared semantic space via contrastive learning, enabling the model to capture richer and more complementary representations. Furthermore, we introduce a variate selection module that leverages the aligned representations to identify the most informative variables for multivariate forecasting. Extensive experiments on fifteen short-term and six long-term forecasting benchmarks demonstrate that our approach consistently outperforms strong unimodal and cross-modal baselines, highlighting the effectiveness of multimodal alignment in enhancing time series forecasting. Code is available at: https://github.com/Ironieser/TimesCLIP.
English Summary: This paper introduces a multimodal contrastive learning framework that transforms time series data into aligned visual and textual representations, achieving superior forecasting performance through enhanced semantic alignment and variate selection.
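As an illustration of the contrastive alignment step described above, the sketch below shows a standard CLIP-style symmetric InfoNCE loss between the visual and textual embeddings of the same batch of series; the function and variable names are hypothetical, and the actual TimesCLIP objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning two views of the same batch.

    vis_emb, txt_emb: (batch, dim) embeddings of the visual and textual
    views of each time series; row i of both tensors comes from series i.
    """
    vis = F.normalize(vis_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = vis @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(vis.size(0), device=vis.device)
    # Matching pairs lie on the diagonal; treat alignment as classification.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example with random stand-in embeddings.
vis = torch.randn(8, 128)
txt = torch.randn(8, 128)
print(clip_style_contrastive_loss(vis, txt).item())
```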
Authors:Yuqing Wang, Shangding Gu
Abstract:
Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improve performance, especially for complex tasks with limited prior knowledge. In this paper, we demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by $h_{\min}$, and prove that a smaller $h_{\min}$ can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation error of neural networks decreases as $h_{\min}$ increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime, applicable to a broad class of architectures, including transformers, without requiring Lipschitz smoothness. This framework further provides theoretical justification for the use of residual connections and function compositions in deep neural architectures. In the end, we conduct comprehensive experiments for supervised fine-tuning across various settings, including different optimization strategies, model sizes, and training datasets. The results consistently demonstrate that selecting data by maximizing pairwise distance significantly accelerates training and achieves comparable or better performance in LLMs across diverse datasets. Code and Datasets are available at the link: https://github.com/SafeRL-Lab/data-uniformity.
English: This research demonstrates that selecting more uniformly distributed data enhances training efficiency and performance in large language models by increasing the minimum pairwise distance between data points, which accelerates gradient descent convergence and reduces approximation error.
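To make the selection principle concrete, the following sketch greedily picks a subset that keeps the minimum pairwise distance $h_{\min}$ large (farthest-point sampling over embeddings); it illustrates the idea only and is not the authors' selection pipeline.

```python
import numpy as np

def select_max_min_distance(X, k, seed=0):
    """Greedy farthest-point selection: pick k rows of X so that the
    minimum pairwise distance (h_min) of the chosen subset stays large.

    X: (n, d) feature matrix (e.g., embeddings of training examples).
    Returns indices of the selected subset.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]              # arbitrary starting point
    dist_to_sel = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist_to_sel))          # point farthest from the current set
        selected.append(nxt)
        dist_to_sel = np.minimum(dist_to_sel, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

X = np.random.randn(1000, 32)
idx = select_max_min_distance(X, k=100)
sub = X[idx]
d = np.linalg.norm(sub[:, None] - sub[None, :], axis=-1)
h_min = d[~np.eye(len(idx), dtype=bool)].min()
print(f"h_min of selected subset: {h_min:.3f}")
```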
Authors:Shai Yehezkel, Omer Dahary, Andrey Voynov, Daniel Cohen-Or
Abstract:
Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
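For intuition on where a guidance scheduler plugs in, the snippet below applies classifier-free guidance with a time-varying scale w(t); the cosine anneal is a stand-in assumption, since the paper learns the schedule from the conditional noisy signal rather than using a fixed curve.

```python
import math
import torch

def cfg_denoise_step(eps_uncond, eps_cond, t, total_steps, w_max=7.5, w_min=1.0):
    """Classifier-free guidance with a time-varying (annealed) scale w(t).

    The learned scheduling policy from the paper is not reproduced here; a
    cosine anneal from w_max down to w_min over the sampling trajectory
    stands in to show where a scheduler plugs into the CFG update.
    """
    progress = t / max(total_steps - 1, 1)                  # 0 at the first step, 1 at the last
    w_t = w_min + 0.5 * (w_max - w_min) * (1 + math.cos(math.pi * progress))
    return eps_uncond + w_t * (eps_cond - eps_uncond)       # standard CFG combination

# eps_uncond / eps_cond would come from the denoiser with empty and full prompts.
eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
print(cfg_denoise_step(eps_u, eps_c, t=0, total_steps=50).shape)
```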
Authors:Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Abstract:
Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and obscure their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP--a model trained with a Sigmoid loss--and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
English: TTA-VLM is introduced as a comprehensive benchmark to address limitations in current test-time adaptation research for vision-language models, enabling fair and holistic evaluations across multiple methods, datasets, and metrics beyond just accuracy.
Authors:Shiao Wang, Ju Huang, Qingchuan Ma, Jinfeng Gao, Chunyi Xu, Xiao Wang, Lan Chen, Bo Jiang
Abstract:
Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term FE108 and FELT V2 datasets, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack
English: This paper introduces Mamba-FETrack V2, an efficient RGB-Event object tracking framework that uses a Vision Mamba network and a prompt generator to achieve superior performance with reduced computational complexity.
Authors:Xue Wen Tan, Stanley Kok
Abstract:
Every publicly traded U.S. company files an annual 10-K report containing critical insights into financial health and risk. We propose Tiny eXplainable Risk Assessor (TinyXRA), a lightweight and explainable transformer-based model that automatically assesses company risk from these reports. Unlike prior work that relies solely on the standard deviation of excess returns (adjusted for the Fama-French model), which indiscriminately penalizes both upside and downside risk, TinyXRA incorporates skewness, kurtosis, and the Sortino ratio for more comprehensive risk assessment. We leverage TinyBERT as our encoder to efficiently process lengthy financial documents, coupled with a novel dynamic, attention-based word cloud mechanism that provides intuitive risk visualization while filtering irrelevant terms. This lightweight design ensures scalable deployment across diverse computing environments with real-time processing capabilities for thousands of financial documents, which is essential for production systems with constrained computational resources. We employ triplet loss for risk quartile classification, improving over pairwise loss approaches in existing literature by capturing both the direction and magnitude of risk differences. Our TinyXRA achieves state-of-the-art predictive accuracy across seven test years on a dataset spanning 2013-2024, while providing transparent and interpretable risk assessments. We conduct comprehensive ablation studies to evaluate our contributions and assess model explanations both quantitatively by systematically removing highly attended words and sentences, and qualitatively by examining explanation coherence. The paper concludes with findings, practical implications, limitations, and future research directions. Our code is available at https://github.com/Chen-XueWen/TinyXRA.
English: TinyXRA is a lightweight, explainable transformer model that assesses company risk from 10-K reports using advanced metrics and visualization, achieving state-of-the-art accuracy with real-time processing capabilities.
Authors:Mario Koddenbrock, Rudolf Hoffmann, David Brodmann, Erik Rodner
Abstract:
In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.
English: Deepbench is a framework that uses a large language model to generate domain-specific image corruptions for evaluating the robustness of vision-language models, revealing significant performance variations across different real-world domains and emphasizing the need for targeted assessments.
Authors:Min-Yeong Park, Won-Jeong Lee, Seong Tae Kim, Gyeong-Moon Park
Abstract:
Recently, forecasting future abnormal events has emerged as an important scenario to tackle real-world necessities. However, the solution of predicting specific future time points when anomalies will occur, known as Anomaly Prediction (AP), remains under-explored. Existing methods dealing with time series data fail in AP, focusing only on immediate anomalies or failing to provide precise predictions for future anomalies. To address the AP task, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal-adaptive prompts. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies. Our implementation code is available at https://github.com/KU-VGI/AP.
English: The A2P framework, integrating Anomaly-Aware Forecasting and Synthetic Anomaly Prompting, effectively predicts future anomaly occurrences by learning anomaly relationships and simulating diverse patterns, outperforming existing methods in real-world datasets.
Authors:Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Abstract:
Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.
English: This paper introduces an intelligent system that uses a denoising diffusion model and reinforcement learning to automate both plane localization and diagnosis of congenital uterine anomalies from 3D ultrasound data, demonstrating high effectiveness in experiments.
Authors:Heitor R. Medeiros, Hossein Sharifi-Noghabi, Gabriel L. Oliveira, Saghar Irandoust
Abstract:
Real-world time series often exhibit a non-stationary nature, degrading the performance of pre-trained forecasting models. Test-Time Adaptation (TTA) addresses this by adjusting models during inference, but existing methods typically update the full model, increasing memory and compute costs. We propose PETSA, a parameter-efficient method that adapts forecasters at test time by only updating small calibration modules on the input and output. PETSA uses low-rank adapters and dynamic gating to adjust representations without retraining. To maintain accuracy despite limited adaptation capacity, we introduce a specialized loss combining three components: (1) a robust term, (2) a frequency-domain term to preserve periodicity, and (3) a patch-wise structural term for structural alignment. PETSA improves the adaptability of various forecasting backbones while requiring fewer parameters than baselines. Experimental results on benchmark datasets show that PETSA achieves competitive or better performance across all horizons. Our code is available at: https://github.com/BorealisAI/PETSA
English Summary: PETSA introduces a parameter-efficient test-time adaptation method for time series forecasting that updates only small calibration modules using low-rank adapters and a specialized loss function, achieving competitive performance with fewer parameters than existing approaches.
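A minimal sketch of the kind of calibration module described above, assuming a gated low-rank residual correction around a frozen backbone; module names, shapes, and the gating form are illustrative rather than the official PETSA implementation.

```python
import torch
import torch.nn as nn

class LowRankCalibrator(nn.Module):
    """Gated low-rank residual correction applied to the forecaster's
    input or output, leaving the frozen backbone untouched."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        nn.init.zeros_(self.up.weight)           # start as an identity mapping

    def forward(self, x):                         # x: (batch, length, dim)
        correction = self.up(self.down(x))
        return x + self.gate(x) * correction

# At test time, only the calibrator's parameters would be updated, e.g.:
# for p in backbone.parameters(): p.requires_grad_(False)
# optimizer = torch.optim.Adam(calibrator.parameters(), lr=1e-3)
calibrator = LowRankCalibrator(dim=32)
print(calibrator(torch.randn(4, 96, 32)).shape)   # torch.Size([4, 96, 32])
```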
Authors:Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Abstract:
Despite the growing reliance on fairness benchmarks to evaluate language models, the datasets that underpin these benchmarks remain critically underexamined. This survey addresses that overlooked foundation by offering a comprehensive analysis of the most widely used fairness datasets in language model research. To ground this analysis, we characterize each dataset across key dimensions, including provenance, demographic scope, annotation design, and intended use, revealing the assumptions and limitations baked into current evaluation practices. Building on this foundation, we propose a unified evaluation framework that surfaces consistent patterns of demographic disparities across benchmarks and scoring metrics. Applying this framework to sixteen popular datasets, we uncover overlooked biases that may distort conclusions about model fairness and offer guidance on selecting, combining, and interpreting these resources more effectively and responsibly. Our findings highlight an urgent need for new benchmarks that capture a broader range of social contexts and fairness notions. To support future research, we release all data, code, and results at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets, fostering transparency and reproducibility in the evaluation of language model fairness.
English: This survey critically analyzes widely used fairness datasets for language models, revealing embedded biases and proposing a unified evaluation framework that uncovers demographic disparities while advocating for more comprehensive benchmarks.
Authors:Vikram Rangarajan, Shishira Maiya, Max Ehrlich, Abhinav Shrivastava
Abstract:
Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared-Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at https://github.com/VikramRangarajan/SIEDD .
English Summary: SIEDD introduces a novel architecture that accelerates implicit neural representation encoding by 20-30 times while maintaining reconstruction quality and full coordinate-based control, making high-fidelity neural video compression practical for real-world deployment.
Authors:Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, Yunfei Ge, Peiqing Cong, Guannan He, Zhaoming Han, Ruocheng Yin, Jingxiang Guo, Lunkai Lin, Tianling Xu, Hongzhe Bi, Xuewu Lin, Tianwei Lin, Shujie Luo, Keyu Li, Ziyan Zhao, Ke Fan, Heyang Xu, Bo Peng, Wenlong Gao, Dongjiang Li, Feng Jin, Hui Shen, Jinming Li, Chaowei Cui, Yu Chen, Yaxin Peng, Lingdong Zeng, Wenlong Dong, Tengfei Li, Weijie Ke, Jun Chen, Erdemt Bao, Tian Lan, Tenglong Liu, Jin Yang, Huiping Zhuang, Baozhi Jia, Shuai Zhang, Zhengfeng Zou, Fangheng Guan, Tianyi Jia, Ke Zhou, Hongjiu Zhang, Yating Han, Cheng Fang, Yixian Zou, Chongyang Xu, Qinglun Zhang, Shen Cheng, Xiaohe Wang, Ping Tan, Haoqiang Fan, Shuaicheng Liu, Jiaheng Chen, Chuxuan Huang, Chengliang Lin, Kaijun Luo, Boyu Yue, Yi Liu, Jinyu Chen, Zichang Tan, Liming Deng, Shuo Xu, Zijian Cai, Shilong Yin, Hao Wang, Hongshan Liu, Tianyang Li, Long Shi, Ran Xu, Huilin Xu, Zhengquan Zhang, Congsheng Xu, Jinchang Yang, Feng Xu
Abstract:
Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To advance this goal, we launched the RoboTwin Dual-Arm Collaboration Challenge at the 2nd MEIS Workshop, CVPR 2025. Built on the RoboTwin Simulation platform (1.0 and 2.0) and the AgileX COBOT-Magic Robot platform, the competition consisted of three stages: Simulation Round 1, Simulation Round 2, and a final Real-World Round. Participants tackled a total of 17 dual-arm manipulation tasks, covering rigid, deformable, and tactile-based scenarios. The challenge attracted 64 global teams and over 400 participants, producing top-performing solutions like SEM and AnchorDP3 and generating valuable insights into generalizable bimanual policy learning. This report outlines the competition setup, task design, evaluation methodology, key findings, and future directions, aiming to support future research on robust and generalizable bimanual manipulation policies. The Challenge Webpage is available at https://robotwin-benchmark.github.io/cvpr-2025-challenge/.
English: The RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 advanced Embodied AI by engaging global teams in 17 dual-arm manipulation tasks, yielding key insights for generalizable bimanual policies through simulation and real-world stages.
Authors:Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo
Abstract:
Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared $U$-matrices while enabling effective merging of the expert-specific $V$ components. Specifically, Sub-MoE consists of two innovative phases: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first enforces Experts Union Decomposition to derive the shared $U$-matrix across experts in the same group, then pursues frequency-based merging for individual $V$-matrices, and finalizes expert reconstruction using the merged $V$-matrix. In this way, experts are aligned and fused in a shared subspace, and the approach can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5|3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%|86% of original performance with 25%|50% expert reduction on Mixtral-8x7B in zero-shot benchmarks. Code will be released at https://github.com/lliai/MoERazor.
English: Sub-MoE is a novel compression framework that addresses parameter conflicts in Mixture of Experts LLMs by clustering experts and merging them in a shared subspace using joint SVD, achieving significant efficiency gains while maintaining high performance.
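The core decomposition step can be illustrated as follows: a joint truncated SVD over concatenated expert weights yields a shared U factor, and the expert-specific V blocks are then merged (here by a plain mean, standing in for the paper's frequency-based merging). This is a simplified sketch, not the released MoERazor code.

```python
import torch

def subspace_merge(expert_weights, rank):
    """Joint truncated SVD over concatenated expert weights in one cluster,
    followed by merging of the expert-specific right factors.

    expert_weights: list of (d_out, d_in) tensors for experts in one cluster.
    Returns a single merged (d_out, d_in) weight.
    """
    k = len(expert_weights)
    d_out, d_in = expert_weights[0].shape
    W_cat = torch.cat(expert_weights, dim=1)             # (d_out, k * d_in)
    U, S, Vh = torch.linalg.svd(W_cat, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank]    # shared subspace
    # Split the right factor back into per-expert blocks and merge them.
    V_blocks = Vh_r.reshape(rank, k, d_in)               # (rank, k, d_in)
    V_merged = V_blocks.mean(dim=1)                      # plain mean stands in here
    return U_r @ torch.diag(S_r) @ V_merged

experts = [torch.randn(64, 32) for _ in range(4)]
merged = subspace_merge(experts, rank=16)
print(merged.shape)  # torch.Size([64, 32])
```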
Authors:Haoran Li, Muhao Guo, Marija Ilic, Yang Weng, Guangchun Ruan
Abstract:
Accurate residential load forecasting is critical for power system reliability with rising renewable integration and demand-side flexibility. However, most statistical and machine learning models treat external factors, such as weather, calendar effects, and pricing, as extra input, ignoring their heterogeneity, and thus limiting the extraction of useful external information. We propose a paradigm shift: external data should serve as meta-knowledge to dynamically adapt the forecasting model itself. Based on this idea, we design a meta-representation framework using hypernetworks that modulate selected parameters of a base Deep Learning (DL) model in response to external conditions. This provides both expressivity and adaptability. We further integrate a Mixture-of-Experts (MoE) mechanism to enhance efficiency through selective expert activation, while improving robustness by filtering redundant external inputs. The resulting model, dubbed Meta Mixture of Experts for External data (M2oE2), achieves substantial improvements in accuracy and robustness with limited additional overhead, outperforming existing state-of-the-art methods in diverse load datasets. The dataset and source code are publicly available at https://github.com/haorandd/M2oE2_load_forecast.git.
English Summary: The authors propose a novel M2oE2 framework that uses external data as meta-knowledge to dynamically adapt forecasting models through hypernetworks and mixture-of-experts, achieving superior accuracy and robustness in residential load prediction with minimal overhead.
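A toy sketch of the meta-representation idea, assuming a hypernetwork that maps external features to FiLM-style scale/shift parameters for one hidden layer of a base forecaster; the MoE routing over multiple hypernetwork experts and all layer names are simplifications.

```python
import torch
import torch.nn as nn

class HyperModulatedForecaster(nn.Module):
    """External features (weather, calendar, price, ...) are mapped by a
    hypernetwork to scale/shift parameters that modulate a hidden layer of
    the base forecaster, instead of being concatenated as extra input."""
    def __init__(self, lookback, horizon, hidden=64, ext_dim=8):
        super().__init__()
        self.encoder = nn.Linear(lookback, hidden)
        self.head = nn.Linear(hidden, horizon)
        self.hyper = nn.Linear(ext_dim, 2 * hidden)   # emits (gamma, beta)

    def forward(self, load_hist, ext):                # (B, lookback), (B, ext_dim)
        h = torch.relu(self.encoder(load_hist))
        gamma, beta = self.hyper(ext).chunk(2, dim=-1)
        h = (1 + gamma) * h + beta                    # external data adapts the model itself
        return self.head(h)

model = HyperModulatedForecaster(lookback=96, horizon=24)
out = model(torch.randn(4, 96), torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 24])
```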
Authors:Gabriel Iturra-Bocaz, Felipe Bravo-Marquez
Abstract:
Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams.
This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.
English Summary: RiverText is a Python library designed for training and evaluating incremental word embeddings from text streams, addressing the limitations of static models by dynamically updating word representations using techniques like Skip-gram and CBOW within a PyTorch framework.
Authors:Shahad Hardan, Darya Taratynova, Abdelmajid Essofi, Karthik Nandakumar, Mohammad Yaqub
Abstract:
Privacy preservation in AI is crucial, especially in healthcare, where models rely on sensitive patient data. In the emerging field of machine unlearning, existing methodologies struggle to remove patient data from trained multimodal architectures, which are widely used in healthcare. We propose Forget-MI, a novel machine unlearning method for multimodal medical data, by establishing loss functions and perturbation techniques. Our approach unlearns unimodal and joint representations of the data requested to be forgotten while preserving knowledge from the remaining data and maintaining comparable performance to the original model. We evaluate our results using performance on the forget dataset, performance on the test dataset, and Membership Inference Attack (MIA), which measures the attacker's ability to distinguish the forget dataset from the training dataset. Our model outperforms the existing approaches that aim to reduce MIA and the performance on the forget dataset while keeping an equivalent performance on the test set. Specifically, our approach reduces MIA by 0.202 and decreases AUC and F1 scores on the forget set by 0.221 and 0.305, respectively. Additionally, our performance on the test set matches that of the retrained model, while allowing forgetting. Code is available at https://github.com/BioMedIA-MBZUAI/Forget-MI.git
English: The proposed Forget-MI method effectively removes sensitive patient data from multimodal AI models in healthcare by unlearning specific representations while maintaining performance on retained data and outperforming existing approaches in reducing membership inference attacks and forget set metrics.
Authors:Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge
Abstract:
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
English Summary: CRISP-SAM2 is a novel multi-organ segmentation model that enhances medical image analysis through cross-modal interaction and semantic prompting, achieving superior performance by improving detail accuracy and eliminating reliance on geometric prompts.
Authors:Yu Zheng, Boyang Gong, Fanye Kong, Yueqi Duan, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jiwen Lu, Jie Zhou
Abstract:
In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: https://github.com/yzheng97/CDAL.
English Summary: This paper introduces Counterfactually Decoupled Attention Learning (CDAL), a novel method that uses causal modeling to separate discriminative model artifacts from confounding biases, significantly improving open-world model attribution performance for unseen attacks with minimal computational cost.
Authors:Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
Abstract:
We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors' claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.
English: This reproduction study confirms Ortu et al.'s findings about attention mechanisms handling factual and counterfactual information in language models, while extending the research to reveal limitations in attention head specialization across larger models, sensitivity to prompt structures, and domain-specific effectiveness of their proposed ablation method.
Authors:Amir Aghdam, Vincent Tao Hu, Björn Ommer
Abstract:
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas--the most diverse benchmark of fine-grained actions across multiple sports--where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
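To illustrate the alignment formulation, the sketch below scores one class by running Dynamic Time Warping over the cosine-distance matrix between frame embeddings and that class's ordered sub-action text embeddings; random vectors stand in for the image-language model features, and the scoring details are assumptions rather than the exact ActAlign procedure.

```python
import numpy as np

def dtw_alignment_score(frame_emb, step_emb):
    """Align frame embeddings to an ordered list of sub-action text
    embeddings with DTW; a higher (less negative) score means the video
    follows that class's sub-action sequence more closely."""
    F = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    S = step_emb / np.linalg.norm(step_emb, axis=1, keepdims=True)
    cost = 1.0 - F @ S.T                          # (n_frames, n_steps) cosine distance
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return -D[n, m]                               # negative cumulative alignment cost

frames = np.random.randn(40, 512)                  # one clip, 40 frame embeddings
classes = {c: np.random.randn(np.random.randint(3, 7), 512) for c in ["dunk", "layup"]}
pred = max(classes, key=lambda c: dtw_alignment_score(frames, classes[c]))
print(pred)
```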
Authors:Marc Bara Iniesta
Abstract:
The ambiguity function is fundamental to radar waveform design, characterizing range and Doppler resolution capabilities. However, its traditional formulation involves non-differentiable operations, preventing integration with gradient-based optimization methods and modern machine learning frameworks. This paper presents the first complete mathematical framework and computational implementation for differentiable radar ambiguity functions. Our approach addresses the fundamental technical challenges that have prevented the radar community from leveraging automatic differentiation: proper handling of complex-valued gradients using Wirtinger calculus, efficient computation through parallelized FFT operations, numerical stability throughout cascaded operations, and composability with arbitrary differentiable operations. We term this approach GRAF (Gradient-based Radar Ambiguity Functions), which reformulates the ambiguity function computation to maintain mathematical equivalence while enabling gradient flow through the entire pipeline. The resulting implementation provides a general-purpose differentiable ambiguity function compatible with modern automatic differentiation frameworks, enabling new research directions including neural network-based waveform generation with ambiguity constraints, end-to-end optimization of radar systems, and integration of classical radar theory with modern deep learning. We provide complete implementation details and demonstrate computational efficiency suitable for practical applications. This work establishes the mathematical and computational foundation for applying modern machine learning techniques to radar waveform design, bridging classical radar signal processing with automatic differentiation frameworks.
English Summary: This paper introduces GRAF, a novel differentiable radar ambiguity function framework that enables gradient-based optimization and integration with machine learning for advanced radar waveform design.
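A minimal sketch of a differentiable discrete ambiguity function in PyTorch: each delay row is formed by a circular lag product and an FFT over slow time, so gradients flow back to the waveform. This is a plain autograd reimplementation for intuition, not the optimized GRAF pipeline; the circular delay and the crude sidelobe objective are simplifying assumptions.

```python
import torch

def ambiguity_function(s):
    """Narrowband discrete ambiguity magnitude |A[tau, f]| where
    A[tau, :] = FFT_n( s[n] * conj(s[n - tau]) ), with circular delay.
    The Python loop over delays is kept for clarity; GRAF parallelizes this.

    s: complex tensor of shape (N,), e.g. a baseband radar waveform.
    Returns a real (N, N) tensor over delay x Doppler bins.
    """
    N = s.shape[0]
    rows = []
    for tau in range(N):
        lagged = torch.roll(torch.conj(s), shifts=tau)   # circular delay for simplicity
        rows.append(torch.fft.fft(s * lagged))
    return torch.stack(rows).abs()

# Gradients flow back to the waveform phases, enabling gradient-based design.
N = 64
phase = torch.randn(N, requires_grad=True)
s = torch.polar(torch.ones(N), phase)                    # constant-modulus waveform
A = ambiguity_function(s)
objective = A.pow(2).sum() - A[0, 0].pow(2)              # crude proxy: energy away from the mainlobe
objective.backward()
print(phase.grad.shape)  # torch.Size([64])
```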
Authors:Sina Tabakhi, Haiping Lu
Abstract:
A key challenge in learning from multimodal biological data is missing modalities, where all data from some modalities are missing for some patients. Current fusion methods address this by excluding patients with missing modalities, imputing missing modalities, or making predictions directly with partial modalities. However, they often struggle with diverse missing-modality patterns and the exponential growth of the number of such patterns as the number of modalities increases. To address these limitations, we propose MAGNET (Missing-modality-Aware Graph neural NETwork) for direct prediction with partial modalities, which introduces a patient-modality multi-head attention mechanism to fuse lower-dimensional modality embeddings based on their importance and missingness. MAGNET's complexity increases linearly with the number of modalities while adapting to missing-pattern variability. To generate predictions, MAGNET further constructs a patient graph with fused multimodal embeddings as node features and the connectivity determined by the modality missingness, followed by a conventional graph neural network. Experiments on three public multiomics datasets for cancer classification, with real-world instead of artificial missingness, show that MAGNET outperforms the state-of-the-art fusion methods. The data and code are available at https://github.com/SinaTabakhi/MAGNET.
English: MAGNET is a novel graph neural network that effectively handles diverse missing-modality patterns in multimodal biological data by using patient-modality attention and graph-based fusion, achieving superior performance in cancer classification with linearly scalable complexity.
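The missingness-aware fusion step might look roughly like the sketch below: the available modality embeddings of each patient attend to one another while missing modalities are masked out, keeping the cost linear in the number of modalities. The attention layout is assumed for illustration, and the downstream patient graph and GNN are omitted.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Fuse per-patient modality embeddings with attention, masking out
    modalities that are missing for that patient."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, mod_emb, present):
        # mod_emb: (patients, n_modalities, dim); present: bool mask, True = available
        pad_mask = ~present                              # True = ignore this modality
        fused, _ = self.attn(mod_emb, mod_emb, mod_emb, key_padding_mask=pad_mask)
        fused = fused.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        denom = present.sum(dim=1, keepdim=True).clamp(min=1)
        return fused.sum(dim=1) / denom                  # (patients, dim) patient embedding

x = torch.randn(6, 3, 32)                                # 6 patients, 3 omics modalities
present = torch.rand(6, 3) > 0.3                         # real-world missingness pattern
present[:, 0] = True                                     # ensure at least one modality per patient
print(ModalityFusion(32)(x, present).shape)              # torch.Size([6, 32])
```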
Authors:Kamil Faber, Marcin Pietroń, Dominik Żurek, Roberto Corizzo
Abstract:
The recently proposed xLSTM is a powerful model that leverages expressive multiplicative gating and residual connections, providing the temporal capacity needed for long-horizon forecasting and representation learning. This architecture has demonstrated success in time series forecasting, lossless compression, and even large-scale language modeling tasks, where its linear memory footprint and fast inference make it a viable alternative to Transformers. Despite its growing popularity, no prior work has explored xLSTM for anomaly detection. In this work, we fill this gap by proposing xLSTMAD, the first anomaly detection method that integrates a full encoder-decoder xLSTM architecture, purpose-built for multivariate time series data. Our encoder processes input sequences to capture historical context, while the decoder comes in two variants. In the forecasting variant (xLSTMAD-F), the decoder iteratively generates forecasts of future values, while the reconstruction variant (xLSTMAD-R) reconstructs the input time series from its encoded counterpart. We investigate the performance of two loss functions, Mean Squared Error (MSE) and Soft Dynamic Time Warping (SoftDTW), to consider local reconstruction fidelity and global sequence alignment, respectively. We evaluate our method on the comprehensive TSB-AD-M benchmark, which spans 17 real-world datasets, using state-of-the-art challenging metrics such as VUS-PR. In our results, xLSTMAD showcases state-of-the-art accuracy, outperforming 23 popular anomaly detection baselines. Our paper is the first work revealing the powerful modeling capabilities of xLSTM for anomaly detection, paving the way for exciting new developments on this subject. Our code is available at: https://github.com/Nyderx/xlstmad
English: This paper introduces xLSTMAD, the first anomaly detection method using a full encoder-decoder xLSTM architecture for multivariate time series, which achieves state-of-the-art accuracy on benchmark datasets and outperforms 23 baseline methods.
Authors:Ramya Hebbalaguppe, Tamoghno Kandar, Abhinav Nagpal, Chetan Arora
Abstract:
Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications.
We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of the test-time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) to further maintain the quality of prompts during TPT, we propose a novel regularization loss to reduce intra-class distance and increase inter-class distance between the learnt prompts; and (3) through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR'24), 6.78 for DiffTPT (CVPR'23), and 8.43 for PromptAlign (NeurIPS'23). The code is publicly accessible at: https://github.com/rhebbalaguppe/TCA_PromptWithoutPanic.
English: Vision-language models can be enhanced through test-time prompt tuning, but this often leads to overfitting and poor confidence calibration, which our proposed method mitigates by using large language model initialization and a novel regularization loss to significantly improve calibration across multiple datasets.
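A hedged sketch of a regularizer in the spirit of the second contribution: it pulls same-class prompt-conditioned features toward their centroid (intra-class) and pushes class centroids apart (inter-class). The margin form and tensor layout are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def prompt_dispersion_loss(class_feats, margin=0.5):
    """class_feats: (n_classes, n_views, dim) features produced with the
    learnt prompts (e.g., across augmented views or attribute prompts).
    Penalize spread within a class and closeness between class centroids."""
    feats = F.normalize(class_feats, dim=-1)
    centroids = feats.mean(dim=1)                               # (C, dim)
    intra = (feats - centroids.unsqueeze(1)).pow(2).sum(-1).mean()
    dists = torch.cdist(centroids, centroids)                   # (C, C) pairwise distances
    C = centroids.size(0)
    off_diag = dists[~torch.eye(C, dtype=torch.bool)]
    inter = F.relu(margin - off_diag).mean()                    # penalize centroids closer than margin
    return intra + inter

print(prompt_dispersion_loss(torch.randn(10, 4, 512)))
```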
Authors:Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun
Abstract:
Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding linear modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding nonlinear Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at https://github.com/Hyun1A/CPE
English summary: This paper introduces Concept Pinpoint Eraser (CPE), a novel framework that uses nonlinear Residual Attention Gates and adversarial training to effectively erase target concepts from diffusion models while preserving diverse remaining concepts and maintaining robustness against attacks.
Authors:Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Lin Mei, Peiyi Shen, Liang Zhang
Abstract:
Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at sample-level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.
English: This paper introduces the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU), which automatically identifies and refines detrimental concepts to improve both model interpretability and accuracy, achieving up to a 2.64% accuracy gain across multiple datasets.
Authors:Minchao Jiang, Shunyu Jia, Jiaming Gu, Xiaoyuan Lu, Guangming Zhu, Anqi Dong, Liang Zhang
Abstract:
3D Gaussian Splatting (3DGS) has become the workhorse of high-quality, real-time rendering for novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high-dimensional CLIP features while preserving semantic unambiguity. Extensive experiments demonstrate effectiveness of VoteSplat in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, hierarchical segmentation, and ablation studies. Our code is available at https://sy-ja.github.io/votesplat/
Authors:Yanran Wu, Inez Hua, Yi Ding
Abstract:
Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill in this gap, we present SCARF, the first general framework that evaluates water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at https://github.com/jojacola/SCARF.
English: SCARF is the first framework to assess computing's water impact by incorporating spatial and temporal water stress variations, revealing optimization opportunities for water-sustainable computing through case studies.
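As a toy illustration of the AWI idea, the snippet below weights each interval's water consumption by a local water-stress index; the exact AWI formula and stress data are defined by SCARF itself, so this weighted sum is only an assumption meant to convey why timing and siting matter.

```python
import numpy as np

def adjusted_water_impact(consumption_liters, stress_index):
    """Weight each interval's water consumption by the local water-stress
    index at that time and place (illustrative stand-in for AWI)."""
    consumption = np.asarray(consumption_liters, dtype=float)
    stress = np.asarray(stress_index, dtype=float)
    return float((consumption * stress).sum())

# Same total consumption, different timing/siting -> different adjusted impact.
hourly_use = [10.0, 10.0, 10.0, 10.0]
print(adjusted_water_impact(hourly_use, [0.2, 0.2, 0.9, 0.9]))  # 22.0
print(adjusted_water_impact(hourly_use, [0.2, 0.2, 0.2, 0.2]))  # 8.0
```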
Authors:Brian Mak, Jeffrey Flanigan
Abstract:
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972, Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.
The Residual Matrix Transformer (RMT) replaces the standard residual stream with an outer product memory matrix, achieving superior efficiency and performance with fewer resources while enabling independent scaling of the residual stream.
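For intuition about the outer-product memory that replaces the residual stream, here is the classical Kohonen/Anderson-style associative write and read; recall is approximate because stored keys interfere. This generic sketch is not the RMT layer itself, whose read/write operators and scaling are defined in the paper.

```python
import torch
import torch.nn.functional as F

def write_memory(M, keys, values):
    """Outer-product associative write: each key/value pair adds a rank-1
    update to the memory matrix.

    M: (d_v, d_k) memory, keys: (n, d_k), values: (n, d_v).
    """
    return M + values.t() @ keys                       # sum of outer products

def read_memory(M, keys):
    return keys @ M.t()                                # (n, d_v) retrieved values

d_k, d_v = 64, 64
keys = F.normalize(torch.randn(4, d_k), dim=-1)
values = torch.randn(4, d_v)
M = write_memory(torch.zeros(d_v, d_k), keys, values)
recalled = read_memory(M, keys)
# Cosine similarity of recalled vs. stored values; high but not exact due to key interference.
print(F.cosine_similarity(recalled, values, dim=-1))
```

Note that the memory size (d_v x d_k) can be grown independently of the layer widths, which mirrors the claim that the residual stream can be scaled independently of compute and model size.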
Authors:Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Abstract:
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds into backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality, we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The codebase for this work can be found at https://github.com/bakerhassan/WSOS.
Chinese: 本文提出了一种弱监督的二元目标分割方法,利用图像级目标存在标签并通过将分割目标与聚类背景融合生成反事实图像,在声纳等专业领域和自然图像上实现了优于无监督基线的性能,且无需预训练网络或对抗训练。
English: This paper introduces a weakly supervised method for binary object segmentation that uses image-level object presence labels and creates counterfactual images by blending segmented objects with clustered backgrounds, achieving improved performance on specialized domains like sonar and natural images without relying on pretrained networks or adversarial training.
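The counterfactual construction at the core of the method reduces to alpha-blending a segmented object into a background drawn from a target cluster. Below is a minimal sketch of that blending step only, with random toy arrays standing in for real sonar or natural images and a hand-made mask standing in for the masking network's output.

```python
import numpy as np

def make_counterfactual(object_img, mask, background_img):
    """Blend a segmented object into a background chosen from a target cluster.

    object_img, background_img: float arrays of shape (H, W, 3) in [0, 1]
    mask: float array of shape (H, W, 1), soft segmentation output in [0, 1]
    """
    return mask * object_img + (1.0 - mask) * background_img

# Toy example with random images and a circular mask.
H = W = 64
obj = np.random.rand(H, W, 3)
bg = np.random.rand(H, W, 3)
yy, xx = np.mgrid[:H, :W]
mask = (((yy - H / 2) ** 2 + (xx - W / 2) ** 2) < (H / 4) ** 2).astype(float)[..., None]

counterfactual = make_counterfactual(obj, mask, bg)
print(counterfactual.shape)  # (64, 64, 3)
```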
Authors:Hassan Baker, Austin J. Brockmeier
Abstract:
Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search for abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from the BraTS2021 and MSLUB datasets and T1-weighted images from the ATLAS and WMH datasets. We show that it outperforms the state-of-the-art in unsupervised segmentation. The codebase for this work can be found on our \href{https://github.com/bakerhassan/Patch2Loc}{GitHub page}.
中文: 本研究提出Patch2Loc这一无监督方法,通过训练神经网络预测脑部MRI图像中斑块的位置,并利用预测误差识别病变区域,在性能上超越了现有最先进的无监督分割技术。
English: This study introduces Patch2Loc, an unsupervised method that detects brain lesions in MRI by training a neural network to predict patch locations and identifying abnormalities through prediction errors, outperforming current state-of-the-art techniques.
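The detection principle is simple enough to sketch: train a regressor that maps a patch to its normalized location in the slice, then score patches at inference by their location-prediction error, which supplies the heatmap. The tiny MLP below is a hypothetical stand-in for the paper's network, and the patch size and location encoding are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchLocator(nn.Module):
    """Maps an image patch to its normalized (row, col) location in the slice."""

    def __init__(self, patch_size=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(patch_size * patch_size, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, patch):
        return self.net(patch)

def anomaly_score(model, patch, true_loc):
    """Patches from abnormal tissue tend to be mislocalized, so the squared
    location-prediction error serves as the anomaly heat value."""
    with torch.no_grad():
        pred = model(patch)
    return ((pred - true_loc) ** 2).sum(dim=-1)

model = PatchLocator()
patch = torch.rand(1, 1, 16, 16)          # one grayscale patch
true_loc = torch.tensor([[0.25, 0.75]])   # its known position in the slice
print(anomaly_score(model, patch, true_loc))
```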
Authors:Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu
Abstract:
Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff.
中文: 本文提出MoDiff这一创新框架,通过调制量化和误差补偿加速扩散模型,在训练后量化中将激活位从8位降至3位且不损失性能。
English: This paper introduces MoDiff, a novel framework that accelerates diffusion models through modulated quantization and error compensation, effectively reducing activation bits from 8 to 3 without performance loss in post-training quantization.
Authors:Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei
Abstract:
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling it to determine the minimal number of samples needed to satisfy predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while remaining on par with or better than state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning. The source code is publicly available at https://github.com/Albertwyk/OptScale.
中文摘要:本文提出了OptScale概率框架,首次从理论上推导出实现目标推理性能所需的最小采样数量,在显著降低计算开销的同时保持最优性能,为LLM的高效推理提供了原理性指导。
English Summary: This paper introduces OptScale, a probabilistic framework that theoretically determines the minimum number of parallel samples needed for efficient inference-time scaling in LLMs, significantly reducing computational overhead while maintaining state-of-the-art reasoning performance.
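Under the i.i.d. assumption stated in the abstract, the probability that at least one of N parallel samples meets a target is 1 - (1 - p)^N, which gives a closed-form minimum N for a desired confidence. The helper below is an illustrative back-of-the-envelope version of that bound, not OptScale itself (which estimates the prior parameters with a language-model-based predictor); p_single is an assumed per-sample success probability.

```python
import math

def min_samples(p_single: float, confidence: float) -> int:
    """Smallest N with 1 - (1 - p_single)**N >= confidence, where p_single is the
    estimated probability that a single i.i.d. sample meets the target."""
    if p_single <= 0.0:
        raise ValueError("target is unreachable with zero per-sample success probability")
    if p_single >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))

# Example: if a single sample succeeds with probability 0.3, about 9 samples
# are needed to reach 95% confidence that at least one sample succeeds.
print(min_samples(0.3, 0.95))
```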
Authors:Adiba Ejaz, Elias Bareinboim
Abstract:
Greedy Equivalence Search (GES) is a classic score-based algorithm for causal discovery from observational data. In the sample limit, it recovers the Markov equivalence class of graphs that describe the data. Still, it faces two challenges in practice: computational cost and finite-sample accuracy. In this paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that retains its theoretical guarantees while partially addressing these limitations. LGES modifies the greedy step: rather than always applying the highest-scoring insertion, it avoids edge insertions between variables for which the score implies some conditional independence. This more targeted search yields up to a \(10\)-fold speed-up and a substantial reduction in structural error relative to GES. Moreover, LGES can guide the search using prior assumptions, while correcting these assumptions when contradicted by the data. Finally, LGES can exploit interventional data to refine the learned observational equivalence class. We prove that LGES recovers the true equivalence class in the sample limit from observational and interventional data, even with misspecified prior assumptions. Experiments demonstrate that LGES outperforms GES and other baselines in speed, accuracy, and robustness to misspecified assumptions. Our code is available at https://github.com/CausalAILab/lges.
Chinese: 较少贪心等价搜索(LGES)是GES的改进版本,通过避免不必要的边插入来加速因果发现,在保持理论保证的同时实现了高达10倍的速度提升和更高的准确性。
English: Less Greedy Equivalence Search (LGES) is an improved variant of GES that speeds up causal discovery by avoiding unnecessary edge insertions, achieving up to a 10-fold acceleration and higher accuracy while maintaining theoretical guarantees.
Authors:Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang
Abstract:
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL's computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations, we introduce EFRame, an Exploration-Filter-Replay framework that systematically augments GRPO along three critical dimensions. EFRame performs additional rollouts to explore high-quality trajectories, applies online filtering to eliminate low-quality samples that introduce noise and variance, and leverages experience replay to repeatedly exploit rare but informative samples. EFRame establishes a complete and stable learning cycle, guiding the model through a structured transition from exploration to convergence. Our experiments across a variety of reasoning benchmarks demonstrate that EFRame not only improves the robustness and efficiency of training, but also enables access to deeper reasoning capabilities that remain unattainable under vanilla GRPO. Furthermore, EFRame not only enables fine-grained categorization of training samples for deeper insight into their contributions, but also introduces an efficient and precise mechanism for entropy control, which is critical for balancing exploration and convergence in RL training. Our code is available at https://github.com/597358816/EFRame.
中文: EFRame框架通过增强探索、过滤低质量样本和重放关键经验,提升了GRPO在复杂推理任务中的性能,在Geometry3K基准上相对改进37.9%,为语言模型的深度推理提供了稳健解决方案。
English: EFRame, an Exploration-Filter-Replay framework, enhances GRPO by enabling targeted exploration, stabilizing training through sample filtering, and amplifying rare trajectories, achieving a 37.9% improvement on Geometry3K and advancing reasoning in LLMs.
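A toy rendering of the exploration-filter-replay cycle: filter sampled rollouts by reward, stash rare high-reward trajectories in a replay buffer, and mix a few replayed samples back into the training batch. The thresholds, data structures, and field names here are invented for illustration; the actual method operates on GRPO rollout groups with advantage-based filtering.

```python
import random

def efr_step(rollouts, replay_buffer, reward_threshold=0.5, replay_k=4):
    """Toy exploration-filter-replay step over a list of {'response', 'reward'} dicts."""
    kept = [r for r in rollouts if r["reward"] >= reward_threshold]     # online filtering
    replay_buffer.extend(r for r in kept if r["reward"] >= 0.9)          # rare, informative samples
    replayed = random.sample(replay_buffer, min(replay_k, len(replay_buffer)))
    return kept + replayed                                               # batch passed to the RL update

buffer = []
rollouts = [{"response": f"traj-{i}", "reward": random.random()} for i in range(16)]
batch = efr_step(rollouts, buffer)
print(len(batch), len(buffer))
```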
Authors:Hyeongji Kim, Stine Hansen, Michael Kampffmeyer
Abstract:
Common prototype-based medical image few-shot segmentation (FSS) methods model foreground and background classes using class-specific prototypes. However, given the high variability of the background, a more promising direction is to focus solely on foreground modeling, treating the background as an anomaly -- an approach introduced by ADNet. Yet, ADNet faces three key limitations: dependence on a single prototype per class, a focus on binary classification, and fixed thresholds that fail to adapt to patient and organ variability. To address these shortcomings, we propose the Tied Prototype Model (TPM), a principled reformulation of ADNet with tied prototype locations for foreground and background distributions. Building on its probabilistic foundation, TPM naturally extends to multiple prototypes and multi-class segmentation while effectively separating non-typical background features. Notably, both extensions lead to improved segmentation accuracy. Finally, we leverage naturally occurring class priors to define an ideal target for adaptive thresholds, boosting segmentation performance. Taken together, TPM provides a fresh perspective on prototype-based FSS for medical image segmentation. The code can be found at https://github.com/hjk92g/TPM-FSS.
中文: TPM模型通过绑定前景和背景的原型位置,支持多类别分割和自适应阈值,有效解决了医学图像中因患者和器官差异导致的背景多变问题。
English: The Tied Prototype Model (TPM) improves upon ADNet by introducing tied prototypes for foreground and background, enabling multi-class segmentation and adaptive thresholds to address variability in medical images.
Authors:Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang
Abstract:
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
中文摘要:本文提出梯度保持激活缩放(GPAS)方法,通过缩放中间激活值同时保持梯度不变,解决Pre-LayerNorm Transformer中激活方差指数增长问题,在不同规模模型中均实现了性能提升。
English Summary: The paper introduces Gradient-Preserving Activation Scaling (GPAS), a technique that addresses the exponential activation variance growth in Pre-LayerNorm Transformers by scaling down intermediate activations while preserving gradients, achieving consistent performance improvements across various model sizes.
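The stated mechanism, scaling activations down while leaving their gradients untouched, can be written as a straight-through-style construction in PyTorch. This is a minimal sketch of that idea under the assumption of a fixed scale alpha; whether the paper uses a fixed, learned, or scheduled scale is not specified here.

```python
import torch

def gpas_scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    """Forward pass outputs alpha * x, but the backward pass treats the op as
    the identity, so gradients flow through unscaled."""
    return x + (alpha * x - x).detach()

x = torch.randn(4, requires_grad=True)
y = gpas_scale(x, 0.5)
y.sum().backward()
print(y.detach())  # activations scaled by 0.5
print(x.grad)      # all ones: the gradient is not downscaled
```

The point of the construction is visible in the printout: the forward value shrinks (controlling the variance growth across layers), while the gradient with respect to x stays at 1, avoiding the vanishing-gradient issue that plain downscaling would introduce.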
Authors:Lu Han, Yu Liu, Qiwen Deng, Jian Jiang, Yinbo Sun, Zhe Yu, Binfeng Wang, Xingyu Lu, Lintao Ma, Han-Jia Ye, De-Chuan Zhan
Abstract:
Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often heterogeneous covariates--such as categorical variables and multimodal data (e.g., images, text)--which are typically task-specific and difficult to leverage during pretraining. To address this gap, we propose Unified Covariate Adaptation (UniCA), a framework to bridge TSFMs with general covariate-aware forecasting. UniCA first performs covariate homogenization to transform heterogeneous covariates into high-level homogeneous series representations and then fuses them via a unified attention-based fusion mechanism. UniCA is universal and compatible with both homogeneous and heterogeneous covariates, incorporating extra covariate information while preserving the generalization ability of TSFMs. Extensive experiments on multiple unimodal and multimodal covariate-aware forecasting benchmarks demonstrate the superiority of UniCA, highlighting the promise of covariate-aware TSFM adaptation in real-world forecasting scenarios. Codes are released on https://github.com/hanlu-nju/UniCA.
中文: 时间序列基础模型虽在大规模预训练中表现出色,但难以处理多样化协变量,为此提出的UniCA框架通过同质化处理和融合机制,有效整合异构数据以提升预测性能,同时保持模型的泛化能力。
English: Time Series Foundation Models (TSFMs) excel in large-scale pretraining but struggle with diverse covariates, prompting the development of UniCA, a framework that homogenizes and fuses heterogeneous data to enhance forecasting while preserving TSFMs' generalization.
Authors:Kunjal Panchal, Sunav Choudhary, Yuriy Brun, Hui Guan
Abstract:
Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of a unified theoretical analysis. This work presents a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO methods. Our theoretical analysis shows that while FmAD and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing outperforms FmAD and ZO variants, including those enhanced with variance reduction, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8x fewer computations at comparable memory usage. Our results highlight fundamental limitations of FmAD and ZO, and reaffirm BP with checkpointing as the most effective strategy for model training under memory-constrained settings. Our code is available at https://github.com/Astuary/The_Cost_of_Avoiding_Backpropagation.
中文: 研究表明,尽管前向自动微分和零阶优化能降低内存使用,但与采用检查点的反向传播相比,它们在精度、收敛速度和计算效率上存在显著不足,后者仍是内存受限场景下最优的训练策略。
English: This study demonstrates that while forward-mode automatic differentiation and zero-order optimization reduce memory usage, they significantly compromise accuracy, convergence speed, and computational efficiency compared to backpropagation with checkpointing, which remains superior for memory-constrained model training.
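The comparison rests on two primitives that standard autodiff libraries already expose: activation checkpointing for memory-efficient backprop and Jacobian-vector products for forward-mode AD. A small PyTorch sketch of both follows, with an arbitrary MLP standing in for the models studied; it only illustrates the mechanics, not the paper's experimental setup.

```python
import torch
from torch.utils.checkpoint import checkpoint
from torch.func import jvp

layers = torch.nn.ModuleList(torch.nn.Linear(256, 256) for _ in range(8))

def block(x):
    for layer in layers:
        x = torch.relu(layer(x))
    return x

x = torch.randn(16, 256, requires_grad=True)

# Backprop with activation checkpointing: recompute activations to save memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Forward-mode AD: one JVP gives a single directional derivative; estimating a
# full gradient this way needs many random directions, which is the source of
# the accuracy and compute costs the paper analyzes.
v = torch.randn_like(x)
_, directional = jvp(block, (x.detach(),), (v,))
print(x.grad.shape, directional.shape)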
Authors:Rafael Sterzinger, Marco Peer, Robert Sablatnig
Abstract:
As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- & 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: https://github.com/RafaelSterzinger/few-shot-map-segmentation.
Chinese: 本研究提出了一种简单有效的历史地图少样本分割方法,通过结合大型视觉基础模型与参数高效微调,在基准数据集上实现了最优性能,同时大幅减少了对标注数据和可训练参数的需求。
English: This study introduces a simple yet effective few-shot segmentation method for historical maps by combining large vision foundation models with parameter-efficient fine-tuning, achieving state-of-the-art performance on benchmark datasets while requiring minimal annotated data and trainable parameters.
Authors:Fuying Wang, Jiacheng Xu, Lequan Yu
Abstract:
Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies three levels of cross-modal supervision, at the token, beat, and rhythm levels, to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is available at https://github.com/HKU-MedAI/MELP.
中文:MELP提出了一种新颖的多尺度心电-语言预训练模型,通过分层跨模态监督捕捉结构化心电信号与文本的对齐关系,在多种临床任务中优于现有自监督方法。
English: MELP introduces a novel multi-scale ECG-language pretraining model that leverages hierarchical cross-modal supervision to capture structured ECG-text alignments, outperforming existing self-supervised methods across diverse clinical tasks.
Authors:Tianrong Chen, Huangjie Zheng, David Berthelot, Jiatao Gu, Josh Susskind, Shuangfei Zhai
Abstract:
Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images but typically suffer from inefficient sampling. Many solver designs and noise scheduling strategies have been proposed to dramatically improve sampling speeds. In this paper, we introduce a new sampling method that is up to $186\%$ faster than the current state-of-the-art solver for comparative FID on ImageNet512. This new sampling method is training-free and uses an ordinary differential equation (ODE) solver. The key to our method resides in using higher-dimensional initial noise, allowing it to produce more detailed samples with fewer function evaluations from existing pretrained diffusion models. In addition, by design our solver allows control of the level of detail through a simple hyper-parameter at no extra computational cost. We present how our approach leverages momentum dynamics by establishing a fundamental equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms. Moreover, we observe that the use of higher-dimensional noise naturally exhibits characteristics similar to stochastic differential equations (SDEs). Finally, we demonstrate strong performances on a set of representative pretrained diffusion models, including EDM, EDM2, and Stable-Diffusion 3, which cover models in both pixel and latent spaces, as well as class and text conditional settings. The code is available at https://github.com/apple/ml-tada.
中文摘要:本文提出一种无需训练的采样方法,通过使用高维初始噪声和常微分方程求解器,将扩散模型的图像生成速度最高提升186%,并在多种预训练模型上实现优异性能且不增加计算成本。
English Summary: This paper introduces a training-free sampling method that accelerates diffusion model image generation by up to 186% through higher-dimensional initial noise and ODE solver techniques, achieving superior performance across multiple pretrained models without additional computational cost.
Authors:Yash Akhauri, Bryan Lewandowski, Cheng-Hsi Lin, Adrian N. Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, Sagi Perel, Xingyou Song
Abstract:
In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle on complex systems data in the wild such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google's massive compute cluster scheduling system, a 60M parameter encoder-decoder, trained from random initialization, achieves up to a near perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also easily adapts to new tasks in only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the model's inherent uncertainty quantification. These findings pave the way for universal simulators of real-world outcomes.
Chinese: 文本到文本回归作为一种可扩展且高效的替代方案,在预测如谷歌Borg等复杂系统的资源效率方面实现了近乎完美的准确性,并能以少量数据轻松适应新任务。
English: Text-to-text regression offers a scalable and effective alternative to traditional tabular methods, achieving near-perfect accuracy in predicting resource efficiency on complex systems like Google's Borg and adapting easily to new tasks with minimal data.
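Framing regression as text-to-text amounts to serializing the system description as the source sequence and the numeric outcome as the target string, then training an ordinary encoder-decoder with cross-entropy. The sketch below uses t5-small purely as a stand-in (the paper trains a 60M-parameter encoder-decoder from random initialization on Borg data), and the field names in the example input are invented.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# System description serialized as text; the numeric outcome is the target string.
source = "predict efficiency: job=search-indexing cpus=48 ram_gb=192 priority=200"
target = "0.73"

inputs = tok(source, return_tensors="pt")
labels = tok(target, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy loss; a training loop would minimize this.
loss = model(**inputs, labels=labels).loss
print(float(loss))
```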
Authors:Oron Nir, Jay Tenenbaum, Ariel Shamir
Abstract:
Density-based clustering methods often surpass centroid-based counterparts, when addressing data with noise or arbitrary data distributions common in real-world problems. In this study, we reveal a key property intrinsic to density-based clustering methods regarding the relation between the number of clusters and the neighborhood radius of core points - we empirically show that it is nearly unimodal, and support this claim theoretically in a specific setting. We leverage this property to devise new strategies for finding appropriate values for the radius more efficiently based on the Ternary Search algorithm. This is especially important for large scale data that is high-dimensional, where parameter tuning is computationally intensive. We validate our methodology through extensive applications across a range of high-dimensional, large-scale NLP, Audio, and Computer Vision tasks, demonstrating its practical effectiveness and robustness. This work not only offers a significant advancement in parameter control for density-based clustering but also broadens the understanding regarding the relations between their guiding parameters. Our code is available at https://github.com/oronnir/UnimodalStrategies.
中文摘要:本研究揭示了密度聚类中簇数量与核心点邻域半径存在近似单峰关系,基于此提出利用三分搜索算法高效确定最佳半径参数的方法,并在大规模高维NLP、音频和视觉任务中验证了其有效性与鲁棒性。
English Summary: This study identifies a nearly unimodal relationship between the number of clusters and neighborhood radius in density-based clustering, enabling more efficient parameter tuning via Ternary Search for large-scale high-dimensional data across NLP, audio, and vision tasks.
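Because the cluster-count-versus-radius curve is (nearly) unimodal, the radius can be located with a ternary search over eps instead of a dense grid sweep. Below is a sketch of that idea using scikit-learn's DBSCAN on synthetic blobs; the search bounds, iteration count, and min_samples are arbitrary choices, not the paper's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def n_clusters(X, eps, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    return len(set(labels)) - (1 if -1 in labels else 0)  # ignore the noise label

def ternary_search_eps(X, lo=0.05, hi=5.0, iters=20):
    """Exploit the near-unimodal cluster-count-vs-radius curve to locate the
    radius with the most clusters in O(log range) DBSCAN runs."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if n_clusters(X, m1) < n_clusters(X, m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

X, _ = make_blobs(n_samples=1000, centers=8, random_state=0)
eps = ternary_search_eps(X)
print(eps, n_clusters(X, eps))
```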
Authors:Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
Abstract:
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model's existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model's explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7\% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at https://github.com/597358816/View-R1.
中文: 本研究提出非对称策略优化(APO)方法,结合难度自适应散度塑形(DADS)和次优轨迹复杂度正则化(STCR),有效提升了多模态大语言模型的推理能力,在保持泛化性能的同时显著优于基准模型及更大规模模型。
English: This research introduces Asymmetric Policy Optimization (APO) with Difficulty-Adaptive Divergence Shaping (DADS) and Suboptimal Trajectory Complexity Regularization (STCR) to enhance multimodal reasoning in MLLMs, achieving significant performance gains while maintaining generalization across tasks.
Authors:Yixin Sun, Li Li, Wenke E, Amir Atapour-Abarghouei, Toby P. Breckon
Abstract:
Detecting traversable pathways in unstructured outdoor environments remains a significant challenge for autonomous robots, especially in critical applications such as wide-area search and rescue, as well as incident management scenarios like forest fires. Existing datasets and models primarily target urban settings or wide, vehicle-traversable off-road tracks, leaving a substantial gap in addressing the complexity of narrow, trail-like off-road scenarios. To address this, we introduce the Trail-based Off-road Multimodal Dataset (TOMD), a comprehensive dataset specifically designed for such environments. TOMD features high-fidelity multimodal sensor data -- including 128-channel LiDAR, stereo imagery, GNSS, IMU, and illumination measurements -- collected through repeated traversals under diverse conditions. We also propose a dynamic multiscale data fusion model for accurate traversable pathway prediction. The study analyzes the performance of early, cross, and mixed fusion strategies under varying illumination levels. Results demonstrate the effectiveness of our approach and the relevance of illumination in segmentation performance. We publicly release TOMD at https://github.com/yyyxs1125/TMOD to support future research in trail-based off-road navigation.
中文摘要:本研究针对非结构化户外环境中可通行路径检测的难题,推出了专门设计的越野小径多模态数据集TOMD和动态多尺度融合模型,通过不同光照条件下的实验验证了方法的有效性。
English Summary: The study introduces the Trail-based Off-road Multimodal Dataset (TOMD) and a dynamic multiscale fusion model to address autonomous robots' challenges in detecting traversable pathways in unstructured outdoor environments, demonstrating improved performance under varying illumination conditions.
Authors:Varun Mannam, Fang Wang, Xin Chen
Abstract:
Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that an optimal modality weighting of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
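The reported modality weighting (30% text, 15% image, 25% caption, 30% OCR) corresponds to a convex combination of per-modality retrieval scores. The snippet below illustrates that fusion step only; the score values and the retrieval functions that would produce them are placeholders, not the paper's pipeline.

```python
# Per-modality weights reported in the abstract.
WEIGHTS = {"text": 0.30, "image": 0.15, "caption": 0.25, "ocr": 0.30}

def fused_score(per_modality_scores: dict) -> float:
    """Weighted combination of per-modality relevance scores for one candidate."""
    return sum(WEIGHTS[m] * per_modality_scores.get(m, 0.0) for m in WEIGHTS)

candidate = {"text": 0.82, "image": 0.40, "caption": 0.65, "ocr": 0.77}
print(fused_score(candidate))
```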
Authors:Baqer M. Merzah, Tania Taami, Salman Asoudeh, Saeed Mirzaee, Amir reza Hossein pour, Amir Ali Bengari
Abstract:
Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. BioParsQA was also introduced to evaluate the proposed model, which consists of 5,231 Persian medical questions and answers. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLM in bioinformatics tasks. To our knowledge, BioPars is the first application of LLM in Persian medical QA, especially for generating long answers. Evaluation of four selected medical QA datasets shows that BioPars has achieved remarkable results compared to comparative approaches. The model on BioParsQA achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. The MoverScore and BLEURT values were also higher in this model than the other three models. In addition, the reported scores for the model are MoverScore=60.43 and BLEURT=50.78. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: https://github.com/amirap80/BioPars.
中文摘要:本研究提出BioPars框架用于评估大型语言模型在波斯语医学问答中的表现,结果显示其优于现有模型,同时揭示了模型在处理复杂生物医学推理方面的局限性。
English Summary: This study introduces BioPars, a framework for evaluating large language models in Persian medical question-answering, demonstrating superior performance over existing models while highlighting current limitations in handling complex biomedical reasoning.
Authors:Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
Abstract:
We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
Authors:Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Abstract:
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
Authors:Mohammed Baharoon, Jun Ma, Congyu Fang, Augustin Toma, Bo Wang
Abstract:
Multimodal Large Language Models (MLLMs) have emerged as a promising way to automate Radiology Report Generation (RRG). In this work, we systematically investigate the design space of 3D MLLMs, including visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques for 3D CT report generation. We also introduce two knowledge-based report augmentation methods that improve performance on the GREEN score by up to 10%, achieving the 2nd place on the MICCAI 2024 AMOS-MM challenge. Our results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely independent of the size of LLM under the same training protocol. We also show that larger volume size does not always improve performance if the original ViT was pre-trained on a smaller volume size. Lastly, we show that using a segmentation mask along with the CT volume improves performance. The code is publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
中文摘要:本研究系统探索了用于自动化CT报告生成的3D多模态大语言模型设计,提出的基于知识的报告增强方法在MICCAI 2024挑战赛中荣获第二名,并证明模型性能更取决于训练方案而非大语言模型规模。
English Summary: This study systematically explores the design of 3D multimodal large language models for automated CT report generation, introducing knowledge-based augmentation methods that achieved second place in a MICCAI 2024 challenge while demonstrating that model performance depends more on training protocols than LLM size.
Authors:Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
Abstract:
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.
Chinese: 本文介绍了skLEP,这是首个专为评估斯洛伐克自然语言理解模型设计的综合基准,包含九项多样化任务和原始数据集,通过对多种模型进行全面评估并公开所有资源,旨在推动该领域的未来研究。
English: This paper introduces skLEP, the first comprehensive benchmark for evaluating Slovak natural language understanding models, featuring nine diverse tasks and original datasets, along with an extensive evaluation of various models and the release of all resources to promote future research.
Authors:Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Abstract:
Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness (generating responses strictly supported by the context) is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less
中文摘要:增强大型语言模型的外部上下文可提升其自然语言处理性能,但确保回答严格基于上下文仍具挑战;本研究提出一种轻量级编码器模型,在昂贵的LLM生成答案前检测查询是否基于文档,能以极低延迟和资源消耗达到与先进LLMs相当的检测精度。
English Summary: Augmenting LLMs with external context boosts NLP performance, but ensuring grounded responses remains challenging; this study introduces a lightweight encoder model that detects query grounding in documents before costly LLM processing, achieving comparable accuracy to advanced LLMs while drastically cutting latency and resource use.
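Operationally, the detector is a binary sequence-pair classifier run over (query, document) before any generation happens. The sketch below wires this up with an off-the-shelf roberta-base classification head as a placeholder; without the task-specific fine-tuning described in the paper its scores are meaningless, so treat it as plumbing only, and the 0.5 threshold is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-base"  # in practice, a checkpoint fine-tuned for groundedness detection
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

query = "What year was the bridge completed?"
document = "The bridge opened to traffic in 1937 after four years of construction."

# Encode the (query, document) pair and score groundedness before generation.
inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
p_grounded = logits.softmax(dim=-1)[0, 1].item()

# Only invoke the expensive LLM when the query appears grounded in the document.
if p_grounded > 0.5:
    print("grounded: pass to the LLM for answer generation")
else:
    print("not grounded: skip generation / abstain")
```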
Authors:He Li, Haoang Chi, Mingyu Liu, Wanrong Huang, Liyang Xu, Wenjing Yang
Abstract:
The real world naturally has dimensions of time and space. Therefore, estimating counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion about the causal effect of conflicts on forest loss in Colombia. The source code is available at https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master.
中文: 本文提出了一种基于Transformer的新颖框架,用于估计具有时空特征的反事实结果,通过模拟实验和对哥伦比亚冲突影响的真实数据分析,证明了该方法优于基线模型的性能。
English: This paper introduces a novel Transformer-based framework for estimating counterfactual outcomes with spatial-temporal attributes, demonstrating superior performance over baseline methods through simulations and real-world experiments on conflict impacts in Colombia.
Authors:Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu
Abstract:
Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Recent CNN-Transformer (Conformer) hybrids partially address this issue, but most adopt a serial design, resulting in suboptimal integration of local and global features, and often overlook explicit channel-wise modeling. To address these limitations, we propose DBConformer, a dual-branch convolutional Transformer network tailored for EEG decoding. It integrates a temporal Conformer to model long-range temporal dependencies and a spatial Conformer to extract inter-channel interactions, capturing both temporal dynamics and spatial patterns in EEG signals. A lightweight channel attention module further refines spatial representations by assigning data-driven importance to EEG channels. Extensive experiments under four evaluation settings on three paradigms, including motor imagery, seizure detection, and steady-state visual evoked potential, demonstrated that DBConformer consistently outperformed 13 competitive baseline models, with over an eight-fold reduction in parameters than current high-capacity EEG Conformer architecture. Furthermore, the visualization results confirmed that the features extracted by DBConformer are physiologically interpretable and aligned with prior knowledge. The superior performance and interpretability of DBConformer make it reliable for accurate, robust, and explainable EEG decoding. Code is publicized at https://github.com/wzwvv/DBConformer.
中文:提出的DBConformer模型通过结合时序与空间Transformer及通道注意力机制,显著提升了EEG解码性能,在多种实验范式中均表现出优越的准确性和生理可解释性,同时大幅减少了参数量。
English: The proposed DBConformer model enhances EEG decoding by combining temporal and spatial Transformers with channel attention, achieving superior performance and interpretability across multiple paradigms while significantly reducing parameters.
Authors:Luosheng Xu, Dalin Zhang, Zhaohui Song
Abstract:
Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Efficient Global Self-Attention (EGSA) to effectively capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.
中文: 本研究提出轻量级遥感变化检测模型FlickCD,通过增强差异模块和局部-全局融合块在显著降低计算与存储开销的同时保持高精度,以最小资源消耗实现了最优性能。
English: This study introduces FlickCD, a lightweight remote sensing change detection model that significantly reduces computational and storage demands while maintaining high accuracy through its Enhanced Difference Module and Local-Global Fusion Blocks, achieving state-of-the-art performance with minimal resource consumption.
Authors:Tim Lawson, Laurence Aitchison
Abstract:
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
中文: 本文提出了一种新型Transformer架构,通过学习的门控机制动态跳过中间层的可变数量,但与层数更少的基线模型相比,该方法未能在效率与准确性的权衡中取得改进。
English: This paper introduces a novel Transformer architecture that dynamically skips variable numbers of middle layers using a learned gating mechanism, though it fails to improve the efficiency-accuracy trade-off compared to fewer-layer baselines.
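The core architectural move, a learned gate that bypasses a symmetric span of central blocks, can be sketched with a soft per-input gate over stock encoder layers. This is a toy version of the concept only: the gated attention over skipped positions, the sandwich/perilayernorm scheme, and the sparsity loss from the paper are omitted, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SkipMiddle(nn.Module):
    """Soft, input-dependent skipping of a symmetric span of central blocks."""

    def __init__(self, d_model=64, n_layers=8, skip_span=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        start = (n_layers - skip_span) // 2
        self.skip_range = range(start, start + skip_span)
        self.gate = nn.Linear(d_model, 1)  # per-sequence gate from mean-pooled states

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i in self.skip_range:
                g = torch.sigmoid(self.gate(x.mean(dim=1, keepdim=True)))  # (B, 1, 1)
                x = g * layer(x) + (1 - g) * x  # g near 0 effectively bypasses the block
            else:
                x = layer(x)
        return x

model = SkipMiddle()
out = model(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```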
Authors:Yann Kerzreho
Abstract:
This paper introduces a new approach for approximating the learning dynamics of multiple reinforcement learning (RL) agents interacting in a finite-state Markov game. The idea is to rescale the learning process by simultaneously reducing the learning rate and increasing the update frequency, effectively treating the agent's parameters as a slow-evolving variable influenced by the fast-mixing game state. Under mild assumptions (ergodicity of the state process and continuity of the updates), we prove the convergence of this rescaled process to an ordinary differential equation (ODE). This ODE provides a tractable, deterministic approximation of the agent's learning dynamics. An implementation of the framework is available at: https://github.com/yannKerzreho/MarkovGameApproximation
中文: 本文提出一种重标度学习方法,通过将智能体参数视为慢变变量来近似马尔可夫博弈中的多智能体强化学习动态,并在遍历性和连续性假设下证明了该方法会收敛到确定性常微分方程。
English: This paper presents a rescaled learning method that approximates multi-agent reinforcement learning dynamics in Markov games by treating agent parameters as slow variables, with proven convergence to a deterministic ODE under ergodicity and continuity assumptions.
Authors:Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
Abstract:
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their scalability in dynamic, evolving environments. To address these limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our method leverages hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the original corpus into hierarchical graph structures, enabling efficient and localized insertions of new data without disrupting the existing topology. The design eliminates the need for retraining or costly recomputation while preserving high retrieval accuracy and low latency. Experiments on large-scale benchmarks demonstrate that EraRAG achieves up to an order of magnitude reduction in update time and token consumption compared to existing Graph-RAG systems, while providing superior accuracy. This work offers a practical path forward for RAG systems that must operate over continually growing corpora, bridging the gap between retrieval efficiency and adaptability. Our code and data are available at https://github.com/EverM0re/EraRAG-Official.
中文摘要:EraRAG提出了一种动态图检索增强生成框架,通过分层图结构和局部敏感哈希分区实现对不断更新的文档库的高效维护,在无需全图重建的情况下显著提升了更新速度与检索精度。
English Summary: EraRAG introduces a dynamic Graph-RAG framework that enables efficient updates to evolving document corpora through hierarchical graph structures and LSH partitioning, achieving significant improvements in update speed and accuracy without full-graph reconstruction.
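The incremental-update property rests on hyperplane LSH: each chunk's embedding is hashed to a bucket by the sign pattern of its projections onto random hyperplanes, so a newly arrived document only touches the subgraph of its own bucket. Below is a minimal NumPy sketch of that bucketing step alone; the embedding dimension and plane count are arbitrary, and the hierarchical graph construction on top of the buckets is not shown.

```python
import numpy as np

def lsh_bucket(embeddings: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Assign each embedding a bucket id from the sign pattern of its projections
    onto random hyperplanes; nearby vectors tend to share a bucket."""
    bits = (embeddings @ hyperplanes.T) > 0              # (n, n_planes) sign pattern
    powers = 1 << np.arange(hyperplanes.shape[0])        # interpret the bits as an integer id
    return bits.astype(np.int64) @ powers

rng = np.random.default_rng(0)
dim, n_planes = 384, 8                                    # 8 planes -> up to 256 buckets
planes = rng.standard_normal((n_planes, dim))

corpus = rng.standard_normal((1000, dim))                 # stand-in for chunk embeddings
buckets = lsh_bucket(corpus, planes)

# Inserting a new chunk only requires updating the bucket it hashes to.
new_chunk = rng.standard_normal((1, dim))
print(int(lsh_bucket(new_chunk, planes)[0]))
```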
Authors:Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
Abstract:
Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing to novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose \textbf{AbMEGD}, an end-to-end framework integrating \textbf{M}ulti-scale \textbf{E}quivariant \textbf{G}raph \textbf{D}iffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13\% increase in amino acid recovery, a 3.32\% rise in improvement percentage, and a 0.062~Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD's ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: https://github.com/Patrick221215/AbMEGD.
中文: 提出的AbMEGD框架通过整合多尺度等变图扩散技术,实现了抗体序列与结构的协同设计,在氨基酸恢复率和结构精度方面均优于现有模型,为复杂抗原的抗体设计确立了新标准。
English: The proposed AbMEGD framework integrates multi-scale equivariant graph diffusion to overcome current limitations in antibody design by co-designing sequences and structures, achieving superior performance in amino acid recovery and structural accuracy compared to existing models.
Authors:Lucius Bushnaq, Dan Braun, Lee Sharkey
Abstract:
A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition -- a framework that has been proposed to resolve several issues with current decomposition methods -- decomposes neural network parameters into a sum of sparsely used vectors in parameter space. However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters. In this work, we introduce \textit{Stochastic Parameter Decomposition} (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose with APD. We also show that SPD avoids other issues, such as shrinkage of the learned parameters, and better identifies ground truth mechanisms in toy models. By bridging causal mediation analysis and network decomposition methods, this demonstration opens up new research possibilities in mechanistic interpretability by removing barriers to scaling linear parameter decomposition methods to larger models. We release a library for running SPD and reproducing our experiments at https://github.com/goodfire-ai/spd/tree/spd-paper.
中文: 本文提出随机参数分解(SPD)方法,相比现有技术更具可扩展性和鲁棒性,能够分解更复杂的神经网络模型,为机制可解释性研究开辟了新途径。
English: This paper introduces Stochastic Parameter Decomposition (SPD), a more scalable and robust method than previous approaches, enabling decomposition of larger neural networks and advancing mechanistic interpretability research.
Authors:Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
Abstract:
Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior as the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through \emph{in situ} intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at https://github.com/Hither1/systems-scaling.
Chinese: 采用块缩放精度格式(如微缩放MX)训练大型语言模型会因量化参数引入梯度偏差,导致损失出现随机不稳定性,但通过混合精度策略可有效稳定训练,实现与全精度方法相媲美的性能。
English: Training large language models with block-scaled precision formats like Microscaling (MX) introduces stochastic instabilities in loss, primarily due to gradient bias from quantized parameters, but hybrid precision strategies can stabilize training and achieve performance comparable to full-precision methods.
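The formats under study share one scale per small block of values. The toy routine below quantizes a tensor block-wise with a simple absmax scale to show where the rounding error the paper analyzes comes from; it is not the MX specification (which uses shared power-of-two exponents and specific low-bit element formats), and the block size and bit width are illustrative.

```python
import torch

def block_quantize(x: torch.Tensor, block_size: int = 32, bits: int = 8):
    """Toy block-scaled quantization: each block of `block_size` values shares one
    absmax-derived scale and is rounded to a low-precision integer grid.
    Returns the dequantized tensor so the rounding error can be inspected."""
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)

    qmax = 2 ** (bits - 1) - 1
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    q = torch.round(blocks / scales).clamp(-qmax, qmax)
    return (q * scales).flatten()[: x.numel()].view_as(x)

w = torch.randn(4, 70)
w_q = block_quantize(w, block_size=32, bits=4)
print((w - w_q).abs().mean())  # mean quantization error grows as bits shrink
```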
Authors:Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang
Abstract:
Protein structure prediction models such as AlphaFold3 (AF3) push the frontier of biomolecular modeling by incorporating science-informed architectural changes to the transformer architecture. However, these advances come at a steep system cost, introducing: compute- and memory-intensive operators, 2D attention mechanisms, and retrieval-augmented data pipelines, which collectively hinder the scalability of AF3 training. In this work, we present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and deep fusion for common and critical small operators in AF3. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by up to 1.23$\times$ and improves per-iteration training time by up to 1.73$\times$ and 1.62$\times$, respectively. More importantly, MegaFold enables training on 1.35$\times$ longer sequence lengths compared to PyTorch baselines without running out of memory, significantly improving the scalability of modern protein folding models. We open source our code at https://github.com/Supercomputing-System-AI-Lab/MegaFold/.
中文:MegaFold是一个跨平台系统,通过优化数据管道、注意力机制和算子融合来加速AlphaFold3训练,在英伟达和AMD GPU上显著提升了内存效率和训练速度。
English: MegaFold is a cross-platform system that accelerates AlphaFold3 training by optimizing data pipelines, attention mechanisms, and operator fusion, achieving significant memory and speed improvements on both NVIDIA and AMD GPUs.
Authors:Jacopo Dapueto, Vito Paolo Pastore, Nicoletta Noceti, Francesca Odone
Abstract:
Microscopy image analysis is fundamental for different applications, from diagnosis to synthetic engineering and environmental monitoring. Modern acquisition systems make it possible to collect ever-larger volumes of images, driving the development of a large collection of deep learning-based automatic image analysis methods. Although deep neural networks have demonstrated great performance in this field, interpretability, an essential requirement for microscopy image analysis, remains an open challenge.
This work proposes a Disentangled Representation Learning (DRL) methodology to enhance model interpretability for microscopy image classification. Exploiting benchmark datasets from three different microscopic image domains (plankton, yeast vacuoles, and human cells), we show how a DRL framework, based on transferring a representation learnt from synthetic data, can provide a good trade-off between accuracy and interpretability in this domain.
中文摘要:本研究提出了一种解耦表示学习方法,通过三个显微图像数据集验证了该方法能在保持分类精度的同时有效提升模型可解释性。
English Summary: This study introduces a Disentangled Representation Learning approach to improve interpretability in microscopy image classification, demonstrating through three datasets that it effectively balances accuracy with explainability.
Authors:Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang
Abstract:
Large language model-based machine learning (ML) agents have shown great promise in automating ML research. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent's ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a novel agent that excels at exchanging insights and developing novel solutions within a community context. CoMind achieves state-of-the-art performance on MLE-Live and outperforms 79.2% human competitors on average across four ongoing Kaggle competitions. Our code is released at https://github.com/comind-ml/CoMind.
中文: MLE-Live是一个评估机器学习代理在模拟研究社区中协作能力的框架,基于该框架开发的CoMind新型代理在Kaggle竞赛中表现出色,平均超越79.2%的人类参赛者。
English: MLE-Live is a framework for evaluating ML agents' ability to collaborate with a simulated research community, and CoMind, a novel agent developed on this framework, demonstrates superior performance by outperforming most human competitors in Kaggle competitions.
Authors:Manyi Li, Renshuai Tao, Yufan Liu, Chuangchuang Tan, Haotong Qin, Bing Li, Yunchao Wei, Yao Zhao
Abstract:
With the rapid advancement of deep learning, particularly through generative adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or ``deepfakes", have become nearly indistinguishable from real ones. These images are widely shared across Online Social Networks (OSNs), raising concerns about their misuse. Existing deepfake detection methods overlook the ``block effects" introduced by compression in OSNs, which obscure deepfake artifacts, and primarily focus on raw images, rarely encountered in real-world scenarios. To address these challenges, we propose PLADA (Pay Less Attention to Deceptive Artifacts), a novel framework designed to tackle the lack of paired data and the ineffective use of compressed images. PLADA consists of two core modules: Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to handle block effects, and Open Data Aggregation (ODA), which processes both paired and unpaired data to improve detection. Extensive experiments across 26 datasets demonstrate that PLADA achieves a remarkable balance in deepfake detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with limited paired data and compression. More importantly, this work introduces the ``block effect" as a critical factor in deepfake detection, providing a robust solution for open-world scenarios. Our code is available at https://github.com/ManyiLee/PLADA.
中文:提出的PLADA框架通过消除块效应并利用配对与非配对数据,有效解决了在线社交网络中压缩图像深度伪造检测的难题,在多种数据集上实现了卓越性能。
English: The proposed PLADA framework effectively addresses deepfake detection challenges in compressed images from online social networks by eliminating block effects and leveraging both paired and unpaired data, achieving superior performance across diverse datasets.
Authors:Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, Guoming Tang
Abstract:
The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents WattsOnAI, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, WattsOnAI offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, WattsOnAI encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable "Green AI" practices. The code is available at https://github.com/SusCom-Lab/WattsOnAI.
中文: 本文介绍了WattsOnAI工具包,它通过集成测量、分析和可视化AI工作负载的能耗与碳排放,解决了现有工具零散的问题,并借助标准化报告和关联分析推动可持续的绿色AI实践。
English: This paper introduces WattsOnAI, a comprehensive toolkit that addresses the fragmentation in existing tools by providing integrated measurement, analysis, and visualization of energy use and carbon emissions in AI workloads, promoting sustainable practices through standardized reporting and correlation analysis.
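For readers who want a feel for the kind of fine-grained, time-series measurement such a toolkit performs, the sketch below samples GPU power, utilization, and memory via NVML and integrates power into a rough energy estimate. This is not WattsOnAI's API (see the repository for that); it assumes an NVIDIA GPU and the `nvidia-ml-py` bindings.

```python
"""Minimal GPU power/utilization sampler -- an NVML-based sketch, NOT WattsOnAI's API."""
import csv
import time

import pynvml  # pip install nvidia-ml-py

def sample_gpu_metrics(duration_s=10.0, interval_s=0.5, out_path="gpu_metrics.csv"):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    rows = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        rows.append({
            "timestamp": time.time(),
            "power_w": power_w,
            "gpu_util_pct": util.gpu,
            "mem_used_mb": mem.used / 2**20,
        })
        time.sleep(interval_s)
    pynvml.nvmlShutdown()

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    # Approximate energy by integrating power over the sampling interval.
    energy_j = sum(r["power_w"] for r in rows) * interval_s
    print(f"~{energy_j:.1f} J over {duration_s:.0f} s ({len(rows)} samples)")

if __name__ == "__main__":
    sample_gpu_metrics()
```

Exporting such samples alongside training-step timestamps is what makes correlation analysis between hardware metrics and model performance possible.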
Authors:Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
中文摘要:ReCode是一种新颖的强化学习框架,通过基于版本迁移数据训练大语言模型,显著提升其在动态API环境中的代码生成可靠性,同时对其通用编程能力影响较小。
English Summary: ReCode is a novel reinforcement learning framework that significantly enhances large language models' ability to generate reliable code in dynamic API environments by training them on version migration data while minimizing impact on their general coding capabilities.
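The abstract describes the RL reward as a modified string similarity metric over generated code, without giving the exact formula. A hedged stand-in is sketched below: `difflib`'s sequence-matcher ratio against the reference migration, gated on the output being parseable Python. The pandas migration strings are made-up examples, and this is not ReCode's actual metric.

```python
import ast
import difflib

def code_similarity_reward(generated: str, reference: str) -> float:
    """Illustrative reward for code-update RL: string similarity to the reference
    migration, gated on the generation being parseable Python. The actual ReCode
    reward is a modified similarity metric; this is only a sketch."""
    try:
        ast.parse(generated)
    except SyntaxError:
        return 0.0  # unparseable code earns no reward
    return difflib.SequenceMatcher(None, generated, reference).ratio()

old_api = "df.append(row, ignore_index=True)"
migrated = "pd.concat([df, pd.DataFrame([row])], ignore_index=True)"
print(code_similarity_reward(migrated, migrated))            # 1.0
print(round(code_similarity_reward(old_api, migrated), 2))   # partial credit
```

A dense, continuous reward like this is what lets rule-based RL algorithms such as GRPO or DAPO give partial credit to near-correct migrations instead of a binary pass/fail signal.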
Authors:Francesco Carzaniga, Michael Hersche, Abu Sebastian, Kaspar Schindler, Abbas Rahimi
Abstract:
Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks, particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the SWEC iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in several iEEG tasks. MVPFormer surpasses state-of-the-art Transformer baselines in seizure detection across the SWEC, the MAYO, and the FNUSA datasets, while also achieving state-of-the-art performance on four Brain TreeBank iEEG decoding tasks. We further validate MVPA on standard time-series forecasting and classification tasks, where it matches or exceeds the performance of existing attention-based models. Together, our contributions establish MVPA as a general-purpose attention mechanism for heterogeneous time-series and MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with SOTA clinical performance. The code is available at https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG dataset is available at https://huggingface.co/datasets/NeuroTec/SWEC_iEEG_Dataset.
中文: 本研究提出多元并行注意力机制和MVPFormer模型,能有效处理异构时间序列数据如颅内脑电图,在多项临床任务和数据集上实现顶尖性能,并发布了最大的公开iEEG数据集以推动相关研究。
English: The study introduces a multi-variate parallel attention (MVPA) mechanism and MVPFormer model, which effectively handle heterogeneous time-series data like iEEG, achieving state-of-the-art performance across multiple clinical tasks and datasets while releasing the largest public iEEG dataset to support further research.
Authors:Kejia Chen, Jiawen Zhang, Jiacong Hu, Yu Wang, Jian Lou, Zunlei Feng, Mingli Song
Abstract:
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experimental results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page is available at: https://github.com/Thecommonirin/Qresafe.
中文: 本文对量化大语言模型进行了全面的安全评估,并提出Q-resafe这一量化感知安全补丁框架,能在不影响实用性的前提下有效恢复模型的安全防护能力。
English: This paper introduces comprehensive safety evaluations of quantized large language models (LLMs) and proposes Q-resafe, a quantization-aware safety patching framework that effectively restores safety capabilities without compromising utility.
Authors:Hirad Daneshvar, Reza Samavi
Abstract:
Graph Neural Networks (GNNs) have shown remarkable performance in the healthcare domain. However, quantifying the predictive uncertainty of GNNs, an important aspect of trustworthiness in clinical settings, remains challenging. While Bayesian and ensemble methods can be used to quantify uncertainty, they are computationally expensive. Additionally, the disagreement metric used by ensemble methods to compute uncertainty cannot capture the diversity of models in an ensemble network. In this paper, we propose a novel method, based on knowledge distillation, to quantify GNNs' uncertainty more efficiently and with higher precision. We apply self-distillation, where the same network serves as both the teacher and student models, thereby avoiding the need to train several networks independently. To ensure the impact of self-distillation, we develop an uncertainty metric that captures the diverse nature of the network by assigning different weights to each GNN classifier. We experimentally evaluate the precision, performance, and ability of our approach in distinguishing out-of-distribution data on two graph datasets: MIMIC-IV and Enzymes. The evaluation results demonstrate that the proposed method can effectively capture the predictive uncertainty of the model while having performance similar to that of the MC Dropout and ensemble methods. The code is publicly available at https://github.com/tailabTMU/UQ_GNN.
中文: 本文提出了一种基于知识蒸馏的新方法,通过自蒸馏和加权不确定性指标来高效量化图神经网络的预测不确定性,在医疗数据集上取得了与现有方法相当的性能。
English: This paper introduces a knowledge distillation-based method that efficiently quantifies Graph Neural Networks' predictive uncertainty using self-distillation and a weighted uncertainty metric, achieving comparable performance to existing methods on healthcare datasets.
Authors:Salva Rühling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu
Abstract:
Diffusion models are a powerful tool for probabilistic forecasting, yet most applications in high-dimensional chaotic systems predict future snapshots one-by-one. This common approach struggles to model complex temporal dependencies and fails to explicitly account for the progressive growth of uncertainty inherent to such systems. While rolling diffusion frameworks, which apply increasing noise to forecasts at longer lead times, have been proposed to address this, their integration with state-of-the-art, high-fidelity diffusion techniques remains a significant challenge. We tackle this problem by introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to successfully unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM components -- its noise schedule, network preconditioning, and Heun sampler -- to the rolling forecast setting. The success of this integration is driven by three key contributions: (i) a novel loss weighting scheme that focuses model capacity on the mid-range forecast horizons where determinism gives way to stochasticity; (ii) an efficient initialization strategy using a pre-trained EDM for the initial window; and (iii) a bespoke hybrid sequence architecture for robust spatiotemporal feature extraction under progressive denoising. On 2D Navier-Stokes simulations and ERA5 global weather forecasting at $1.5^\circ$ resolution, ERDM consistently outperforms key diffusion-based baselines, including conditional autoregressive EDM. ERDM offers a flexible and powerful general framework for tackling diffusion-based sequence generation problems where modeling escalating uncertainty is paramount. Code is available at: https://github.com/salvaRC/erdm
Chinese: 作者提出了Elucidated Rolling Diffusion Models (ERDM)框架,通过将滚动预测结构与精细化扩散模型相结合,有效解决了混沌系统中不确定性递增的建模难题,在流体模拟和全球天气预报任务中均优于现有基线方法。
English: The authors introduce Elucidated Rolling Diffusion Models (ERDM), a novel framework that integrates rolling forecast mechanisms with advanced diffusion techniques to better model escalating uncertainty in chaotic systems, outperforming existing methods in both fluid dynamics simulations and global weather forecasting.
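A rolling diffusion forecast assigns more noise to longer lead times within the forecast window. The sketch below shows one way to do this with EDM-style log-spaced noise levels, plus a bump-shaped loss weight that emphasizes mid-range horizons as the abstract describes; both the schedule and the weight shape are assumptions, not ERDM's exact choices.

```python
import numpy as np

def rolling_sigmas(window=8, sigma_min=0.002, sigma_max=80.0):
    """Assign a larger EDM-style noise level to longer lead times inside the
    rolling window (log-spaced between sigma_min and sigma_max). Illustrative only."""
    t = np.arange(window) / max(window - 1, 1)           # lead time in [0, 1]
    return sigma_min * (sigma_max / sigma_min) ** t

def mid_horizon_weights(window=8, center=0.5, width=0.25):
    """A bump-shaped loss weight emphasizing mid-range lead times, where the
    abstract says determinism gives way to stochasticity. The exact weighting in
    ERDM is not specified here; this shape is an assumption."""
    t = np.arange(window) / max(window - 1, 1)
    w = np.exp(-0.5 * ((t - center) / width) ** 2)
    return w / w.sum()

print(np.round(rolling_sigmas(), 3))
print(np.round(mid_horizon_weights(), 3))
```

Near frames thus stay almost clean while far frames remain heavily noised, so each denoising step simultaneously sharpens the near future and only coarsely constrains the far future.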
Authors:Haochen Zhang, Tianyi Zhang, Junze Yin, Oren Gal, Anshumali Shrivastava, Vladimir Braverman
Abstract:
Recommender systems play a pivotal role in providing relevant content to users. With the rapid development of large language models (LLMs), researchers have begun utilizing LLMs to build more powerful recommender systems. However, existing approaches that focus on aligning LLMs with recommendation tasks do not fully leverage their sequential information processing capabilities, leading to suboptimal performance.
In this paper, we propose a novel system called compressed vocabulary expansion (CoVE). In CoVE, each item is assigned a unique ID within the expanded vocabulary. Our framework effectively capitalizes on sequence understanding abilities of LLMs, significantly enhancing their performance on recommendation tasks. Additionally, we compress the embedding layer, making CoVE practical for large-scale industrial applications. The effectiveness and performance of CoVE are demonstrated through comprehensive experiments on multiple recommendation datasets and comparisons with prior works. Our code can be found at https://github.com/HaochenZhang717/CoVE-official-Repo.
中文: 本文提出CoVE系统,通过压缩词汇扩展和嵌入层压缩,充分利用大语言模型的序列理解能力来增强推荐系统性能,在多数据集实验中展现出卓越效果。
English: This paper introduces CoVE, a novel system that enhances recommender systems by leveraging large language models' sequential processing capabilities through compressed vocabulary expansion and embedding layer compression, demonstrating superior performance in comprehensive experiments.
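A minimal PyTorch sketch of the two ingredients named in the abstract follows: each item gets a unique ID in an expanded vocabulary, and the item-embedding table is compressed, here with a low-rank factorization (an assumed compression scheme, not necessarily CoVE's). The vocabulary size, item count, and rank are illustrative.

```python
import torch
import torch.nn as nn

class CompressedItemEmbedding(nn.Module):
    """Low-rank item embedding: num_items x rank factors projected to d_model.
    One plausible form of a 'compressed' expanded vocabulary; not CoVE's exact design."""
    def __init__(self, num_items: int, d_model: int, rank: int = 32):
        super().__init__()
        self.factors = nn.Embedding(num_items, rank)
        self.proj = nn.Linear(rank, d_model, bias=False)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.factors(item_ids))

vocab_size, num_items, d_model = 32_000, 50_000, 768
base_embed = nn.Embedding(vocab_size, d_model)        # original LLM token embeddings
item_embed = CompressedItemEmbedding(num_items, d_model)

def embed_tokens(ids: torch.Tensor) -> torch.Tensor:
    """IDs >= vocab_size index the expanded item vocabulary."""
    is_item = ids >= vocab_size
    out = torch.empty(*ids.shape, d_model)
    out[~is_item] = base_embed(ids[~is_item])
    out[is_item] = item_embed(ids[is_item] - vocab_size)
    return out

seq = torch.tensor([[101, 2024, vocab_size + 7, vocab_size + 42]])  # text tokens + item IDs
print(embed_tokens(seq).shape)  # torch.Size([1, 4, 768])
```

With these toy sizes the factorized table stores roughly 50,000*32 + 32*768 parameters instead of 50,000*768, about a 23x reduction for the item vocabulary.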
Authors:Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma
Abstract:
Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups ($\sim25\times$) and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at https://github.com/scxue/AO-GPT-MDM.
中文摘要:本研究在仅解码器架构下比较掩码扩散模型与自回归模型,发现掩码扩散模型能实现显著生成加速且保持性能,同时强调了区分核心范式差异与架构影响的重要性。
English Summary: This study compares masked diffusion models (MDMs) with autoregressive models using a decoder-only architecture, revealing that MDMs can achieve significant generation speedups while maintaining performance, and highlights the importance of separating paradigm differences from architectural influences.
Authors:Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
Abstract:
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
Chinese: 本文提出径向注意力机制,通过利用时空能量衰减现象,在保持视频质量的同时显著降低了视频扩散模型的计算复杂度,实现了高效的训练和推理。
English: This paper introduces Radial Attention, a scalable sparse attention mechanism that reduces computational complexity in video diffusion models by leveraging spatiotemporal energy decay, achieving significant efficiency gains while maintaining video quality.
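The mechanism is a static mask in which each token attends to spatially nearby tokens and the spatial window shrinks as temporal distance grows. The sketch below builds such a mask for a toy video of a few frames; the exact window-decay law (here, halving per frame of separation) is an assumption rather than the paper's formula.

```python
import numpy as np

def radial_mask(n_frames: int, n_space: int, base_window: int = 8) -> np.ndarray:
    """Static attention mask: query (f, s) may attend to key (f2, s2) iff the spatial
    offset fits in a window that shrinks with temporal distance |f - f2|.
    Assumed window law; the paper's exact mask construction may differ."""
    n = n_frames * n_space
    mask = np.zeros((n, n), dtype=bool)
    for f in range(n_frames):
        for f2 in range(n_frames):
            dt = abs(f - f2)
            window = base_window >> dt          # halve the window per frame of separation
            if window == 0:
                continue
            for s in range(n_space):
                lo, hi = max(0, s - window), min(n_space, s + window + 1)
                q = f * n_space + s
                mask[q, f2 * n_space + lo:f2 * n_space + hi] = True
    return mask

m = radial_mask(n_frames=6, n_space=16)
print(f"density: {m.mean():.2%} of a dense {m.shape[0]}x{m.shape[0]} attention map")
```

Because the mask is static, it can be precomputed once and reused at every layer and denoising step, which is what makes the sparsity pattern cheap to exploit.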
Authors:Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
Abstract:
Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling -- making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
Chinese: 前瞻推理通过引入步骤级并行性改进了推测解码,允许草稿模型提出多个未来推理步骤并验证其语义正确性,与令牌级并行性相乘,从而在不影响答案质量的前提下显著提升了解码速度。
English: Lookahead Reasoning enhances speculative decoding by introducing step-level parallelism, allowing a draft model to propose multiple future reasoning steps and verifying their semantic correctness, which multiplies with token-level parallelism to significantly boost decoding speed without compromising answer quality.
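The step-level loop can be summarized as: the draft model proposes several future reasoning steps, the target model expands every candidate prefix in one batched pass, and a verifier accepts draft steps while they remain semantically equivalent to the target's. The control-flow sketch below uses hypothetical placeholder callables (`draft_step`, `target_steps`, `semantically_equivalent`) and is not the released implementation.

```python
from typing import Callable, List

def lookahead_reasoning(
    prefix: List[str],
    draft_step: Callable[[List[str]], str],                  # cheap model: propose next step
    target_steps: Callable[[List[List[str]]], List[str]],    # target model, one batched call
    semantically_equivalent: Callable[[str, str], bool],     # verifier
    lookahead: int = 4,
    max_steps: int = 32,
) -> List[str]:
    """Control-flow sketch of step-level speculation; all three callables are
    hypothetical placeholders, not the paper's released code."""
    steps = list(prefix)
    while len(steps) < max_steps:
        # 1. Draft proposes a chain of future steps.
        drafts, ctx = [], list(steps)
        for _ in range(lookahead):
            d = draft_step(ctx)
            drafts.append(d)
            ctx.append(d)
        # 2. Target expands every candidate prefix in one batched pass.
        prefixes = [steps + drafts[:i] for i in range(lookahead)]
        targets = target_steps(prefixes)
        # 3. Accept draft steps while they match the target semantically.
        accepted = 0
        for d, t in zip(drafts, targets):
            if semantically_equivalent(d, t):
                accepted += 1
            else:
                break
        steps.extend(drafts[:accepted])
        # Always take one guaranteed target step so progress never stalls.
        steps.append(targets[accepted] if accepted < lookahead else target_steps([steps])[0])
        if steps[-1].strip().lower().startswith("final answer"):
            break
    return steps
```

Within each accepted or regenerated step, ordinary token-level speculative decoding can still run, which is why the two sources of speedup multiply.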
Authors:Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the severe hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
Chinese: 针对慢思考大语言模型的严重幻觉问题,我们提出KnowRL方法,通过知识增强的强化学习引入事实性奖励机制,引导模型进行基于事实的慢思考,在保持推理能力的同时有效减少错误输出。
English: To address severe hallucinations in slow-thinking Large Language Models, we propose KnowRL, a knowledge-enhanced reinforcement learning method that integrates factuality rewards to guide fact-based reasoning and reduce incorrect outputs while preserving reasoning capabilities.
Authors:Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen
Abstract:
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.
中文摘要:本研究通过揭示战略规划是性能关键驱动因素,开发了一种数据合成方法,显著提升了开源大语言模型的分析推理能力。
English Summary: This study enhances open-source LLMs' data analysis capabilities by identifying strategic planning as the key performance driver and developing a data synthesis method that significantly improves analytical reasoning.
Authors:Yuhui Sun, Xiyao Wang, Zixi Li, Zhenlong Yuan, Jinman Zhao
Abstract:
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences. Reinforcement Learning from Human Feedback (RLHF) addresses this by optimizing models toward human preferences using a learned reward function and reinforcement learning, yielding improved alignment but suffering from high computational cost and instability. Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs, reducing training overhead while achieving competitive performance. However, it assumes fixed, single-dimensional preferences and only supports pairwise supervision.
To address these limitations, we propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback and flexibly balance multiple goals such as helpfulness, honesty, and fluency. Our method models full-ranked preference distributions rather than binary comparisons, enabling more informative learning signals. The lambda vector controls the relative importance of different alignment goals, allowing the model to generalize across diverse human objectives. During inference, lambda can be adjusted without retraining, providing controllable alignment behavior for downstream use. We also introduce a learned scheduler that dynamically samples performant lambda configurations to improve robustness.
Notably, our method requires only 20GB of GPU memory for training, making it suitable for compute-constrained settings such as academic labs, educational tools, or on-device assistants. Experiments on 1B-2B scale models show that our method consistently outperforms standard DPO on alignment benchmarks while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
中文: 提出的多偏好Lambda加权列表化DPO方法通过列表化偏好建模和可调节的lambda向量,使语言模型能从更细致的人类反馈中学习并灵活平衡多个对齐目标,在低计算需求下实现更优性能。
English: The proposed Multi-Preference Lambda-weighted Listwise DPO method enables language models to learn from detailed human feedback and flexibly balance multiple alignment goals through listwise preference modeling and adjustable lambda vectors, achieving superior performance with low computational requirements.
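One plausible reading of the objective, sketched below under stated assumptions: per-response implicit rewards from policy/reference log-probabilities (as in DPO), a listwise softmax over the ranked responses, and a target ranking distribution formed by lambda-weighting per-objective preference scores. The tensor shapes and the target construction are assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def lambda_listwise_dpo_loss(
    logp_policy: torch.Tensor,   # (batch, k) summed log-probs of k responses under the policy
    logp_ref: torch.Tensor,      # (batch, k) same responses under the frozen reference model
    pref_scores: torch.Tensor,   # (batch, k, m) human preference scores for m objectives
    lam: torch.Tensor,           # (m,) non-negative objective weights summing to 1
    beta: float = 0.1,
) -> torch.Tensor:
    """A plausible lambda-weighted listwise DPO objective (a sketch, not the paper's
    exact loss): cross-entropy between a lambda-weighted target ranking distribution
    and the distribution implied by the implicit rewards."""
    implicit_reward = beta * (logp_policy - logp_ref)          # (batch, k)
    model_dist = F.log_softmax(implicit_reward, dim=-1)        # model's ranking distribution
    # Blend per-objective scores with lambda, then normalize over the response list.
    blended = (pref_scores * lam.view(1, 1, -1)).sum(dim=-1)   # (batch, k)
    target_dist = F.softmax(blended, dim=-1)
    return -(target_dist * model_dist).sum(dim=-1).mean()

batch, k, m = 4, 5, 3                       # 5 ranked responses, 3 alignment objectives
loss = lambda_listwise_dpo_loss(
    torch.randn(batch, k), torch.randn(batch, k),
    torch.rand(batch, k, m), torch.tensor([0.5, 0.3, 0.2]),
)
print(loss.item())
```

Because `lam` only enters through the target distribution, changing it at inference or evaluation time re-weights the alignment goals without touching the trained policy, which matches the controllability claim in the abstract.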
Authors:Yihong Luo, Shuchen Xue, Tianyang Hu, Jing Tang
Abstract:
The pursuit of efficient and controllable high-quality content generation remains a central challenge in artificial intelligence-generated content (AIGC). While one-step generators, enabled by diffusion distillation techniques, offer excellent generation quality and computational efficiency, adapting them to new control conditions--such as structural constraints, semantic guidelines, or external inputs--poses a significant challenge. Conventional approaches often necessitate computationally expensive modifications to the base model and subsequent diffusion distillation. This paper introduces Noise Consistency Training (NCT), a novel and lightweight approach to directly integrate new control signals into pre-trained one-step generators without requiring access to original training images or retraining the base diffusion model. NCT operates by introducing an adapter module and employs a noise consistency loss in the noise space of the generator. This loss aligns the adapted model's generation behavior across noises that are conditionally dependent to varying degrees, implicitly guiding it to adhere to the new control. Theoretically, this training objective can be understood as minimizing the distributional distance between the adapted generator and the conditional distribution induced by the new conditions. NCT is modular, data-efficient, and easily deployable, relying only on the pre-trained one-step generator and a control signal model. Extensive experiments demonstrate that NCT achieves state-of-the-art controllable generation in a single forward pass, surpassing existing multi-step and distillation-based methods in both generation quality and computational efficiency. Code is available at https://github.com/Luo-Yihong/NCT
中文: 本文提出的噪声一致性训练(NCT)是一种轻量级方法,可在无需重新训练的情况下将新控制信号集成到预训练单步生成器中,实现了在生成质量和计算效率方面均超越现有方法的最优可控生成效果。
English: This paper introduces Noise Consistency Training (NCT), a lightweight method that enables pre-trained one-step generators to incorporate new control signals without retraining, achieving state-of-the-art controllable generation with superior quality and efficiency.
Authors:Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
Abstract:
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
中文: 离群值安全预训练(OSP)通过三项关键创新主动预防大语言模型中的极端激活离群值,在激进4位量化下实现卓越性能且仅增加2%训练开销,从根本上改变了模型量化行为。
English: Outlier-Safe Pre-Training (OSP) is a novel training strategy that proactively prevents extreme activation outliers in LLMs through three key innovations, enabling superior 4-bit quantization performance with minimal training overhead and fundamentally changing LLM deployment efficiency.
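The abstract names "Single-Scale RMSNorm" as one ingredient; a plausible reading is RMSNorm with a single learnable scalar gain in place of the per-channel weight vector, so no individual channel can be selectively amplified. The sketch below shows that reading alongside the excess-kurtosis diagnostic the abstract reports; it is an interpretation, not the released implementation.

```python
import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    """RMSNorm with one learnable scalar gain instead of a per-channel weight vector --
    one plausible reading of 'Single-Scale RMSNorm', which removes the per-channel
    affine parameters that can amplify individual channels."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.tensor(1.0))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms

def excess_kurtosis(x: torch.Tensor) -> float:
    """Outlier diagnostic reported in the abstract: near 0 for Gaussian-like
    activations, very large when a few values are extreme."""
    z = (x.flatten().float() - x.mean()) / x.std()
    return (z.pow(4).mean() - 3.0).item()

acts = torch.randn(1024, 512)
print(f"excess kurtosis, well-behaved activations: {excess_kurtosis(acts):.2f}")
acts[:, 7] *= 100.0  # a single outlier channel, as seen in standard pre-training
print(f"excess kurtosis, with outlier channel:     {excess_kurtosis(acts):.1f}")
y = SingleScaleRMSNorm()(torch.randn(2, 16, 512))  # drop-in replacement for RMSNorm
print(y.shape)
```

The diagnostic makes the abstract's contrast concrete: near-zero excess kurtosis means activations quantize cleanly, whereas values in the hundreds or thousands signal the outlier channels that break low-bit formats.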
Authors:Mihnea Ghitu, Vihari Piratla, Matthew Wicker
Abstract:
Controlling the patterns a model learns is essential to preventing reliance on irrelevant or misleading features. Such reliance on irrelevant features, often called shortcut features, has been observed across domains, including medical imaging and natural language processing, where it may lead to real-world harms. A common mitigation strategy leverages annotations (provided by humans or machines) indicating which features are relevant or irrelevant. These annotations are compared to model explanations, typically in the form of feature salience, and used to guide the loss function during training. Unfortunately, recent works have demonstrated that feature salience methods are unreliable and therefore offer a poor signal to optimize. In this work, we propose a simplified objective that simultaneously optimizes for explanation robustness and mitigation of shortcut learning. Unlike prior objectives with similar aims, we demonstrate theoretically why our approach ought to be more effective. Across a comprehensive series of experiments, we show that our approach consistently reduces test-time misclassifications by 20% compared to state-of-the-art methods. We also extend prior experimental settings to include natural language processing tasks. Additionally, we conduct novel ablations that yield practical insights, including the relative importance of annotation quality over quantity. Code for our method and experiments is available at: https://github.com/Mihneaghitu/ModelGuidanceViaRobustFeatureAttribution.
Chinese: 本研究提出了一种简化目标,通过增强解释鲁棒性并缓解捷径学习,理论上证明了其有效性,实验表明测试时误分类减少20%,同时扩展至自然语言处理任务,并提供了关于标注质量优于数量的实用见解。
English: This study introduces a streamlined objective that enhances explanation robustness and mitigates shortcut learning, theoretically justifying its effectiveness and demonstrating a 20% reduction in test-time misclassifications across experiments, while also extending evaluations to NLP tasks and providing practical insights on annotation quality.
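The abstract does not spell out the simplified objective. For orientation, the sketch below shows the common baseline this line of work builds on: a task loss plus a penalty on input-gradient salience over features annotated as irrelevant ("right for the right reasons"-style guidance). It is explicitly not the paper's proposed objective, which additionally optimizes for explanation robustness.

```python
import torch
import torch.nn.functional as F

def guided_loss(model, x, y, irrelevant_mask, lam=1.0):
    """Baseline model-guidance loss (NOT the paper's objective): task loss plus a
    penalty on input-gradient salience over features annotated as irrelevant."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    # Salience = gradient of the summed correct-class logit w.r.t. the input.
    correct_logit = logits.gather(1, y.unsqueeze(1)).sum()
    grads, = torch.autograd.grad(correct_logit, x, create_graph=True)
    penalty = (grads * irrelevant_mask).pow(2).sum(dim=tuple(range(1, x.dim()))).mean()
    return task_loss + lam * penalty

model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
mask = torch.zeros(8, 20)
mask[:, 15:] = 1.0        # last 5 features annotated as shortcut/irrelevant
loss = guided_loss(model, x, y, mask)
loss.backward()           # differentiable end-to-end thanks to create_graph=True
print(loss.item())
```

The paper's argument is that salience signals like this gradient term are unreliable on their own, which is why its objective couples the guidance with explanation robustness rather than penalizing raw salience alone.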
Authors:Alan N. Amin, Andres Potapczynski, Andrew Gordon Wilson
Abstract:
To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
中文摘要:遗传学家利用大规模基因组数据构建预测模型以关联基因变异与表型,但面临矩阵求逆的计算瓶颈;DeepWAS采用快速线性代数方法训练更大的神经网络,通过全似然优化提升预测性能,有望改进疾病预测和治疗靶点识别。
English Summary: Geneticists use large-scale genomic data to build predictive models linking genetic variants to phenotypes, but face computational bottlenecks from matrix inversions, which DeepWAS overcomes using fast linear algebra to train larger neural networks that improve predictions with full likelihood optimization.
Authors:Tao Huang, Zhekun Liu, Rui Wang, Yang Zhang, Liping Jing
Abstract:
Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs--a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods. Firstly, we observe that existing visual-centric hallucination benchmarks mainly assess LVLMs from a perception perspective, overlooking hallucinations arising from advanced reasoning capabilities. We develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset, which enables the systematic evaluation of both perception and reasoning capabilities of LVLMs across multiple visual semantics, such as instances, scenes, and relations. Comprehensive evaluation with this new benchmark exposed more visual vulnerabilities, particularly in the more challenging task of relation reasoning. To address this issue, we propose, to the best of our knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination detection method for LVLMs through uncertainty estimation. This method aims to efficiently capture the degree of conflict in high-level features at the model inference phase. Specifically, our approach employs simple mass functions to mitigate the computational complexity of evidence combination on power sets. We conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5, mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results indicate that our method outperforms five baseline uncertainty metrics, achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our code is available at https://github.com/HT86159/Evidential-Conflict.
中文摘要:大型视觉语言模型存在视觉幻觉导致可靠性风险,为此开发了PRE-HAL评估基准和基于Dempster-Shafer理论的检测方法,该方法在多项测试中优于现有基准指标。
English Summary: Large Vision-Language Models suffer from visual hallucinations causing reliability risks, prompting the creation of the PRE-HAL benchmark for systematic evaluation and a novel Dempster-Shafer theory-based detection method that outperforms existing metrics.
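The detector scores the degree of conflict among pieces of evidence using Dempster-Shafer theory with simple mass functions (all mass on one focal set plus the full frame), which keeps evidence combination cheap. The sketch below computes Dempster's conflict degree K for toy evidence over a {faithful, hallucinated} frame; the frame, the evidence sources, and the aggregation are illustrative assumptions, not the paper's construction.

```python
from itertools import combinations

FRAME = frozenset({"faithful", "hallucinated"})

def simple_mass(focal: frozenset, belief: float) -> dict:
    """A 'simple' mass function: all mass on one focal set plus the frame (ignorance)."""
    return {focal: belief, FRAME: 1.0 - belief}

def conflict(m1: dict, m2: dict) -> float:
    """Dempster's degree of conflict K: total mass assigned to disjoint focal sets."""
    return sum(
        v1 * v2
        for a, v1 in m1.items()
        for b, v2 in m2.items()
        if not (a & b)
    )

# Two pieces of high-level "evidence" (e.g. from different feature groups) that disagree:
m_vision = simple_mass(frozenset({"faithful"}), belief=0.8)
m_reason = simple_mass(frozenset({"hallucinated"}), belief=0.7)
print(f"conflict K = {conflict(m_vision, m_reason):.2f}")   # 0.56 -> suspicious output

def mean_pairwise_conflict(masses: list) -> float:
    """Aggregate conflict over all evidence pairs as an uncertainty score."""
    pairs = list(combinations(masses, 2))
    return sum(conflict(a, b) for a, b in pairs) / max(len(pairs), 1)

print(mean_pairwise_conflict([m_vision, m_reason, simple_mass(frozenset({"faithful"}), 0.6)]))
```

Restricting each mass function to a single focal set plus the frame is what sidesteps the exponential cost of combining evidence over the full power set.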
Authors:Aleksandr Algazinov, Matt Laing, Paul Laban
Abstract:
Accessibility remains a critical concern in today's society, as many technologies are not developed to support the full range of user needs. Existing multi-agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed-source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs the modality conversions based on the user's needs. The system is useful for assisting people with disabilities by ensuring that data will be converted to an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare service) for real-time user assistance. Furthermore, we introduce ModCon-Task-Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon-Task-Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi-Agent-MATE.
中文: MATE是一种多模态无障碍多智能体系统,能根据用户需求进行数据格式转换,通过可定制的模态转换帮助残障人士克服数字障碍,同时本地运行确保隐私安全。
English: MATE is a multimodal accessibility multi-agent system that converts data formats based on user needs, enabling people with disabilities to overcome digital barriers through customizable modality conversions while ensuring local operation for privacy protection.
Authors:Yuelin Zhang, Jiacheng Cen, Jiaqi Han, Wenbing Huang
Abstract:
Equivariant Graph Neural Networks (GNNs) have achieved remarkable success across diverse scientific applications. However, existing approaches face critical efficiency challenges when scaling to large geometric graphs and suffer significant performance degradation when the input graphs are sparsified for computational tractability. To address these limitations, we introduce FastEGNN and DistEGNN, two novel enhancements to equivariant GNNs for large-scale geometric graphs. FastEGNN employs a key innovation: a small ordered set of virtual nodes that effectively approximates the large unordered graph of real nodes. Specifically, we implement distinct message passing and aggregation mechanisms for different virtual nodes to ensure mutual distinctiveness, and minimize Maximum Mean Discrepancy (MMD) between virtual and real coordinates to achieve global distributedness. This design enables FastEGNN to maintain high accuracy while efficiently processing large-scale sparse graphs. For extremely large-scale geometric graphs, we present DistEGNN, a distributed extension where virtual nodes act as global bridges between subgraphs in different devices, maintaining consistency while dramatically reducing memory and computational overhead. We comprehensively evaluate our models across four challenging domains: N-body systems (100 nodes), protein dynamics (800 nodes), Water-3D (8,000 nodes), and our new Fluid113K benchmark (113,000 nodes). Results demonstrate superior efficiency and performance, establishing new capabilities in large-scale equivariant graph learning. Code is available at https://github.com/GLAD-RUC/DistEGNN.
Chinese: FastEGNN和DistEGNN通过虚拟节点有效近似大型几何图,在保持高精度的同时显著提升处理效率,成功应用于多种大规模科学计算场景。
English: FastEGNN and DistEGNN enhance equivariant GNNs by using virtual nodes to efficiently approximate large geometric graphs, maintaining high accuracy and scalability across diverse large-scale applications.
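FastEGNN's virtual nodes are encouraged to cover the real point cloud by minimizing Maximum Mean Discrepancy between virtual and real coordinates. Below is a standard biased RBF-kernel MMD estimator with a toy optimization of the virtual coordinates; the kernel choice and bandwidth are assumptions, not necessarily those used in the paper.

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with an RBF kernel between two point sets
    (e.g. virtual-node coordinates x and real-node coordinates y)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

real = torch.randn(1000, 3)                      # real node coordinates
virtual = torch.randn(8, 3, requires_grad=True)  # a small ordered set of virtual nodes

opt = torch.optim.Adam([virtual], lr=0.05)
for step in range(200):
    loss = rbf_mmd2(virtual, real)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final MMD^2: {loss.item():.4f}")  # virtual nodes spread to cover the real cloud
```

Driving this term to zero is what the abstract calls "global distributedness": the handful of virtual nodes end up summarizing the geometry of the full graph.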
Authors:Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Weigang Lu
Abstract:
Masked Graph Auto-Encoder, a powerful graph self-supervised training paradigm, has recently shown superior performance in graph representation learning. Existing works typically rely on node contextual information to recover the masked information. However, they fail to generalize well to heterophilic graphs where connected nodes may not be similar, because they focus only on capturing the neighborhood information and ignore the discrepancy information between different nodes, resulting in indistinguishable node representations. In this paper, to address this issue, we propose a Discrepancy-Aware Graph Mask Auto-Encoder (DGMAE). It obtains more distinguishable node representations by reconstructing the discrepancy information of neighboring nodes during the masking process. We conduct extensive experiments on 17 widely-used benchmark datasets. The results show that our DGMAE can effectively preserve the discrepancies of nodes in low-dimensional space. Moreover, DGMAE significantly outperforms state-of-the-art graph self-supervised learning methods on three graph analytic tasks, including node classification, node clustering, and graph classification, demonstrating its remarkable superiority. The code of DGMAE is available at https://github.com/zhengziyu77/DGMAE.
中文摘要:本文提出的差异感知图掩码自编码器(DGMAE)通过在掩码过程中重构节点差异信息,解决了现有方法在异配图上的局限性,在多项图分析任务中展现出卓越性能。
English Summary: The proposed Discrepancy-Aware Graph Mask Auto-Encoder (DGMAE) addresses limitations in existing graph auto-encoders by reconstructing node discrepancy information during masking, achieving superior performance on heterophilic graphs across multiple graph analytic tasks.
Authors:Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, Daphne Ippolito
Abstract:
Retrofitting large language models (LLMs) with new behaviors typically requires full finetuning or distillation -- costly steps that must be repeated for every architecture. In this work, we introduce Command-V, a backpropagation-free behavior transfer method that copies an existing residual activation adapter from a donor model and pastes its effect into a recipient model. Command-V profiles layer activations on a small prompt set, derives linear converters between corresponding layers, and applies the donor intervention in the recipient's activation space. This process does not require access to the original training data and needs minimal compute. In three case studies -- safety-refusal enhancement, jailbreak facilitation, and automatic chain-of-thought reasoning -- Command-V matches or exceeds the performance of direct finetuning while using orders of magnitude less compute. Our code and data are accessible at https://github.com/GithuBarry/Command-V/.
中文: Command-V是一种无需反向传播的高效方法,通过复制残差激活适配器并应用线性转换器在大型语言模型间迁移行为,以极低计算成本达到或超越微调效果。
English: Command-V is a computationally efficient, backpropagation-free method that transfers behaviors between large language models by copying residual activation adapters and applying linear converters, matching or surpassing fine-tuning performance with minimal resources.
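The core operation is: profile activations on a small prompt set, fit linear converters between corresponding donor and recipient layers, and replay the donor adapter's residual effect in the recipient's activation space. The least-squares sketch below illustrates that pipeline with random stand-in activations and a placeholder adapter; the shapes and the back-conversion step are assumptions, not the released code.

```python
import numpy as np

def fit_linear_converter(recipient_acts: np.ndarray, donor_acts: np.ndarray) -> np.ndarray:
    """Least-squares map from recipient activation space to donor space, fitted on a
    small profiling prompt set (rows = tokens, columns = hidden dimensions)."""
    W, *_ = np.linalg.lstsq(recipient_acts, donor_acts, rcond=None)
    return W  # shape (d_recipient, d_donor)

def transfer_intervention(h_recipient, W_to_donor, donor_adapter, W_back):
    """Apply a donor-side residual intervention in the recipient's space:
    map over, compute the adapter's residual delta, map the delta back."""
    h_donor = h_recipient @ W_to_donor
    delta = donor_adapter(h_donor)          # donor adapter's residual update
    return h_recipient + delta @ W_back

# Toy stand-ins (hypothetical shapes): 256 profiled tokens, d_rec=512, d_donor=768.
rng = np.random.default_rng(0)
rec = rng.normal(size=(256, 512))
don = rec @ rng.normal(size=(512, 768)) * 0.1 + 0.01 * rng.normal(size=(256, 768))
W_fwd = fit_linear_converter(rec, don)
W_bwd = fit_linear_converter(don, rec)
adapter = lambda h: 0.05 * np.tanh(h)       # placeholder for the donor's trained adapter
print(transfer_intervention(rec[:4], W_fwd, adapter, W_bwd).shape)  # (4, 512)
```

Fitting the converters needs only forward passes on a few hundred prompts, which is why the whole procedure avoids backpropagation through either model.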
Authors:Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Abstract:
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
中文摘要:具备推理能力的大语言模型正在开创“智能深度研究”新范式,通过自主推理与迭代检索的深度融合,显著超越了传统搜索方法,有望成为未来信息获取的主导模式。
English Summary: Large Language Models with reasoning capabilities are pioneering Agentic Deep Research, a new paradigm that integrates autonomous reasoning and iterative retrieval to significantly outperform traditional search methods and redefine future information seeking.
Authors:Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
Abstract:
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
中文: Chain-of-Experts (CoE) 提出在层内实现专家顺序通信的新架构,通过动态路由和迭代处理增强模型表达能力,在数学推理任务上将验证损失从1.20降至1.12,相比传统混合专家模型内存使用减少17.6-42%。
English: Chain-of-Experts (CoE) introduces sequential expert communication within layers, enabling dynamic routing and iterative processing that enhances model capacity and reduces validation loss from 1.20 to 1.12 on math tasks while cutting memory usage by 17.6-42% compared to traditional MoE architectures.
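A minimal sketch of the described mechanism follows: within a single layer, tokens pass through a short chain of expert iterations, each with its own router re-selecting top-k experts, and each iteration adds its output residually. Model sizes, top-k, and the routing details are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Sketch of sequential expert communication within one layer: at each of
    `n_iters` iterations a dedicated router re-selects top-k experts, and the
    expert outputs are added residually. Sizes and top-k are illustrative."""
    def __init__(self, d_model=256, d_ff=512, n_experts=8, n_iters=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # One router per iteration so tokens can re-evaluate their expert choice.
        self.routers = nn.ModuleList([nn.Linear(d_model, n_experts) for _ in range(n_iters)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (tokens, d_model)
        for router in self.routers:
            weights = F.softmax(router(x), dim=-1)             # (tokens, n_experts)
            topw, topi = weights.topk(self.top_k, dim=-1)
            topw = topw / topw.sum(dim=-1, keepdim=True)       # renormalize over chosen experts
            update = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    sel = topi[:, slot] == e
                    if sel.any():
                        update[sel] += topw[sel, slot:slot + 1] * expert(x[sel])
            x = x + update                                     # iterative residual structure
        return x

layer = ChainOfExpertsLayer()
print(layer(torch.randn(10, 256)).shape)   # torch.Size([10, 256])
```

Because the router runs again at every iteration, a token can be handled by one expert combination in the first pass and a different one in the second, which is the source of the richer expert combinations the abstract describes.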
Authors:Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
Abstract:
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with a mix of benign and malicious data, as well as on purely benign datasets, evaluating SPLoRA across utility, safety, and reliability metrics. Results demonstrate that SPLoRA outperforms state-of-the-art safety alignment techniques, significantly reducing safety risks while maintaining or improving model performance and reliability. Additionally, SPLoRA reduces inference overhead, making it a scalable and efficient solution for deploying safer and more reliable LLMs. The code is available at https://github.com/AoShuang92/SPLoRA.
中文摘要:SPLoRA是一种基于剪枝的新方法,通过E-DIEM度量选择性移除不安全的LoRA层,在提升微调后大语言模型安全性的同时保持性能优势,并降低推理开销。
English Summary: SPLoRA is a novel pruning-based method that enhances the safety of fine-tuned LLMs by selectively removing unsafe LoRA layers using the E-DIEM metric, achieving superior safety-utility balance with reduced inference overhead.
Authors:Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary
Abstract:
The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges, including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations, coupled with energy inefficiencies, arise mainly from the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: https://github.com/zser092/Focused-Attention-ViT).
中文摘要:本文提出基于超像素的补丁池化技术和轻量潜在注意力模块,有效降低视觉Transformer的复杂度,在保持与先进方法相当性能的同时显著提升计算效率。
English Summary: This paper introduces a Super-Pixel Based Patch Pooling technique and Light Latent Attention module to reduce Vision Transformer complexity, improving computational efficiency while maintaining competitive performance with state-of-the-art methods.
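To make the latent cross-attention idea behind the LLA module concrete, here is a minimal sketch: a small set of learnable latent tokens attends to the patch embeddings, so the attention cost scales with the number of latents times the number of patches rather than quadratically in patches. The layer sizes and wiring are assumptions, not the authors' implementation.

```python
# Minimal sketch of latent cross-attention in the spirit of the LLA module
# (module sizes and wiring are illustrative assumptions).
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, num_latents: int = 16, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, dim)
        b = patch_embeddings.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)        # (b, M, dim) latent queries
        # Cross-attention: M latents attend to N patches -> O(M*N) instead of O(N^2).
        out, _ = self.attn(q, patch_embeddings, patch_embeddings)
        return out                                              # (b, M, dim)

x = torch.randn(2, 196, 128)           # e.g., 14x14 superpixel-pooled patch embeddings
print(LatentCrossAttention()(x).shape)  # torch.Size([2, 16, 128])
```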
Authors:Jie Li, Shifei Ding, Lili Guo, Xuan Li
Abstract:
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
Chinese: 提出的MAGTKD模型通过提示学习和知识蒸馏增强模态表示,并利用多模态锚点门控变换器有效整合各模态信息,在基准数据集上实现了对话中情感识别的最新性能。
English: The proposed MAGTKD model enhances emotion recognition in conversations by using prompt learning and knowledge distillation to improve modality representations and integrates them effectively with a multi-modal anchor gated transformer, achieving state-of-the-art results on benchmark datasets.
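The knowledge-distillation step described above, where the stronger text modality supervises weaker modalities, can be sketched as a standard soft-label distillation loss. The temperature, loss weighting, and class count below are illustrative assumptions rather than MAGTKD's exact configuration.

```python
# Minimal sketch of distilling from a stronger (text) modality to a weaker one;
# temperature T and weight alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Soft KL term from the teacher plus the usual cross-entropy on emotion labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 7, requires_grad=True)   # weaker-modality logits, 7 emotion classes
teacher = torch.randn(8, 7)                        # text-modality logits (teacher, detached)
labels = torch.randint(0, 7, (8,))
print(distillation_loss(student, teacher, labels).item())
```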
Authors:Yuchang Zhu, Jintang Li, Huizhe Zhang, Liang Chen, Zibin Zheng
Abstract:
Individual fairness (IF) in graph neural networks (GNNs), which emphasizes that similar individuals should receive similar outcomes from GNNs, has been a critical issue. Despite its importance, this area remains largely underexplored in terms of (1) a clear understanding of what induces individual unfairness in GNNs and (2) a comprehensive consideration of identifying similar individuals. To bridge these gaps, we conduct a preliminary analysis to explore the underlying reason for individual unfairness and observe correlations between IF and similarity consistency, a concept introduced to evaluate the discrepancy in identifying similar individuals based on graph structure versus node features. Inspired by our observations, we introduce two metrics to assess individual similarity from two distinct perspectives: topology fusion and feature fusion. Building upon these metrics, we propose Similarity-aware GNNs for Individual Fairness, named SaGIF. The key insight behind SaGIF is the integration of individual similarities by independently learning similarity representations, leading to an improvement of IF in GNNs. Our experiments on several real-world datasets validate the effectiveness of our proposed metrics and SaGIF. Specifically, SaGIF consistently outperforms state-of-the-art IF methods while maintaining utility performance. Code is available at: https://github.com/ZzoomD/SaGIF.
中文: 本研究针对图神经网络中的个体公平性问题,提出了两种相似性度量指标和SaGIF方法,通过独立学习相似性表示来提升公平性,同时保持模型性能。
English: This study addresses individual fairness in graph neural networks by introducing two similarity metrics and proposing SaGIF, a method that improves fairness through independent similarity representation learning while maintaining performance.
Authors:Kurt Butler, Guanchao Feng, Tong Chen, Petar Djuric
Abstract:
Probabilistic models are often used to make predictions in regions of the data space where no observations are available, but it is not always clear whether such predictions are well-informed by previously seen data. In this paper, we propose a knowledge score for predictions from Gaussian process regression (GPR) models that quantifies the extent to which the observed data have reduced our uncertainty about a prediction. The knowledge score is interpretable and naturally bounded between 0 and 1. We demonstrate in several experiments that the knowledge score can anticipate when predictions from a GPR model are accurate, and that this anticipation improves performance in tasks such as anomaly detection, extrapolation, and missing data imputation. Source code for this project is available online at https://github.com/KurtButler/GP-knowledge.
Chinese: 本文为高斯过程回归模型提出了一种知识评分,用于量化预测中不确定性的减少程度,并证明该评分在异常检测和数据插补等任务中能有效提升性能。
English: This paper introduces a knowledge score for Gaussian process regression models that measures uncertainty reduction in predictions, demonstrating its effectiveness in improving tasks like anomaly detection and data imputation.
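A worked example helps make a score of this kind tangible. The abstract does not give the exact formula, so the sketch below assumes a plausible [0, 1]-bounded instantiation, the fractional reduction of prior variance, 1 - var_post / var_prior; the paper's definition may differ.

```python
# Minimal sketch of a knowledge-style score for GPR predictions, assuming it is the
# fractional reduction in prior variance, 1 - var_post/var_prior (an assumed form,
# not necessarily the paper's exact definition).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(30, 1))
y_train = np.sin(X_train).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4).fit(X_train, y_train)

X_test = np.linspace(-3, 9, 7).reshape(-1, 1)
_, post_std = gpr.predict(X_test, return_std=True)
prior_var = np.diag(gpr.kernel_(X_test))           # prior variance k(x, x)
knowledge = np.clip(1.0 - post_std**2 / prior_var, 0.0, 1.0)

for x, k in zip(X_test.ravel(), knowledge):
    print(f"x = {x:5.2f}  knowledge = {k:.3f}")     # high near training data, near 0 far away
```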
Authors:Mauricio Byrd Victorica, György Dán, Henrik Sandberg
Abstract:
State-of-the-art convolutional neural network models for object detection and image classification are vulnerable to physically realizable adversarial perturbations, such as patch attacks. Existing defenses have focused, implicitly or explicitly, on single-patch attacks, leaving their sensitivity to the number of patches as an open question or rendering them computationally infeasible or inefficient against attacks consisting of multiple patches in the worst cases. In this work, we propose SpaNN, an attack detector whose computational complexity is independent of the expected number of adversarial patches. The key novelty of the proposed detector is that it builds an ensemble of binarized feature maps by applying a set of saliency thresholds to the neural activations of the first convolutional layer of the victim model. It then performs clustering on the ensemble and uses the cluster features as the input to a classifier for attack detection. Contrary to existing detectors, SpaNN does not rely on a fixed saliency threshold for identifying adversarial regions, which makes it robust against white box adversarial attacks. We evaluate SpaNN on four widely used data sets for object detection and classification, and our results show that SpaNN outperforms state-of-the-art defenses by up to 11 and 27 percentage points for object detection and image classification, respectively. Our code is available at https://github.com/gerkbyrd/SpaNN.
中文: 提出的SpaNN检测器通过使用二值化特征图集合和聚类技术,有效识别对抗性补丁攻击,其计算效率与补丁数量无关,性能显著优于现有防御方法。
English: The proposed SpaNN detector effectively identifies adversarial patch attacks by using an ensemble of binarized feature maps and clustering, outperforming existing defenses with computational efficiency independent of patch count.
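The detection pipeline described above, binarize first-layer saliency at several thresholds, cluster each binarized map, and classify the resulting cluster statistics, can be sketched as follows. The saliency definition, threshold values, and feature set are stand-ins for illustration, not SpaNN's exact choices.

```python
# Minimal sketch of the SpaNN idea: binarize a first-layer saliency map at several
# thresholds, cluster each binarized map, and feed cluster statistics to a detector.
# Saliency definition, thresholds, and features here are illustrative stand-ins.
import numpy as np
from scipy.ndimage import label

def ensemble_cluster_features(activations: np.ndarray, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """activations: (C, H, W) first-conv feature maps of the victim model."""
    saliency = np.abs(activations).mean(axis=0)                # (H, W) channel-mean saliency
    saliency = saliency / (saliency.max() + 1e-8)
    feats = []
    for t in thresholds:
        mask = saliency > t
        labeled, n_clusters = label(mask)                       # connected-component clustering
        sizes = np.bincount(labeled.ravel())[1:] if n_clusters else np.array([0])
        feats += [n_clusters, sizes.max(), sizes.mean()]        # simple per-threshold statistics
    return np.array(feats, dtype=np.float32)

# Toy usage: the resulting feature vectors would train a small attack classifier.
clean = ensemble_cluster_features(np.random.rand(64, 56, 56))
print(clean.shape)  # (12,) = 3 features x 4 thresholds
```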
Authors:Aniss Bessalah, Hatem Mohamed Abdelmoumen, Karima Benatchba, Hadjer Benmeziane
Abstract:
Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm for accelerating Deep Neural Networks (DNNs), offering significant energy and latency benefits over conventional digital hardware. However, state-of-the-art neural networks are not inherently designed for AIMC, as they fail to account for its unique non-idealities. Neural Architecture Search (NAS) is thus needed to systematically discover neural architectures optimized explicitly for AIMC constraints. However, comparing NAS methodologies and extracting insights about robust architectures for AIMC requires a dedicated NAS benchmark that explicitly accounts for AIMC-specific hardware non-idealities. To address this, we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for AIMC. Our study reveals three key insights: (1) standard quantization techniques fail to capture AIMC-specific noises, (2) robust architectures tend to feature wider and branched blocks, (3) skip connections improve resilience to temporal drift noise. These insights highlight the limitations of current NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the implementations used in this paper can be found at https://github.com/IBM/analog-nas/tree/main/analognasbench.
中文: 模拟内存计算(AIMC)能高效加速深度神经网络,但现有网络和架构搜索基准未考虑其特殊非理想特性,因此推出AnalogNAS-Bench基准来发现稳健架构并揭示关键设计规律。
English: Analog In-memory Computing (AIMC) accelerates deep neural networks efficiently, but current neural networks and NAS benchmarks overlook AIMC-specific non-idealities, prompting the introduction of AnalogNAS-Bench to identify robust architectures and reveal key design insights.
Authors:Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan
Abstract:
We introduce Confucius3-Math, an open-source large language model with 14B parameters that (1) runs efficiently on a single consumer-grade GPU; (2) achieves SOTA performance on a range of mathematical reasoning tasks, outperforming many models with significantly larger sizes. In particular, as part of our mission to enhance education and knowledge dissemination with AI, Confucius3-Math is specifically committed to mathematics learning for Chinese K-12 students and educators. Built via post-training with large-scale reinforcement learning (RL), Confucius3-Math aligns with the national curriculum and excels at solving mainstream Chinese K-12 mathematical problems at low cost. In this report we share our development recipe, the challenges we encounter and the techniques we develop to overcome them. In particular, we introduce three technical innovations: Targeted Entropy Regularization, Recent Sample Recovery and Policy-Specific Hardness Weighting. These innovations encompass a new entropy regularization, a novel data scheduling policy, and an improved group-relative advantage estimator. Collectively, they significantly stabilize the RL training, improve data efficiency, and boost performance. Our work demonstrates the feasibility of building strong reasoning models in a particular domain at low cost. We open-source our model and code at https://github.com/netease-youdao/Confucius3-Math.
Chinese: Confucius3-Math是一个开源140亿参数模型,可在消费级GPU上高效运行,在数学推理任务中达到顶尖水平,并针对中国K-12数学教育通过强化学习技术创新实现了低成本高性能。
English: Confucius3-Math is an open-source 14B-parameter model that efficiently runs on consumer GPUs and achieves state-of-the-art performance in mathematical reasoning, specifically tailored for Chinese K-12 education with innovations in reinforcement learning training.
Authors:Muhammad Usama, Hee-Deok Jang, Soham Shanbhag, Yoo-Chang Sung, Seung-Jun Bae, Dong Eui Chang
Abstract:
This paper addresses the dual challenge of improving anomaly detection and signal integrity in high-speed dynamic random access memory signals. To achieve this, we propose a joint training framework that integrates an autoencoder with a classifier to learn more distinctive latent representations by focusing on valid data features. Our approach is evaluated across three anomaly detection algorithms and consistently outperforms two baseline methods. Detailed ablation studies further support these findings. Furthermore, we introduce a signal integrity enhancement algorithm that improves signal integrity by an average of 11.3%. The source code and data used in this study are available at https://github.com/Usama1002/learning-latent-representations.
中文: 本文提出了一种结合自动编码器和分类器的联合训练框架,用于提升高速动态随机存取存储器信号的异常检测与信号完整性,其性能优于基线方法并实现了11.3%的平均信号改善。
English: This paper introduces a joint training framework combining an autoencoder and classifier to enhance anomaly detection and signal integrity in high-speed DRAM signals, demonstrating superior performance over baselines and achieving an 11.3% average signal improvement.
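The joint autoencoder-plus-classifier objective described above amounts to a reconstruction loss on the input plus a classification loss on the latent code, optimized together. Layer sizes and the loss weight in this sketch are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the joint autoencoder + classifier objective; layer sizes and the
# 0.5 loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class JointAEClassifier(nn.Module):
    def __init__(self, in_dim=128, latent_dim=16, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = JointAEClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 128)                       # toy stand-in for valid DRAM signal windows
y = torch.randint(0, 2, (32,))
recon, logits = model(x)
loss = nn.functional.mse_loss(recon, x) + 0.5 * nn.functional.cross_entropy(logits, y)
loss.backward(); opt.step()                    # joint update shapes a more distinctive latent space
print(float(loss))
```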
Authors:Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.
Chinese: RLPR是一种无需验证器的强化学习框架,它利用大语言模型自身对参考答案的标记概率作为奖励信号,有效提升了在通用领域和数学领域的推理能力,并超越了现有方法的性能。
English: RLPR is a verifier-free reinforcement learning framework that uses an LLM's own token probability scores for reference answers as the reward signal, effectively enhancing reasoning capabilities across both general and mathematical domains while outperforming existing methods.
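The core reward signal in RLPR, the policy's own probability of the reference answer, can be sketched with any causal LM. Below, GPT-2 stands in for the policy, and the mean token probability of the reference tokens is used as the reward; the paper's prob-to-reward transform and variance-stabilization steps are not reproduced.

```python
# Minimal sketch of a probability-based reward in the spirit of RLPR: score a rollout by
# the policy's mean token probability of the reference answer. GPT-2 is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def reference_prob_reward(prompt: str, reference: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    ref_ids = tok(reference, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, ref_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                # (1, T, vocab)
    # Log-probs the model assigns to each token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    ref_lp = token_lp[:, prompt_ids.size(1) - 1:]       # slice covering the reference tokens
    return ref_lp.exp().mean().item()                   # mean token probability as the reward

print(reference_prob_reward("Q: 2 + 2 = ?\nA:", " 4"))
```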
Authors:Zih-Hao Huang, You-Teng Lin, Hung-Hsuan Chen
Abstract:
This paper introduces Decoupled Supervised Learning with Information Regularization (DeInfoReg), a novel approach that transforms a long gradient flow into multiple shorter ones, thereby mitigating the vanishing gradient problem. Integrating a pipeline strategy, DeInfoReg enables model parallelization across multiple GPUs, significantly improving training throughput. We compare our proposed method with standard backpropagation and other gradient flow decomposition techniques. Extensive experiments on diverse tasks and datasets demonstrate that DeInfoReg achieves superior performance and better noise resistance than traditional BP models and efficiently utilizes parallel computing resources. The code for reproducibility is available at: https://github.com/ianzih/Decoupled-Supervised-Learning-for-Information-Regularization/.
中文: 本文提出的DeInfoReg方法通过将长梯度流分解为短片段解决梯度消失问题,并利用多GPU流水线并行策略显著提升训练效率和模型性能。
English: This paper presents DeInfoReg, a method that breaks long gradient flows into shorter segments to combat vanishing gradients and leverages pipeline parallelism across GPUs for enhanced training efficiency and performance.
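The decoupling idea above, cutting one long gradient flow into several short ones, can be sketched with block-wise local losses and a detach at each block boundary. The information-regularization term itself is not reproduced; plain local cross-entropy stands in, and the block shapes are assumptions.

```python
# Minimal sketch of decoupled block-wise training: each block has a local head and loss,
# and gradients are cut between blocks with detach(). The DeInfoReg regularizer is not
# reproduced; local cross-entropy stands in.
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Sequential(nn.Linear(d_in, 64), nn.ReLU())
                        for d_in in (32, 64, 64)])
heads = nn.ModuleList([nn.Linear(64, 10) for _ in blocks])      # one local classifier per block
opts = [torch.optim.Adam(list(b.parameters()) + list(h.parameters()), lr=1e-3)
        for b, h in zip(blocks, heads)]

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h.detach())                     # detach: gradient flow stops at the block boundary
    loss = nn.functional.cross_entropy(head(h), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because each block's backward pass is independent of the others, the blocks can in principle be placed on different GPUs and trained in a pipelined fashion, which is the parallelization angle the abstract highlights.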
Authors:Donghyun Lee, Yuhang Li, Ruokai Yin, Shiting Xiao, Priyadarshini Panda
Abstract:
State Space Models (SSMs) have emerged as powerful alternatives to attention-based Transformers, with Mamba demonstrating impressive efficiency and scalability. As these models grow increasingly larger, the need for Parameter-Efficient Fine-Tuning (PEFT) methods becomes critical to adapt pre-trained Mamba to downstream tasks without prohibitive computational costs. However, previous approaches simply apply traditional Transformer-tailored PEFT methods without addressing the unique temporal processing dynamics of SSMs. To address this limitation, we propose Memba, a membrane-driven PEFT approach specifically designed for Mamba. Memba introduces Leaky Integrate Membrane (LIM) neurons as bio-inspired gating mechanisms that naturally accumulate membrane potentials over time, enhancing selective information retention. By strategically combining LIM neurons with Low-Rank Adaptations (LoRA) and cross-layer membrane transfer, our approach significantly improves Mamba's temporal modeling capabilities. Extensive experiments across language and vision tasks demonstrate that Memba achieves substantial improvements over existing PEFT methods. The code is available at https://github.com/Intelligent-Computing-Lab-Yale/Memba.
中文: 状态空间模型如Mamba是Transformer的高效替代方案,而提出的Memba方法通过引入具有膜电位的仿生门控机制,显著提升了参数高效微调中的时序建模能力。
English: State Space Models like Mamba offer efficient alternatives to Transformers, and the proposed Memba method introduces bio-inspired gating mechanisms with membrane potentials to enhance temporal modeling through Parameter-Efficient Fine-Tuning.
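A leaky-integrate membrane gate of the kind the LIM neurons describe can be sketched as an exponential moving accumulation of inputs that modulates the sequence through a sigmoid. The decay constant, gating form, and placement relative to LoRA are assumptions for illustration.

```python
# Minimal sketch of a leaky-integrate membrane gate in the spirit of Memba's LIM neurons;
# the decay constant and gating form are illustrative assumptions.
import torch
import torch.nn as nn

class LIMGate(nn.Module):
    def __init__(self, dim: int, decay: float = 0.9):
        super().__init__()
        self.decay = decay
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim). The membrane potential accumulates inputs over time.
        membrane = torch.zeros_like(x[:, 0])
        gated = []
        for t in range(x.size(1)):
            membrane = self.decay * membrane + (1.0 - self.decay) * self.proj(x[:, t])
            gated.append(x[:, t] * torch.sigmoid(membrane))   # membrane-driven gating
        return torch.stack(gated, dim=1)

x = torch.randn(2, 10, 64)
print(LIMGate(64)(x).shape)   # torch.Size([2, 10, 64])
```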
Authors:Xiangfei Qiu, Zhe Li, Wanghui Qiu, Shiyan Hu, Lekui Zhou, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Aoying Zhou, Zhenli Sheng, Jilin Hu, Christian S. Jensen, Bin Yang
Abstract:
Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of reliable means of evaluating new methods and comparing them with existing methods. We address deficiencies in current evaluation procedures related to datasets and experimental settings and protocols. Specifically, we propose a new time series anomaly detection benchmark, called TAB. First, TAB encompasses 29 public multivariate datasets and 1,635 univariate time series from different domains to facilitate more comprehensive evaluations on diverse datasets. Second, TAB covers a variety of TSAD methods, including Non-learning, Machine learning, Deep learning, LLM-based, and Time-series pre-trained methods. Third, TAB features a unified and automated evaluation pipeline that enables fair and easy evaluation of TSAD methods. Finally, we employ TAB to evaluate existing TSAD methods and report on the outcomes, thereby offering a deeper insight into the performance of these methods. Besides, all datasets and code are available at https://github.com/decisionintelligence/TAB.
中文: 该摘要介绍了TAB这一新型时间序列异常检测基准,它整合了多样化数据集、多种检测方法类型和统一评估流程,旨在解决现有评估缺陷并提供公平的性能比较。
English: The abstract introduces TAB, a new benchmark for time series anomaly detection that includes diverse datasets, multiple method types, and a unified evaluation pipeline to address current evaluation deficiencies and provide fair performance comparisons.
Authors:Jianyu Wang, Zhiqiang Hu, Lidong Bing
Abstract:
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent "gibberish" can remarkably improve performance across diverse tasks. Notably, the "gibberish" always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose PromptQuine, a self-discovering prompt optimization framework: an evolutionary search procedure that automatically finds the pruning strategy by itself using only low-data regimes. Much like the emergent complexity in nature--such as symbiosis and self-organization--arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning, and provide a call to action, to pave the way for more open-ended search algorithms for more effective LLM prompting.
中文: 本研究提出了一种新颖的提示设计范式,证明将随机示例修剪为看似不连贯的"乱码"能在多种任务中超越传统提示方法和最优自动优化技术,同时开发了PromptQuine进化框架,仅需少量数据即可自主发现有效的修剪策略。
English: This study introduces a novel prompt design paradigm that demonstrates pruning random demonstrations into seemingly incoherent "gibberish" can outperform conventional prompting methods and state-of-the-art optimization techniques across various tasks, while also proposing PromptQuine, an evolutionary framework that autonomously discovers effective pruning strategies with minimal data.
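An evolutionary token-pruning search of the kind PromptQuine performs can be sketched with binary keep/drop masks over the prompt tokens, mutation by random bit flips, and selection by a task-fitness function. The fitness function and mutation schedule below are toy stand-ins, not the framework's actual procedure.

```python
# Minimal sketch of evolutionary token pruning in the spirit of PromptQuine.
# `fitness` is a toy stand-in for held-out task accuracy with the pruned prompt.
import random

def mutate(mask, p_flip=0.1):
    return [bit if random.random() > p_flip else 1 - bit for bit in mask]

def apply_mask(tokens, mask):
    return " ".join(t for t, keep in zip(tokens, mask) if keep)

def evolve_prompt(tokens, fitness, generations=50, population=16):
    pool = [[1] * len(tokens) for _ in range(population)]
    for _ in range(generations):
        scored = sorted(pool, key=lambda m: fitness(apply_mask(tokens, m)), reverse=True)
        parents = scored[: population // 4]                    # keep the fittest pruning masks
        pool = parents + [mutate(random.choice(parents)) for _ in range(population - len(parents))]
    best = max(pool, key=lambda m: fitness(apply_mask(tokens, m)))
    return apply_mask(tokens, best)

# Toy fitness: pretend shorter prompts that still contain the word "answer" score best.
tokens = "Here is one random demonstration : question ... answer ...".split()
toy_fitness = lambda prompt: ("answer" in prompt) * 10 - len(prompt.split())
print(evolve_prompt(tokens, toy_fitness))
```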
Authors:Jianghong Huang, Luping Ji, Xin Ma, Mao Ye
Abstract:
Conveyor belts are important equipment in modern industry, widely applied in production and manufacturing. Their health is critical to operational efficiency and safety, and cracks are a major threat to belt health. Considering safety, how to intelligently detect belt cracks is therefore attracting increasing attention. To implement intelligent detection with machine learning, real crack samples are believed to be necessary. However, existing crack datasets primarily focus on pavement scenarios or synthetic data; there are no real-world industrial belt crack datasets at all. To fill this gap, we build two whole-new real-world belt crack datasets. Furthermore, to validate their usability and effectiveness, we propose a special baseline method with triple-domain (i.e., time-space-frequency) feature hierarchical fusion learning for the two datasets. Experimental results demonstrate the usability and effectiveness of our datasets. They also show that our baseline is clearly superior to other similar detection methods. Our datasets and source codes are available at https://github.com/UESTC-nnLab/BeltCrack.
中文: 本研究提出了两个全新的工业传送带真实裂纹数据集及基于时-空-频三域特征分层融合的基线方法,实验证明该方法显著优于同类检测方案。
English: This study introduces two novel real-world industrial conveyor belt crack datasets and a baseline method using triple-domain feature fusion learning, which proves more effective than existing detection approaches.
Authors:Chenghao Yang, Ari Holtzman
Abstract:
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
Chinese: 对齐后的大型语言模型输出多样性降低源于概率集中现象,通过分支因子量化发现生成过程中分支因子递减且对齐调优使其大幅降低,这解释了模型稳定性及对解码策略不敏感的原因,并揭示了思维链模型通过延长推理进入确定性阶段实现稳定输出的机制。
English: Aligned large language models exhibit reduced output diversity due to probability concentration, quantified by the Branching Factor which decreases during generation and is substantially lowered by alignment tuning, explaining their stability and insensitivity to decoding strategies.
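A concrete way to read the Branching Factor is as the effective number of plausible next tokens at each generation step. The sketch below assumes BF is the exponentiated entropy (perplexity) of the next-token distribution averaged over steps; the paper's token-invariant estimator may differ in detail.

```python
# Minimal sketch of a branching-factor-style quantity, assuming BF is the exponentiated
# entropy of the next-token distribution averaged over generation steps (an assumed form).
import numpy as np

def branching_factor(step_distributions):
    """step_distributions: list of next-token probability vectors, one per generation step."""
    bfs = []
    for p in step_distributions:
        p = np.asarray(p, dtype=np.float64)
        p = p / p.sum()
        entropy = -(p * np.log(p + 1e-12)).sum()
        bfs.append(np.exp(entropy))             # effective number of plausible next tokens
    return float(np.mean(bfs))

flat = [np.ones(12) / 12] * 5                   # 12 equally plausible continuations per step
peaked = [np.array([0.96] + [0.04 / 11] * 11)] * 5
print(branching_factor(flat))                   # ~12: a diverse, base-model-like distribution
print(branching_factor(peaked))                 # ~1.3: a concentrated, aligned-model-like one
```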
Authors:Jianhang Xie, Chuntao Ding, Xiaqing Li, Shenyuan Ren, Yidong Li, Zhichao Lu
Abstract:
Deploying quantized deep neural network (DNN) models with resource adaptation capabilities on ubiquitous Internet of Things (IoT) devices to provide high-quality AI services can leverage the benefits of compression and meet multi-scenario resource requirements. However, existing dynamic/mixed precision quantization requires retraining or special hardware, whereas post-training quantization (PTQ) has two limitations for resource adaptation: (i) the state-of-the-art PTQ methods only provide one fixed bitwidth model, which makes it challenging to adapt to the dynamic resources of IoT devices; (ii) deploying multiple PTQ models with diverse bitwidths consumes large storage resources and switching overheads. To this end, this paper introduces a resource-friendly post-training integer-nesting quantization, i.e., NestQuant, for on-device quantized model switching on IoT devices. The proposed NestQuant incorporates integer weight decomposition, which bit-wise splits quantized weights into higher-bit and lower-bit weights of integer data types. It also contains a decomposed weights nesting mechanism to optimize the higher-bit weights by adaptive rounding and nest them into the original quantized weights. In deployment, we can send and store only one NestQuant model and switch between the full-bit/part-bit model by paging in/out lower-bit weights to adapt to resource changes and reduce consumption. Experimental results on ImageNet-1K pretrained DNNs demonstrate that the NestQuant model achieves high top-1 accuracy while reducing data transmission, storage consumption, and switching overheads. In particular, the ResNet-101 with INT8 nesting INT6 can achieve 78.1% and 77.9% accuracy for full-bit and part-bit models, respectively, and reduce switching overheads by approximately 78.1% compared with diverse-bitwidth PTQ models.
中文: 本文提出NestQuant这一资源友好的训练后量化方法,通过整数权重分解与嵌套机制实现物联网设备上的动态模型切换,在保持高精度的同时显著降低了存储与传输开销。
English: This paper introduces NestQuant, a resource-friendly post-training quantization method that enables dynamic model switching on IoT devices by decomposing weights into integer-nested formats, reducing storage and transmission overhead while maintaining high accuracy.
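The bit-wise splitting described above can be illustrated directly on integer arrays: an INT8 weight decomposes into a higher-bit part (a multiple of 4, equivalent to an INT6 value scaled by 4) and a 2-bit remainder, and adding them back reconstructs the full-bit weight exactly. The adaptive-rounding nesting step is omitted here.

```python
# Minimal sketch of bit-wise integer weight splitting in the spirit of NestQuant:
# INT8 weight = higher-bit part (multiple of 4, i.e. INT6 scaled by 4) + 2-bit remainder.
# The adaptive-rounding nesting mechanism is omitted.
import numpy as np

w_int8 = np.random.randint(-128, 128, size=8, dtype=np.int8)

low = (w_int8.astype(np.int16) & 0b11).astype(np.int16)     # 2 least-significant bits (0..3)
high = w_int8.astype(np.int16) - low                         # higher-bit part, multiple of 4

assert np.array_equal(high + low, w_int8.astype(np.int16))   # full-bit model: exact reconstruction
print("INT8 weights:", w_int8)
print("part-bit (INT6-equivalent) weights:", high // 4)      # served when resources are tight
print("paged-in lower bits:", low)
```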
Authors:Suyash Gaurav, Jukka Heikkonen, Jatin Chaudhary
Abstract:
Continual learning systems face the dual challenge of preventing catastrophic forgetting while maintaining energy efficiency, particularly in resource-constrained environments. This paper introduces Pathway-based Progressive Inference (PaPI), a novel theoretical framework that addresses these challenges through a mathematically rigorous approach to pathway selection and adaptation. We formulate continual learning as an energy-constrained optimization problem and provide formal convergence guarantees for our pathway routing mechanisms. Our theoretical analysis demonstrates that PaPI achieves an $\mathcal{O}(K)$ improvement in the stability-plasticity trade-off compared to monolithic architectures, where $K$ is the number of pathways. We derive tight bounds on forgetting rates using Fisher Information Matrix analysis and prove that PaPI's energy consumption scales with the number of active parameters rather than the total model size. Comparative theoretical analysis shows that PaPI provides stronger guarantees against catastrophic forgetting than Elastic Weight Consolidation (EWC) while maintaining better energy efficiency than both EWC and Gradient Episodic Memory (GEM). Our experimental validation confirms these theoretical advantages across multiple benchmarks, demonstrating PaPI's effectiveness for continual learning in energy-constrained settings. Our codes are available at https://github.com/zser092/PAPI_FILES.
中文: 本文提出基于路径的渐进推理(PaPI)理论框架,通过路径选择和适应的数学方法,在持续学习中优化稳定性与可塑性的平衡,并显著提高能源效率,同时降低遗忘率。
English: This paper introduces Pathway-based Progressive Inference (PaPI), a theoretical framework that enhances continual learning by improving the stability-plasticity trade-off and energy efficiency through pathway selection and adaptation, with proven convergence and reduced forgetting rates.
Authors:Pratik Kunapuli, Jake Welde, Dinesh Jayaraman, Vijay Kumar
Abstract:
Learning-based control approaches like reinforcement learning (RL) have recently produced a slew of impressive results for tasks like quadrotor trajectory tracking and drone racing. Naturally, it is common to demonstrate the advantages of these new controllers against established methods like analytical controllers. We observe, however, that reliably comparing the performance of such very different classes of controllers is more complicated than might appear at first sight. As a case study, we take up the problem of agile tracking of an end-effector for a quadrotor with a fixed arm. We develop a set of best practices for synthesizing the best-in-class RL and geometric controllers (GC) for benchmarking. In the process, we resolve widespread RL-favoring biases in prior studies that provide asymmetric access to: (1) the task definition, in the form of an objective function, (2) representative datasets, for parameter optimization, and (3) feedforward information, describing the desired future trajectory. The resulting findings are the following: our improvements to the experimental protocol for comparing learned and classical controllers are critical, and each of the above asymmetries can yield misleading conclusions. Prior works have claimed that RL outperforms GC, but we find the gaps between the two controller classes are much smaller than previously published when accounting for symmetric comparisons. Geometric control achieves lower steady-state error than RL, while RL has better transient performance, resulting in GC performing better in relatively slow or less agile tasks, but RL performing better when greater agility is required. Finally, we open-source implementations of geometric and RL controllers for these aerial vehicles, implementing best practices for future development. Website and code are available at https://pratikkunapuli.github.io/rl-vs-gc/
Authors:Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, Joseph Campbell
Abstract:
Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/
Chinese Summary: 该研究提出了一种混合推理框架,将大型语言模型与结构化概率推理相结合,在社交推理游戏《阿瓦隆》中不仅取得了与更大模型相媲美的表现,还首次在受控研究中击败了人类玩家。
English Summary: The study introduces a hybrid reasoning framework that combines large language models with structured probabilistic inference, achieving competitive performance and even surpassing human players in the social deduction game Avalon.
Authors:Keigo Nishida, Eren Mehmet Kıral, Kenichi Bannai, Mohammad Emtiyaz Khan, Thomas Möllenhoff
Abstract:
Studies in neuroscience have shown that biological synapses follow a log-normal distribution whose transitioning can be explained by noisy multiplicative dynamics. Biological networks can function stably even under dynamically fluctuating conditions arising due to unreliable synaptic transmissions. Here we ask: Is it possible to design similar multiplicative training in artificial neural networks? To answer this question, we derive a Bayesian learning rule that assumes log-normal posterior distributions over weights which gives rise to a new Log-Normal Multiplicative Dynamics (LMD) algorithm. The algorithm uses multiplicative updates with both noise and regularization applied multiplicatively. The method is as easy to implement as Adam and only requires one additional vector to store. Our results show that LMD achieves stable and accurate training-from-scratch under low-precision forward operations for Vision Transformer and GPT-2. These results suggest that multiplicative dynamics, a biological feature, may enable stable low-precision inference and learning on future energy-efficient hardware.
Chinese: 该研究受生物突触分布启发,提出了一种对数正态乘法动力学(LMD)算法,能在低精度条件下稳定准确地训练视觉Transformer和GPT-2等神经网络,显示出其在未来节能硬件中的应用潜力。
English: The study introduces a Log-Normal Multiplicative Dynamics (LMD) algorithm, inspired by biological synaptic distributions, which enables stable and accurate training of neural networks like Vision Transformer and GPT-2 under low-precision conditions, suggesting its potential for energy-efficient hardware.
Authors:Fabien Furfaro
Abstract:
Transformer-based large language models (LLMs) have achieved strong performance across many natural language processing tasks. Nonetheless, their quadratic computational and memory requirements, particularly in self-attention layers, pose challenges for efficient inference on long contexts and for deployment in resource-limited environments. We present TPTT (Transforming Pretrained Transformers into Titans), a framework designed to augment pretrained Transformers with linearized attention (LiZA) and internal memory gating via Memory as Gate (MaG), applied without full retraining. TPTT supports parameter-efficient fine-tuning (LoRA) and integrates with standard toolkits such as Hugging Face Transformers. We evaluated TPTT on several pretrained models, including Llama-1B, OlMoE-1B-7B, Qwen2.5-1.5B, Gemma3-270m, OpenELM-1.3B, and Mistral-7B, in order to assess applicability across architectures of different scales. Experiments on models with approximately 1 billion parameters, evaluated primarily on the MMLU benchmark, suggest potential improvements in both efficiency and accuracy compared to baseline models. For example, Titans-Llama-1B exhibited up to a 20\% relative increase in Exact Match scores in one-shot evaluation. An additional finding is that it is possible to convert a quadratic-attention model into a purely linear-attention model using the DeltaProduct mechanism. All training runs were carried out with modest computational resources. These preliminary findings indicate that TPTT may help adapt pretrained LLMs for long-context tasks with limited overhead. Further studies on larger models and a broader set of benchmarks will be necessary to evaluate the generality and robustness of the framework. Code is available at https://github.com/fabienfrfr/tptt . Python package at https://pypi.org/project/tptt/ .
中文:TPTT框架通过线性化注意力和内存门控增强预训练Transformer,可在有限资源下高效适应长文本任务,同时保持或提升模型性能。
English: The TPTT framework enhances pretrained Transformers with linearized attention and memory gating, enabling efficient adaptation for long-context tasks while maintaining or improving performance with minimal computational resources.
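Linearized attention of the kind LiZA builds on replaces softmax attention with a kernel feature map so that the cost grows linearly in sequence length. The sketch below shows the generic construction with the common elu(x) + 1 feature map; it is not the framework's exact operator or its memory-gating mechanism.

```python
# Minimal sketch of kernelized linear attention: with a positive feature map phi,
# out = phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1), which is linear in sequence length.
# Generic construction only, not TPTT's exact LiZA/MaG operators.
import torch

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim); phi(x) = elu(x) + 1 keeps features positive.
    phi_q, phi_k = torch.nn.functional.elu(q) + 1, torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", phi_k, v)             # (dim, dim) summary, O(n d^2)
    z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps
    return torch.einsum("bnd,bde->bne", phi_q, kv) / z.unsqueeze(-1)

q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([2, 1024, 64])
```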
Authors:Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, Jian Cheng
Abstract:
Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt, a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. The code and dataset are available at https://github.com/samwu-learn/Step.
中文摘要:本文提出Step-Opt-Instruct框架,通过迭代式问题生成和逐步验证机制为运筹学优化建模生成高质量训练数据,基于此开发的Step-Opt模型在多项基准测试中实现最优性能,尤其在复杂问题上取得17.01%的显著准确率提升。
English Summary: This paper introduces Step-Opt-Instruct, a framework that enhances optimization modeling for Operations Research by generating high-quality training data through iterative complexity escalation and stepwise validation, resulting in the Step-Opt model which achieves state-of-the-art performance with significant accuracy improvements on complex tasks.
Authors:Furong Peng, Jinzhen Gao, Xuan Lu, Kang Liu, Yifan Huo, Sheng Wang
Abstract:
Graph Convolutional Networks (GCNs) suffer from severe performance degradation in deep architectures due to over-smoothing. While existing studies primarily attribute the over-smoothing to repeated applications of graph Laplacian operators, our empirical analysis reveals a critical yet overlooked factor: trainable linear transformations in GCNs significantly exacerbate feature collapse, even at moderate depths (e.g., 8 layers). In contrast, Simplified Graph Convolution (SGC), which removes these transformations, maintains stable feature diversity up to 32 layers, highlighting linear transformations' dual role in facilitating expressive power and inducing over-smoothing. However, completely removing linear transformations weakens the model's expressive capacity. To address this trade-off, we propose Layer-wise Gradual Training (LGT), a novel training strategy that progressively builds deep GCNs while preserving their expressiveness. LGT integrates three complementary components: (1) layer-wise training to stabilize optimization from shallow to deep layers, (2) low-rank adaptation to fine-tune shallow layers and accelerate training, and (3) identity initialization to ensure smooth integration of new layers and accelerate convergence. Extensive experiments on benchmark datasets demonstrate that LGT achieves state-of-the-art performance on vanilla GCN, significantly improving accuracy even in 32-layer settings. Moreover, as a training method, LGT can be seamlessly combined with existing methods such as PairNorm and ContraNorm, further enhancing their performance in deeper networks. LGT offers a general, architecture-agnostic training framework for scalable deep GCNs. The code is available at [https://github.com/jfklasdfj/LGT_GCN].
中文: 深度图卷积网络因可训练线性变换加剧过平滑问题,而提出的逐层渐进训练方法通过渐进构建网络、低秩适应和恒等初始化,在保持表达能力的同时显著提升了深层网络的性能。
English: Deep Graph Convolutional Networks (GCNs) face performance degradation from over-smoothing, which is exacerbated by trainable linear transformations, but the proposed Layer-wise Gradual Training (LGT) method overcomes this by progressively building deep networks while maintaining expressiveness and achieving state-of-the-art results.
Authors:Yile Gu, Rohan Kadekodi, Hoang Nguyen, Keisuke Kamahori, Yiyu Liu, Baris Kasikci
Abstract:
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
中文: 本文提出ConsumerBench基准测试框架,通过在终端设备上模拟多应用并发场景来评估生成式AI性能,揭示了资源分配低效问题,并为开发人员提供了优化建议。
English: This paper introduces ConsumerBench, a benchmarking framework that evaluates GenAI performance on end-user devices under realistic multi-application scenarios, revealing resource inefficiencies and providing optimization insights for developers.
Authors:Zachary Ravichandran, Ignacio Hounie, Fernando Cladera, Alejandro Ribeiro, George J. Pappas, Vijay Kumar
Abstract:
Large language models (LLMs) provide robots with powerful contextual reasoning abilities and a natural human interface. Yet, current LLM-enabled robots typically depend on cloud-hosted models, limiting their usability in environments with unreliable communication infrastructure, such as outdoor or industrial settings. We present PRISM, a framework for distilling small language model (SLM)-enabled robot planners that run on-device with minimal human supervision. Starting from an existing LLM-enabled planner, PRISM automatically synthesizes diverse tasks and environments, elicits plans from the LLM, and uses this synthetic dataset to distill a compact SLM as a drop-in replacement of the source model. We apply PRISM to three LLM-enabled planners for mapping and exploration, manipulation, and household assistance, and we demonstrate that PRISM improves the performance of Llama-3.2-3B from 10-20% of GPT-4o's performance to over 93% - using only synthetic data. We further demonstrate that the distilled planners generalize across heterogeneous robotic platforms (ground and aerial) and diverse environments (indoor and outdoor). We release all software, trained models, and datasets at https://zacravichandran.github.io/PRISM.
Authors:Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, Kaidi Xu
Abstract:
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Codes will be available at https://github.com/jinhaoduan/UProp.
中文: 本文提出UProp框架,将大语言模型序列决策的不确定性分解为内在和外在两部分,在多步决策基准测试中显著优于现有方法。
English: This paper introduces UProp, a novel framework that decomposes LLM sequential decision uncertainty into intrinsic and extrinsic components, significantly outperforming existing methods in multi-step decision-making benchmarks.
Authors:Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, Norbert Tihanyi
Abstract:
Detecting AI-generated code, deepfakes, and other synthetic content is an emerging research challenge. As code generated by Large Language Models (LLMs) becomes more common, identifying the specific model behind each sample is increasingly important. This paper presents the first systematic study of LLM authorship attribution for C programs. We release CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture, discarding the decoder to focus on classification. Our model's encoder output (first token) is passed through a two-layer classification head with GELU activation and dropout, producing a probability distribution over possible authors. To evaluate our approach, we introduce LLM-AuthorBench, a benchmark of 32,000 compilable C programs generated by eight state-of-the-art LLMs across diverse tasks. We compare our model to seven traditional ML classifiers and eight fine-tuned transformer models, including BERT, RoBERTa, CodeBERT, ModernBERT, DistilBERT, DeBERTa-V3, Longformer, and LoRA-fine-tuned Qwen2-1.5B. In binary classification, our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models such as GPT-4.1 and GPT-4o, and 95.40% accuracy for multi-class attribution among five leading LLMs (Gemini 2.5 Flash, Claude 3.5 Haiku, GPT-4.1, Llama 3.3, and DeepSeek-V3). To support open science, we release the CodeT5-Authorship architecture, the LLM-AuthorBench benchmark, and all relevant Google Colab scripts on GitHub: https://github.com/LLMauthorbench/.
中文: 本文提出了CodeT5-Authorship模型,用于识别C程序的具体大语言模型作者,并发布了包含32,000个可编译C程序的LLM-AuthorBench基准,在二元和多元分类任务中均实现了高准确率。
English: This paper introduces CodeT5-Authorship, a novel model for identifying the specific LLM authors of C programs, and presents LLM-AuthorBench, a benchmark of 32,000 compilable C programs, achieving high accuracy in both binary and multi-class attribution tasks.
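The architecture described above, CodeT5 encoder layers only, first-token pooling, then a two-layer head with GELU and dropout, maps directly onto standard Hugging Face components. The hidden size of the head, the dropout rate, and the number of authors below are assumptions, and training code is omitted.

```python
# Minimal sketch of the described classifier: CodeT5 encoder only, first-token pooling,
# two-layer head with GELU and dropout. Head width, dropout rate, and author count are
# illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class CodeAuthorshipClassifier(nn.Module):
    def __init__(self, model_name="Salesforce/codet5-base", n_authors=8):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        hidden = self.encoder.config.d_model
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(hidden, n_authors),
        )

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(states[:, 0])           # first-token representation -> author logits

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = CodeAuthorshipClassifier()
batch = tok(["int main(void) { return 0; }"], return_tensors="pt")
print(torch.softmax(model(**batch), dim=-1))     # probability distribution over candidate LLMs
```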
Authors:Zhixiang Chi, Li Gu, Huan Liu, Ziqiang Wang, Yanan Wu, Yang Wang, Konstantinos N Plataniotis
Abstract:
Few-shot Test-Time Domain Adaptation focuses on adapting a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP's strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, solely depending on the feature space knowledge is constrained by CLIP's prior knowledge. Notably, when using a less robust backbone like ViT-B/16, performance significantly drops on challenging real-world benchmarks. Departing from state-of-the-art approaches that inherit the intrinsic OOD capability of CLIP, this work introduces learning directly on the input space to complement the dataset-specific knowledge for frozen CLIP. Specifically, an independent side branch is attached in parallel with CLIP and enforced to learn exclusive knowledge via revert attention. To better capture the dataset-specific label semantics for downstream adaptation, we propose to enhance the inter-dispersion among text features via greedy text ensemble and refinement. The text and visual features are then progressively fused in a domain-aware manner by a generated domain prompt to adapt toward a specific domain. Extensive experiments show our method's superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of +5.1 in F1 for iWildCam and +3.1% in WC Acc for FMoW.
中文摘要:本文提出了一种新方法,通过在CLIP冻结特征基础上增加输入空间学习分支,结合逆向注意力和改进的文本特征分散技术,显著提升了小样本测试时域自适应性能,在多个基准测试中取得突破性进展。
English Summary: This paper introduces a novel method that enhances few-shot test-time domain adaptation by learning directly in the input space alongside CLIP's frozen features, using a side branch with revert attention and improved text feature dispersion, achieving significant performance gains on challenging benchmarks.
Authors:Yijun Lin, Theresa Chen, Colby Brungard, Grunwald Sabine, Sue Ives, Matt Macander, Timm Nawrocki, Yao-Yi Chiang, Nic Jelinski
Abstract:
Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region's ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at https://github.com/knowledge-computing/Peatland-permafrost.
中文: 本研究提出的MISO视觉机器学习模型,在阿拉斯加高分辨率土壤制图中优于传统方法,为气候变化下的冻土监测和基础设施规划提供了更有效的解决方案。
English: This study introduces MISO, a vision-based machine learning model that outperforms traditional methods in generating high-resolution soil maps for Alaska, enabling better permafrost monitoring and infrastructure planning amid climate change.
Authors:Satyam Mishra, Phung Thao Vi, Shivam Mishra, Vishwanath Bijalwan, Vijay Bhaskar Semwal, Abdul Manan Khan
Abstract:
We introduce SafeRL-Lite, an open-source Python library for building reinforcement learning (RL) agents that are both constrained and explainable. Existing RL toolkits often lack native mechanisms for enforcing hard safety constraints or producing human-interpretable rationales for decisions. SafeRL-Lite provides modular wrappers around standard Gym environments and deep Q-learning agents to enable: (i) safety-aware training via constraint enforcement, and (ii) real-time post-hoc explanation via SHAP values and saliency maps. The library is lightweight, extensible, and installable via pip, and includes built-in metrics for constraint violations. We demonstrate its effectiveness on constrained variants of CartPole and provide visualizations that reveal both policy logic and safety adherence. The full codebase is available at: https://github.com/satyamcser/saferl-lite.
中文: SafeRL-Lite 是一个开源 Python 库,通过模块化封装和实时解释工具,能够开发具有内置安全约束和可解释性功能的强化学习智能体。
English: SafeRL-Lite is an open-source Python library that enables the development of reinforcement learning agents with built-in safety constraints and explainability features through modular wrappers and real-time explanation tools.
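To illustrate the kind of constraint-enforcing environment wrapper the library describes, here is a generic sketch built on gymnasium's standard Wrapper class. This is explicitly not SafeRL-Lite's actual API: the constraint function, penalty, and violation counter are illustrative assumptions.

```python
# Generic sketch of a constraint-enforcing Gym wrapper (NOT SafeRL-Lite's actual API).
# The constraint function, penalty, and violation counter are illustrative assumptions.
import gymnasium as gym

class ConstraintWrapper(gym.Wrapper):
    def __init__(self, env, constraint_fn, penalty=-10.0):
        super().__init__(env)
        self.constraint_fn = constraint_fn       # returns True when the observed state is safe
        self.penalty = penalty
        self.violations = 0                      # simple built-in violation metric

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if not self.constraint_fn(obs):
            self.violations += 1
            reward += self.penalty               # penalize (or one could terminate) on violation
            info["constraint_violated"] = True
        return obs, reward, terminated, truncated, info

# Toy usage: constrain CartPole's cart position to stay within |x| < 1.0.
env = ConstraintWrapper(gym.make("CartPole-v1"), constraint_fn=lambda obs: abs(obs[0]) < 1.0)
obs, _ = env.reset(seed=0)
for _ in range(100):
    obs, r, term, trunc, info = env.step(env.action_space.sample())
    if term or trunc:
        obs, _ = env.reset()
print("violations:", env.violations)
```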
Authors:Chenghan Li, Mingchen Li, Yipu Liao, Ruisheng Diao
Abstract:
Long-term time series prediction has predominantly relied on Transformer and MLP models, while the potential of convolutional networks in this domain remains underexplored. To address this gap, we introduce a novel multi-scale time series reshape module, which effectively captures the relationships among multi-period patches and variable dependencies. Building upon this module, we propose MS-TVNet, a multi-scale 3D dynamic convolutional neural network. Through comprehensive evaluations on diverse datasets, MS-TVNet demonstrates superior performance compared to baseline models, achieving state-of-the-art (SOTA) results in long-term time series prediction. Our findings highlight the effectiveness of leveraging convolutional networks for capturing complex temporal patterns, suggesting a promising direction for future research in this field. The code is released at https://github.com/Curyyfaust/TVNet.
Chinese: 本研究提出了MS-TVNet,一种多尺度三维动态卷积神经网络,通过有效捕捉多周期模式和变量依赖关系,在长期时间序列预测中实现了最先进的性能。
English: This study introduces MS-TVNet, a multi-scale 3D dynamic convolutional neural network that achieves state-of-the-art performance in long-term time series prediction by effectively capturing multi-period patterns and variable dependencies.
Authors:Fudong Lin, Jiadong Lou, Hao Wang, Brian Jalaian, Xu Yuan
Abstract:
Sparse attacks optimize the magnitude of adversarial perturbations to fool deep neural networks (DNNs) while perturbing only a few pixels (i.e., under the l0 constraint), making them suitable for interpreting the vulnerability of DNNs. However, existing solutions fail to yield interpretable adversarial examples due to their poor sparsity. Worse still, they often struggle with heavy computational overhead, poor transferability, and weak attack strength. In this paper, we aim to develop a sparse attack for understanding the vulnerability of CNNs by minimizing the magnitude of initial perturbations under the l0 constraint, to overcome the existing drawbacks while achieving a fast, transferable, and strong attack to DNNs. In particular, a novel and theoretically sound parameterization technique is introduced to approximate the NP-hard l0 optimization problem, making directly optimizing sparse perturbations computationally feasible. Besides, a novel loss function is designed to augment initial perturbations by maximizing the adversary property and minimizing the number of perturbed pixels simultaneously. Extensive experiments are conducted to demonstrate that our approach, with theoretical performance guarantees, outperforms state-of-the-art sparse attacks in terms of computational overhead, transferability, and attack strength, expecting to serve as a benchmark for evaluating the robustness of DNNs. In addition, theoretical and empirical results validate that our approach yields sparser adversarial examples, empowering us to discover two categories of noises, i.e., "obscuring noise" and "leading noise", which will help interpret how adversarial perturbation misleads the classifiers into incorrect predictions. Our code is available at https://github.com/fudong03/SparseAttack.
中文: 本文提出一种新型稀疏攻击方法,通过在l0约束下最小化初始扰动来高效生成高可解释性的对抗样本,在计算效率、可迁移性和攻击强度方面表现优异,同时揭示了两类噪声以解释深度神经网络的脆弱性。
English: This paper introduces a novel sparse attack method that efficiently generates highly interpretable adversarial examples by minimizing initial perturbations under l0 constraints, achieving superior computational efficiency, transferability, and attack strength while revealing two noise categories to explain DNN vulnerabilities.
Authors:Jianing He, Qi Zhang, Duoqian Miao, Yi Kun, Shufeng Hao, Hongyun Zhang, Zhihua Wei
Abstract:
Early exiting has demonstrated great potential in accelerating the inference of pre-trained language models (PLMs) by enabling easy samples to exit at shallow layers, eliminating the need for executing deeper layers. However, existing early exiting methods primarily rely on class-relevant logits to formulate their exiting signals for estimating prediction certainty, neglecting the detrimental influence of class-irrelevant information in the features on prediction certainty. This leads to an overestimation of prediction certainty, causing premature exiting of samples with incorrect early predictions. To remedy this, we define an NSP score to estimate prediction certainty by considering the proportion of class-irrelevant information in the features. On this basis, we propose a novel early exiting method based on the Certainty-Aware Probability (CAP) score, which integrates insights from both logits and the NSP score to enhance prediction certainty estimation, thus enabling more reliable exiting decisions. The experimental results on the GLUE benchmark show that our method can achieve an average speed-up ratio of 2.19x across all tasks with negligible performance degradation, surpassing the state-of-the-art (SOTA) ConsistentEE by 28%, yielding a better trade-off between task performance and inference efficiency. The code is available at https://github.com/He-Jianing/NSP.git.
中文摘要:本研究提出的基于确定性感知概率(CAP)的早期退出方法,通过结合类别相关logits和衡量类别无关信息的NSP评分来改进预训练语言模型的早期退出机制,在GLUE基准测试中实现了2.19倍平均加速且性能损失可忽略,以28%优势超越现有最优方法。
English Summary: The proposed Certainty-Aware Probability (CAP) method enhances early exiting in pre-trained language models by integrating class-relevant logits with a novel NSP score that measures class-irrelevant information, achieving a 2.19x average speed-up on GLUE tasks with minimal performance loss and outperforming prior methods by 28%.
Authors:Zequn Yang, Hongfa Wang, Di Hu
Abstract:
Interactions between modalities -- redundancy, uniqueness, and synergy -- collectively determine the composition of multimodal information. Understanding these interactions is crucial for analyzing information dynamics in multimodal systems, yet their accurate sample-level quantification presents significant theoretical and computational challenges. To address this, we introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable and measurable interaction. Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation, specifically tailored for sample-wise estimation in continuous distributions. Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling. The code is available at https://github.com/GeWu-Lab/LSMI_Estimator.
中文: LSMI估计器基于逐点信息理论,可精确量化样本级别的多模态交互作用(冗余性、独特性和协同性),实现了细粒度数据分析,并支持知识蒸馏和模型集成等实际应用。
English: The LSMI estimator is introduced to accurately quantify sample-level multimodal interactions—redundancy, uniqueness, and synergy—using pointwise information theory, enabling fine-grained analysis and practical applications like knowledge distillation and model ensembling.
Authors:Yichen Luo, Jia Wang, Dapeng Lan, Yu Liu, Zhibo Pang
Abstract:
Partial Differential Equations (PDEs) are fundamental for modeling physical systems, yet solving them in a generic and efficient manner using machine learning-based approaches remains challenging due to limited multi-input and multi-scale generalization capabilities, as well as high computational costs. This paper proposes the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework designed to address the above challenges. MMET decouples mesh and query points as two sequences and feeds them into the encoder and decoder, respectively, and uses a Gated Condition Embedding (GCE) layer to embed input variables or functions with varying dimensions, enabling effective solutions for multi-scale and multi-input problems. Additionally, a Hilbert curve-based reserialization and patch embedding mechanism decrease the input length. This significantly reduces the computational cost when dealing with large-scale geometric models. These innovations enable efficient representations and support multi-scale resolution queries for large-scale and multi-input PDE problems. Experimental evaluations on diverse benchmarks spanning different physical fields demonstrate that MMET outperforms SOTA methods in both accuracy and computational efficiency. This work highlights the potential of MMET as a robust and scalable solution for real-time PDE solving in engineering and physics-based applications, paving the way for future explorations into pre-trained large-scale models in specific domains. This work is open-sourced at https://github.com/YichenLuo-0/MMET.
中文: 本文提出多输入多尺度高效变换器(MMET),通过创新的序列解耦和嵌入技术克服了偏微分方程求解中多尺度泛化和计算效率的局限,在多个基准测试中展现出卓越的精度与效率优势。
English: This paper introduces the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework that overcomes limitations in multi-scale generalization and computational efficiency for solving Partial Differential Equations (PDEs) through innovative sequence decoupling and embedding techniques, demonstrating superior accuracy and efficiency across diverse benchmarks.
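To make the mesh/query decoupling concrete, the sketch below shows how two coordinate sequences could be fed to a standard Transformer encoder and decoder in PyTorch. It is a minimal illustration only: the layer sizes, the mesh/query embedding projections, and the scalar output head are assumptions, and MMET's Gated Condition Embedding and Hilbert-curve patching are omitted.

```python
import torch
import torch.nn as nn

class MeshQueryTransformer(nn.Module):
    # Minimal sketch of the encoder/decoder decoupling described for MMET:
    # mesh points go through the encoder, query points attend to them in the decoder.
    def __init__(self, d_model=128, nhead=4, layers=2):
        super().__init__()
        self.mesh_embed = nn.Linear(3, d_model)    # mesh point coordinates -> tokens
        self.query_embed = nn.Linear(3, d_model)   # query point coordinates -> tokens
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.head = nn.Linear(d_model, 1)          # predicted field value at each query point

    def forward(self, mesh_xyz, query_xyz):        # (B, N_mesh, 3), (B, N_query, 3)
        memory = self.encoder(self.mesh_embed(mesh_xyz))
        out = self.decoder(self.query_embed(query_xyz), memory)
        return self.head(out)                      # (B, N_query, 1)
```

Because queries only enter through the decoder, the resolution of the output can be changed at inference time without re-encoding the mesh.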
Authors:Yukun Huang, Yanning Zhou, Jianan Wang, Kaiyi Huang, Xihui Liu
Abstract:
3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.
Authors:Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel
Abstract:
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
Chinese: 本文针对科学发现中现有强化学习方法的不足,引入了一种处理代理奖励不确定性的稳健算子,提出了一种新算法,能在组合搜索空间中生成更高质量和更多样化的候选对象。
English: This paper addresses the limitations of existing reinforcement learning methods in scientific discovery by introducing a robust operator that handles uncertainty in proxy rewards, leading to a novel algorithm that produces higher-quality and more diverse candidates in combinatorial search spaces.
Authors:Semin Kim, Yeonwoo Cha, Jaehoon Yoo, Seunghoon Hong
Abstract:
We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptimal performance when applied to new prompt engineering scenarios involving different reward models. To address this limitation, we introduce RATTPO (Reward-Agnostic Test-Time Prompt Optimization), a flexible test-time optimization method applicable across various reward scenarios without modification. RATTPO iteratively searches for optimized prompts by querying large language models (LLMs) without requiring reward-specific task descriptions. Instead, it uses the optimization trajectory and a novel reward-aware feedback signal (termed a "hint") as context. Empirical results demonstrate the versatility of RATTPO, effectively enhancing user prompts across diverse reward setups that assess various generation aspects, such as aesthetics, general human preference, or spatial relationships between objects. RATTPO surpasses other test-time search baselines in search efficiency, using up to 3.5 times less inference budget, and, given sufficient inference budget, achieves performance comparable to learning-based baselines that require reward-specific fine-tuning. The code is available at https://github.com/seminkim/RATTPO.
中文摘要:本文提出RATTPO方法,通过大语言模型迭代优化提示并结合奖励感知反馈,无需特定任务描述即可适应不同奖励场景,在文本到图像生成中实现了高效通用的提示优化,显著超越现有测试时搜索基线。
English Summary: This paper introduces RATTPO, a reward-agnostic test-time prompt optimization method that enhances text-to-image generation by iteratively refining prompts using LLMs and reward-aware feedback, achieving superior efficiency and adaptability across diverse reward scenarios without requiring task-specific modifications.
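The abstract describes an iterative, reward-agnostic search loop; a minimal sketch of such a loop is given below. All callables (llm_propose, generate_image, reward_fn, make_hint) are hypothetical placeholders standing in for the LLM proposer, the T2I model, the black-box reward, and the "hint" construction, and the actual RATTPO procedure may differ in how it selects candidates and formats the context.

```python
def prompt_search(user_prompt, llm_propose, generate_image, reward_fn, make_hint,
                  steps=20, k=4):
    # Generic test-time prompt-search loop in the spirit of RATTPO (sketch only).
    # llm_propose(trajectory, hint) -> list of candidate prompts (an LLM call, assumption)
    # generate_image(prompt) -> image; reward_fn(image) -> float (any black-box reward)
    # make_hint(trajectory) -> str, a reward-aware feedback signal (assumption)
    trajectory = [(user_prompt, reward_fn(generate_image(user_prompt)))]
    best_prompt, best_reward = trajectory[0]
    for _ in range(steps):
        hint = make_hint(trajectory)                      # reward-aware feedback
        for cand in llm_propose(trajectory, hint)[:k]:    # query the LLM for refinements
            r = reward_fn(generate_image(cand))
            trajectory.append((cand, r))
            if r > best_reward:
                best_prompt, best_reward = cand, r
    return best_prompt
```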
Authors:Zeyneddin Oz, Shreyas Korde, Marius Bock, Kristof Van Laerhoven
Abstract:
The rapid evolution of sensors and resource-efficient machine learning models has spurred the widespread adoption of wearable fitness tracking devices. Equipped with inertial sensors, such devices can continuously capture physical movements for fitness technology (FitTech), enabling applications from sports optimization to preventive healthcare. Traditional Centralized Learning approaches to detect fitness activities struggle with data privacy concerns, regulatory restrictions, and communication inefficiencies. In contrast, Federated Learning (FL) enables decentralized model training by communicating model updates rather than potentially private wearable sensor data. Applying FL to FitTech presents unique challenges, such as data imbalance, lack of labeled data, heterogeneous user activities, and trade-offs between personalization and generalization. To simplify research on FitTech in FL, we present the FedFitTech baseline, under the Flower framework, which is publicly available and widely used by both industry and academic researchers. Additionally, to illustrate its usage, this paper presents a case study that implements a system based on the FedFitTech baseline, incorporating a client-side early stopping strategy and comparing the results. For instance, this system allows wearable devices to optimize the trade-off between capturing common fitness activities and preserving individuals' nuances, thereby enhancing both the scalability and efficiency of privacy-aware fitness tracking applications. The results show that this reduces overall redundant communication by 13%, while maintaining overall recognition performance at a negligible recognition cost of 1%. Thus, the FedFitTech baseline creates a foundation for a wide range of new research and development opportunities in FitTech, and it is available as open source at: https://github.com/shreyaskorde16/FedFitTech
中文摘要:联邦学习通过去中心化模型训练解决可穿戴健身设备的数据隐私和通信效率问题,FedFitTech基准在保持识别性能基本不变的同时,将通信开销降低了13%。
English Summary: Federated Learning addresses privacy and efficiency issues in wearable fitness tracking by enabling decentralized model training, with the FedFitTech baseline reducing communication overhead by 13% while maintaining near-identical recognition performance.
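As an illustration of a client-side early-stopping strategy under the Flower framework, the sketch below skips local training (and the corresponding redundant upload) once a client's local loss stops improving. The train_one_round/evaluate_local helpers, the patience threshold, and the returned metrics are assumptions, not the FedFitTech baseline's actual implementation.

```python
import numpy as np
import torch
import flwr as fl

class EarlyStoppingClient(fl.client.NumPyClient):
    # Sketch of a client-side early-stopping strategy on top of Flower (assumptions noted above).
    def __init__(self, model, train_one_round, evaluate_local, patience=3):
        self.model = model
        self.train_one_round = train_one_round      # runs one local training round
        self.evaluate_local = evaluate_local        # returns a local validation loss
        self.best_loss, self.bad_rounds, self.patience = float("inf"), 0, patience
        self.stopped = False

    def get_parameters(self, config):
        return [p.detach().cpu().numpy() for p in self.model.parameters()]

    def _set_parameters(self, parameters):
        for p, new in zip(self.model.parameters(), parameters):
            p.data = torch.from_numpy(np.asarray(new)).to(p.device, p.dtype)

    def fit(self, parameters, config):
        self._set_parameters(parameters)
        if not self.stopped:
            self.train_one_round(self.model)
            loss = self.evaluate_local(self.model)
            if loss < self.best_loss:
                self.best_loss, self.bad_rounds = loss, 0
            else:
                self.bad_rounds += 1
                self.stopped = self.bad_rounds >= self.patience   # stop redundant rounds
        # A stopped client reports zero examples, so its unchanged update
        # no longer contributes to (or costs) federated aggregation.
        return self.get_parameters(config), 0 if self.stopped else 1, {"stopped": self.stopped}
```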
Authors:Kosuke Nakanishi, Akihiro Kubo, Yuji Yasui, Shin Ishii
Abstract:
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
Chinese: 本文提出了一种新颖的离线策略强化学习方法,通过将对抗学习重构为软约束优化问题,利用策略评估的对称性消除了额外环境交互的需求。
English: This paper introduces a novel off-policy reinforcement learning method that reformulates adversarial learning as a soft-constrained optimization problem, eliminating the need for extra environmental interactions by leveraging symmetric policy evaluation properties.
Authors:Guang Yin, Yitong Li, Yixuan Wang, Dale McConachie, Paarth Shah, Kunimatsu Hashimoto, Huan Zhang, Katherine Liu, Yunzhu Li
Abstract:
Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction "Hang a mug on the mug tree" may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.
Authors:Haotian Yin, Aleksander Plocharski, Michal Jan Wlodarczyk, Mikolaj Kida, Przemyslaw Musialski
Abstract:
Neural signed-distance fields (SDFs) are a versatile backbone for neural geometry representation, but enforcing CAD-style developability usually requires Gaussian-curvature penalties with full Hessian evaluation and second-order differentiation, which are costly in memory and time. We introduce an off-diagonal Weingarten loss that regularizes only the mixed shape operator term that represents the gap between principal curvatures and flattens the surface. We present two variants: a finite-difference version using six SDF evaluations plus one gradient, and an auto-diff version using a single Hessian-vector product. Both converge to the exact mixed term and preserve the intended geometric properties without assembling the full Hessian. On the ABC benchmarks the losses match or exceed Hessian-based baselines while cutting GPU memory and training time by roughly a factor of two. The method is drop-in and framework-agnostic, enabling scalable curvature-aware SDF learning for engineering-grade shape reconstruction. Our code is available at https://flatcad.github.io/.
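A minimal PyTorch sketch of the auto-diff variant is shown below: it penalizes only the mixed term t1ᵀ H t2 of the shape operator via a single Hessian-vector product, without assembling the full Hessian. The tangent-frame construction and normalization here are assumptions; the paper's exact formulation (and its six-evaluation finite-difference variant) may differ.

```python
import torch
import torch.nn.functional as F

def offdiag_weingarten_loss(sdf, x, eps=1e-8):
    # x: (N, 3) sample points near the surface; sdf(x) -> (N,) signed distances.
    x = x.requires_grad_(True)
    f = sdf(x)
    g = torch.autograd.grad(f.sum(), x, create_graph=True)[0]        # SDF gradients (N, 3)
    n = g / (g.norm(dim=-1, keepdim=True) + eps)                     # unit normals
    # Build an orthonormal tangent frame (t1, t2); treated as constant for the loss gradient.
    a = torch.tensor([0.0, 0.0, 1.0], device=x.device).expand_as(n)
    alt = torch.tensor([1.0, 0.0, 0.0], device=x.device).expand_as(n)
    a = torch.where(n[..., 2:3].abs() > 0.9, alt, a)                 # avoid degenerate cross products
    t1 = F.normalize(torch.cross(n, a, dim=-1), dim=-1)
    t2 = torch.cross(n, t1, dim=-1)
    # Single Hessian-vector product: H t2 = d/dx (g . t2), with t2 detached.
    Ht2 = torch.autograd.grad((g * t2.detach()).sum(), x, create_graph=True)[0]
    mixed = (t1.detach() * Ht2).sum(-1)                              # off-diagonal term t1ᵀ H t2
    return (mixed ** 2).mean()
```

Driving this mixed term to zero aligns the tangent frame with the principal directions and flattens one of them, which is the developability bias the loss is meant to encode.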
Authors:Tara Akhound-Sadegh, Jungyoon Lee, Avishek Joey Bose, Valentin De Bortoli, Arnaud Doucet, Michael M. Bronstein, Dominique Beaini, Siamak Ravanbakhsh, Kirill Neklyudov, Alexander Tong
Abstract:
Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations. Code available at: https://github.com/taraak/pita
中文: 本文提出渐进推理时间退火(PITA)新框架,通过结合温度退火与扩散平滑技术,首次实现了对复杂分子系统的高效平衡采样,并大幅降低了计算成本。
English: The paper introduces Progressive Inference-Time Annealing (PITA), a novel framework that combines temperature annealing and diffusion smoothing to enable efficient equilibrium sampling of complex molecular systems with significantly reduced computational cost.
Authors:Zhiyuan Liang, Dongwen Tang, Yuhao Zhou, Xuanlei Zhao, Mingjia Shi, Wangbo Zhao, Zekai Li, Peihao Wang, Konstantin Schürholt, Damian Borth, Michael M. Bronstein, Yang You, Zhangyang Wang, Kai Wang
Abstract:
Modern Parameter-Efficient Fine-Tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce the cost of customizing large language models (LLMs), yet still require a separate optimization run for every downstream dataset. We introduce Drag-and-Drop LLMs (DnD), a prompt-conditioned parameter generator that eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices. Once trained on a diverse collection of prompt-checkpoint pairs, DnD produces task-specific parameters in seconds, yielding i) up to 12,000× lower overhead than full fine-tuning, ii) average gains of up to 30% in performance over the strongest training LoRAs on unseen common-sense reasoning, math, coding, and multimodal benchmarks, and iii) robust cross-domain generalization despite never seeing the target data or labels. Our results demonstrate that prompt-conditioned parameter generation is a viable alternative to gradient-based adaptation for rapidly specializing LLMs. Our project is available at https://jerryliang24.github.io/DnD.
中文: 提出的拖拽式大语言模型(DnD)通过从任务提示直接生成LoRA参数,无需逐任务训练,比全参数微调快12,000倍,并在推理、数学和编程任务中实现性能提升。
English: The proposed Drag-and-Drop LLMs (DnD) method eliminates per-task training by generating LoRA parameters directly from task prompts, achieving up to 12,000× faster adaptation than fine-tuning while improving performance across reasoning, math, and coding tasks.
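To illustrate the idea of generating LoRA weights from a condition embedding, the sketch below maps a pooled prompt embedding to the low-rank factors of a single target layer. Dimensions, the MLP decoder, and the text encoder are assumptions; the paper uses a cascaded hyper-convolutional decoder that emits LoRA matrices for many layers at once.

```python
import torch
import torch.nn as nn

class LoRAGenerator(nn.Module):
    # Minimal sketch of a prompt-conditioned LoRA generator in the spirit of DnD.
    def __init__(self, cond_dim=768, d_in=4096, d_out=4096, rank=8):
        super().__init__()
        self.to_A = nn.Sequential(nn.Linear(cond_dim, 1024), nn.GELU(), nn.Linear(1024, rank * d_in))
        self.to_B = nn.Sequential(nn.Linear(cond_dim, 1024), nn.GELU(), nn.Linear(1024, d_out * rank))
        self.rank, self.d_in, self.d_out = rank, d_in, d_out

    def forward(self, cond):                      # cond: (cond_dim,) pooled prompt-batch embedding
        A = self.to_A(cond).view(self.rank, self.d_in)
        B = self.to_B(cond).view(self.d_out, self.rank)
        return B, A                               # weight update for the layer: delta_W = B @ A
```

At inference, a single forward pass through such a generator replaces a full per-task LoRA optimization run, which is where the reported overhead reduction comes from.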
Authors:Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
Abstract:
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released at https://github.com/AI45Lab/IS-Bench.
中文:IS-Bench作为首个评估具身智能体交互安全性的多模态基准,通过对主流视觉语言模型的广泛测试,揭示了当前模型缺乏风险意识以及安全导向推理存在的任务完成度妥协问题。
English: IS-Bench is introduced as the first multimodal benchmark for evaluating interactive safety in embodied agents, revealing current models' lack of risk awareness and the trade-offs of safety-focused reasoning through extensive testing on leading VLMs.
Authors:Nikola Jovanović, Ismail Labiad, Tomáš Souček, Martin Vechev, Pierre Fernandez
Abstract:
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values.
中文: 本研究首次提出了针对自回归图像生成模型的令牌级水印方法,通过定制化分词器微调和水印同步层解决了逆向循环一致性问题,有效提升了水印对图像变换与攻击的鲁棒性。
English: This study introduces the first token-level watermarking method for autoregressive image generation models, addressing the challenge of reverse cycle-consistency and enhancing robustness against image transformations and attacks through custom tokenizer finetuning and a watermark synchronization layer.
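For context, the sketch below shows the classic green-list logit-biasing scheme from language-model watermarking that the paper adapts to image tokens: the previous token seeds a pseudo-random partition of the vocabulary, and "green" tokens receive a logit boost that a detector can later count. This is the generic technique only; the paper's tokenizer-detokenizer finetuning for reverse cycle-consistency and its watermark synchronization layer are not represented here.

```python
import torch

def greenlist_biased_sampling(logits, prev_token, vocab_size, gamma=0.5, delta=2.0):
    # logits: (..., vocab_size) next-token logits of an autoregressive (image-)token model.
    # gamma: fraction of the vocabulary marked "green"; delta: logit boost for green tokens.
    g = torch.Generator(device="cpu").manual_seed(int(prev_token))   # seed from previous token
    perm = torch.randperm(vocab_size, generator=g)
    green = perm[: int(gamma * vocab_size)].to(logits.device)        # pseudo-random green list
    biased = logits.clone()
    biased[..., green] += delta                                       # bias sampling toward green tokens
    return torch.distributions.Categorical(logits=biased).sample()
```

Detection re-derives the same green lists and tests whether the observed fraction of green tokens is improbably high, which is what yields the theoretically grounded p-values mentioned in the abstract.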
Authors:Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer
Abstract:
Amyotrophic Lateral Sclerosis (ALS) is a rare neurodegenerative disease, and high-quality EEG data from ALS patients are scarce. This data scarcity, coupled with severe class imbalance between ALS and healthy control recordings, poses a challenge for training reliable machine learning classifiers. In this work, we address these issues by generating synthetic EEG signals for ALS patients using a Conditional Wasserstein Generative Adversarial Network (CWGAN). We train CWGAN on a private EEG dataset (ALS vs. non-ALS) to learn the distribution of ALS EEG signals and produce realistic synthetic samples. We preprocess and normalize EEG recordings, and train a CWGAN model to generate synthetic ALS signals. The CWGAN architecture and training routine are detailed, with key hyperparameters chosen for stable training. Qualitative evaluation of generated signals shows that they closely mimic real ALS EEG patterns. The CWGAN training converged with generator and discriminator loss curves stabilizing, indicating successful learning. The synthetic EEG signals appear realistic and have potential use as augmented data for training classifiers, helping to mitigate class imbalance and improve ALS detection accuracy. We discuss how this approach can facilitate data sharing and enhance diagnostic models.
中文: 本研究采用条件Wasserstein生成对抗网络为ALS患者生成逼真的合成脑电信号,通过缓解数据稀缺和类别不平衡问题来提升ALS检测分类器的训练效果。
English: This study uses a Conditional Wasserstein Generative Adversarial Network to generate realistic synthetic EEG signals for ALS patients, addressing data scarcity and class imbalance to improve machine learning classifier training for ALS detection.
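Below is a minimal sketch of a conditional WGAN critic update with gradient penalty (WGAN-GP), the kind of objective commonly used to stabilize this type of training; the abstract does not specify the architecture or hyperparameters, so the shapes, the critic signature, and gp_weight here are assumptions.

```python
import torch

def wgan_gp_critic_loss(critic, real, fake, labels, gp_weight=10.0):
    # real/fake: (B, C, T) EEG windows; labels: (B,) class condition (e.g., ALS vs. control).
    d_real = critic(real, labels).mean()
    d_fake = critic(fake, labels).mean()
    # Gradient penalty on random interpolates between real and generated samples.
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp, labels).sum(), interp, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(dim=1) - 1.0) ** 2).mean()
    return d_fake - d_real + gp_weight * gp   # critic minimizes this (Wasserstein estimate + penalty)
```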
Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
Abstract:
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
中文: 本研究提出了用于评估多模态大语言模型后训练方法的基准SEED-Bench-R1,并开发了GRPO-CARE一致性感知强化学习框架,该框架在无需显式监督的情况下显著提升了答案准确性和推理连贯性,实现了明显的性能改进。
English: This study introduces SEED-Bench-R1, a benchmark for evaluating multimodal large language models' post-training methods, and proposes GRPO-CARE, a consistency-aware reinforcement learning framework that enhances both answer accuracy and reasoning coherence, achieving significant performance gains without explicit supervision.
Authors:Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
Abstract:
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Codes and results are available at https://github.com/Carol-gutianle/LatentSafety.
中文摘要:现有AI模型的安全对齐方法过于浅层,易受潜在偏移影响而引发不安全响应,但提出的分层对抗补丁训练(LAPT)通过针对性优化内部表征,有效提升了安全鲁棒性。
English Summary: Current safety alignment methods for AI models are shallow, making them vulnerable to latent shifts that can trigger unsafe responses, but the proposed Layer-wise Adversarial Patch Training (LAPT) enhances robustness by addressing these internal weaknesses.
Authors:Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò
Abstract:
Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: a novel position encoding system is adopted to capture time patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and a 6.06X memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These breakthroughs establish new benchmarks for efficient and precise time series modeling. Implementations of our method and all baselines in the hierarchical autoregressive mechanism are available at https://github.com/lizzyhku/Autotime.
中文摘要:AutoHFormer是一种分层自回归变换器,通过多尺度时序建模、动态窗口注意力和自适应编码技术,在保持严格因果性的同时实现了高效准确的时间序列预测。
English Summary: AutoHFormer is a hierarchical autoregressive transformer that achieves efficient and accurate time series forecasting through multi-scale temporal modeling, dynamic windowed attention, and adaptive encoding while maintaining strict causality.
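The dynamic windowed attention can be pictured as causal attention whose scores are penalized by a learnable, per-head exponential decay over temporal distance; a minimal sketch is below. The exact parameterization of AutoHFormer's learnable causal windows may differ, and the segment-level parallelism and adaptive encoding are not shown.

```python
import math
import torch

def decayed_causal_attention(q, k, v, decay_rate):
    # q, k, v: (B, H, T, D); decay_rate: (H,) positive learnable rates (e.g., via softplus).
    B, H, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(D)              # (B, H, T, T)
    idx = torch.arange(T, device=q.device)
    dist = (idx[:, None] - idx[None, :]).clamp(min=0).float()    # distance to past positions
    scores = scores - decay_rate.view(1, H, 1, 1) * dist         # exponential decay in log-space
    future = idx[None, :] > idx[:, None]                         # mask future positions (causality)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Larger decay rates shrink the effective attention window, so the model can trade long-range context against computation per head.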
Authors:Vinicius Yuiti Fukase, Heitor Gama, Barbara Bueno, Lucas Libanio, Anna Helena Reali Costa, Artur Jordao
Abstract:
Critical Learning Periods comprehend an important phenomenon involving deep learning, where early epochs play a decisive role in the success of many training recipes, such as data augmentation. Existing works confirm the existence of this phenomenon and provide useful insights. However, the literature lacks efforts to precisely identify when critical periods occur. In this work, we fill this gap by introducing a systematic approach for identifying critical periods during the training of deep neural networks, focusing on eliminating computationally intensive regularization techniques and effectively applying mechanisms for reducing computational costs, such as data pruning. Our method leverages generalization prediction mechanisms to pinpoint critical phases where training recipes yield maximum benefits to the predictive ability of models. By halting resource-intensive recipes beyond these periods, we significantly accelerate the learning phase and achieve reductions in training time, energy consumption, and CO$_2$ emissions. Experiments on standard architectures and benchmarks confirm the effectiveness of our method. Specifically, we achieve significant milestones by reducing the training time of popular architectures by up to 59.67%, leading to a 59.47% decrease in CO$_2$ emissions and a 60% reduction in financial costs, without compromising performance. Our work enhances understanding of training dynamics and paves the way for more sustainable and efficient deep learning practices, particularly in resource-constrained environments. In the era of the race for foundation models, we believe our method emerges as a valuable framework. The repository is available at https://github.com/baunilhamarga/critical-periods
中文: 本研究提出了一种系统性方法,用于识别深度神经网络训练中的关键学习期,从而可在这些阶段后停止资源密集型技术,实现训练时间减少高达59.67%、二氧化碳排放降低59.47%、成本下降60%,且不影响模型性能。
English: This study introduces a systematic method to identify critical learning periods in deep neural network training, enabling the cessation of resource-intensive techniques after these phases to reduce training time by up to 59.67%, cut CO₂ emissions by 59.47%, and lower costs by 60% without performance loss.
Authors:Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou
Abstract:
Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
中文:Fractional Reasoning是一种无需训练的框架,通过在推理时缩放潜在引导向量来实现对推理强度的连续控制,从而在多种推理任务中同步提升广度策略和深度策略的测试时计算效果。
English: Fractional Reasoning is a training-free framework that enables continuous control over reasoning intensity during inference by scaling latent steering vectors, improving both breadth-based and depth-based test-time compute strategies across various reasoning tasks.
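Mechanically, reapplying a latent steering vector with a tunable scale can be done with a forward hook on a chosen decoder layer; a minimal sketch (assuming a HuggingFace-style module whose output is a tuple with hidden states first) is shown below. How the steering vector is extracted and which layers are steered are details the abstract does not specify, so this is an illustration rather than the paper's implementation.

```python
import torch

def apply_steering(hidden, steer_vec, alpha):
    # hidden: (B, T, D) residual-stream activations; steer_vec: (D,) "deeper reasoning" direction.
    # alpha: scalar reasoning intensity; alpha = 0 recovers the unmodified model.
    return hidden + alpha * steer_vec

def make_steering_hook(steer_vec, alpha):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = apply_steering(h, steer_vec.to(h.device, h.dtype), alpha)
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Usage sketch: handle = model.model.layers[20].register_forward_hook(make_steering_hook(v, 0.7))
# Remove with handle.remove() once generation is done.
```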
Authors:Wangzhi Zhan, Jianpeng Chen, Dongqi Fu, Dawei Zhou
Abstract:
Metamaterials are artificial materials designed to achieve properties not found in nature, such as ultra-stiffness and negative material indices. In mechanical metamaterial design, three key modalities are typically involved, i.e., 3D topology, density condition, and mechanical property. Complex real-world application scenarios place demanding requirements on machine learning models to consider all three modalities together. However, a comprehensive literature review indicates that most existing works only consider two modalities, e.g., predicting mechanical properties given the 3D topology or generating the 3D topology given the required properties. Therefore, a significant gap remains: no state-of-the-art machine learning model captures all three modalities jointly. Hence, we propose a unified model named UNIMATE, which consists of a modality alignment module and a synergetic diffusion generation module. Experiments indicate that UNIMATE outperforms the other baseline models in the topology generation, property prediction, and condition confirmation tasks by up to 80.2%, 5.1%, and 50.2%, respectively. We open-source our proposed UNIMATE model and corresponding results at https://github.com/wzhan24/UniMate.
Chinese: UNIMATE作为一种统一的机器学习模型,通过同时处理3D拓扑、密度条件和力学性能,填补了机械超材料设计领域的空白,在生成和预测任务中性能超越基线模型高达80.2%。
English: UNIMATE is a unified machine learning model that addresses the gap in mechanical metamaterial design by simultaneously handling 3D topology, density condition, and mechanical property, outperforming baseline models by up to 80.2% in generation and prediction tasks.
Authors:Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi
Abstract:
Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domains for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM's varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
中文: Bohdi是一个仅使用合成数据的异构大语言模型融合框架,通过自动领域探索和自适应数据采样克服了现有方法的局限,显著提升了目标模型的性能和数据效率,同时消除了能力不平衡问题。
English: Bohdi is a synthetic-data-only heterogeneous LLM fusion framework that overcomes limitations of existing methods by enabling automatic domain exploration and adaptive data sampling, significantly enhancing the target LLM's performance and data efficiency while eliminating capability imbalance.
Authors:Hanyu Pei, Jing-Xiao Liao, Qibin Zhao, Ting Gao, Shijun Zhang, Xiaoge Zhang, Feng-Lei Fan
Abstract:
Drawing inspiration from the human brain, which assigns different neurons to different tasks, recent advances in deep learning have explored modifying a network's neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons. Along this direction, this work replaces symbolic regression with tensor decomposition (TD) to discover optimal neuronal formulations, offering enhanced stability and faster convergence. Furthermore, we establish theoretical guarantees that modifying the aggregation functions with common activation functions can empower a network with a fixed number of parameters to approximate any continuous function with an arbitrarily small error, providing a rigorous mathematical foundation for the NeuronSeek framework. Extensive empirical evaluations demonstrate that our NeuronSeek-TD framework not only achieves superior stability but is also competitive with state-of-the-art models across diverse benchmarks. The code is available at https://github.com/HanyuPei22/NeuronSeek.
中文摘要:本研究通过用张量分解替代符号回归来优化神经元结构,增强了NeuronSeek框架的稳定性和收敛速度,在多个基准测试中展现出竞争优势,同时为通用近似能力提供了理论保证。
English Summary: This study enhances the NeuronSeek framework by replacing symbolic regression with tensor decomposition to optimize neuron formulations, achieving improved stability, faster convergence, and competitive performance across various benchmarks while providing theoretical guarantees for universal approximation.
Authors:Siru Ouyang, Xinyu Zhu, Zilin Xiao, Minhao Jiang, Yu Meng, Jiawei Han
Abstract:
Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs), as evidenced by recent successes such as OpenAI's o1 and Deepseek-R1. However, applying RL at scale remains intimidatingly resource-intensive, requiring multiple model copies and extensive GPU workloads. On the other hand, while being powerful, recent studies suggest that RL does not fundamentally endow models with new knowledge; rather, it primarily reshapes the model's output distribution to activate reasoning capabilities latent in the base model. Building on this insight, we hypothesize that the changes in output probabilities induced by RL are largely model-size invariant, opening the door to a more efficient paradigm: training a small model with RL and transferring its induced probability shifts to larger base models. To verify our hypothesis, we conduct a token-level analysis of decoding trajectories and find high alignment in RL-induced output distributions across model scales, validating our hypothesis. Motivated by this, we propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models. Experiments across multiple mathematical reasoning benchmarks show that RAST substantially and consistently enhances the reasoning capabilities of base models while requiring significantly lower GPU memory than direct RL training, sometimes even yielding better performance than the RL-trained counterparts. Our findings offer new insights into the nature of RL-driven reasoning and practical strategies for scaling its benefits without incurring its full computational cost. The project page of RAST is available at https://ozyyshr.github.io/RAST/.
中文摘要:强化学习通过重塑输出分布而非增加新知识来提升大语言模型的推理能力,而RAST方法可将从小模型习得的概率调整迁移至大模型,从而以更低资源成本显著增强推理性能。
English Summary: Reinforcement learning enhances LLMs' reasoning by reshaping output distributions rather than adding new knowledge, and the proposed RAST method efficiently transfers these adjustments from small to large models to boost performance with lower resource costs.
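One plausible reading of "injecting RL-induced probability adjustments" is a logit-delta transfer at decoding time, similar in spirit to proxy tuning; a minimal sketch is below. It assumes all three models share a tokenizer and expose HuggingFace-style .logits outputs, and RAST's actual transfer rule may differ.

```python
import torch

@torch.no_grad()
def transferred_next_token_logits(large_base, small_base, small_rl, input_ids):
    # Transfer the RL-induced shift of a small model pair onto a large base model
    # (sketch under the shared-vocabulary assumption stated above).
    z_large = large_base(input_ids).logits[:, -1]      # large base model
    z_small = small_base(input_ids).logits[:, -1]      # small base model
    z_small_rl = small_rl(input_ids).logits[:, -1]     # small RL-trained model
    return z_large + (z_small_rl - z_small)            # inject the RL-induced adjustment
```

Only the small pair ever sees RL training, so the GPU memory cost of RL stays at the small-model scale while decoding uses the large model's knowledge.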
Authors:Guoqing Chao, Zhenghao Zhang, Lei Meng, Jie Wen, Dianhui Chu
Abstract:
Federated multi-view clustering has been proposed to mine the valuable information within multi-view data distributed across different devices and has achieved impressive results while preserving privacy. Despite this progress, most federated multi-view clustering methods only use global pseudo-labels to guide the downstream clustering process and fail to exploit global information when extracting features. In addition, the missing-data problem in federated multi-view clustering remains underexplored. To address these problems, we propose a novel Federated Incomplete Multi-view Clustering method with globally Fused Graph guidance (FIMCFG). Specifically, we design a dual-head graph convolutional encoder at each client to extract two kinds of underlying features containing global and view-specific information. Subsequently, under the guidance of the fused graph, the two underlying features are fused into high-level features, based on which clustering is conducted under the supervision of pseudo-labeling. Finally, the high-level features are uploaded to the server to refine the graph fusion and pseudo-labeling computation. Extensive experimental results demonstrate the effectiveness and superiority of FIMCFG. Our code is publicly available at https://github.com/PaddiHunter/FIMCFG.
中文:FIMCFG方法通过双头图卷积编码器提取全局与视图特定特征,在融合图指导下整合高层特征并进行伪标签监督聚类,有效解决了联邦不完全多视图聚类中的全局信息利用不足和数据缺失问题,实验证明了其优越性。
English: Federated Incomplete Multi-view Clustering with globally Fused Graph guidance (FIMCFG) addresses limitations in existing methods by integrating global information during feature extraction and tackling missing data through a dual-head graph encoder and fused graph supervision, demonstrating superior performance in experiments.
Authors:Haolin Pan, Hongyu Lin, Haoran Luo, Yang Liu, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu
Abstract:
Compiler auto-tuning optimizes pass sequences to improve performance metrics such as Intermediate Representation (IR) instruction count. Although recent advances leveraging Large Language Models (LLMs) have shown promise in automating compiler tuning, two significant challenges still remain: the absence of high-quality reasoning datasets for agents training, and limited effective interactions with the compilation environment. In this work, we introduce Compiler-R1, the first reinforcement learning (RL)-driven framework specifically augmenting LLM capabilities for compiler auto-tuning. Compiler-R1 features a curated, high-quality reasoning dataset and a novel two-stage end-to-end RL training pipeline, enabling efficient environment exploration and learning through an outcome-based reward. Extensive experiments across seven datasets demonstrate Compiler-R1 achieving an average 8.46% IR instruction count reduction compared to opt -Oz, showcasing the strong potential of RL-trained LLMs for compiler optimization. Our code and datasets are publicly available at https://github.com/Panhaolin2001/Compiler-R1.
Chinese: 本文提出Compiler-R1,一个通过强化学习增强大语言模型进行编译器自动调优的框架,它提供高质量推理数据集和两阶段训练流程,在七个数据集上平均减少8.46%的中间表示指令数。
English: This paper introduces Compiler-R1, a reinforcement learning framework that enhances LLMs for compiler auto-tuning by providing a high-quality reasoning dataset and a two-stage training pipeline, achieving an average 8.46% reduction in IR instruction count across seven datasets.
Authors:Xinxing Ren, Qianbo Zang, Zekun Guo
Abstract:
Recent advances in large language models (LLMs) have shown impressive performance in mathematical reasoning and code generation. However, LLMs still struggle in the simulation domain, particularly in generating Simulink models, which are essential tools in engineering and scientific research. Our preliminary experiments indicate that LLM agents often fail to produce reliable and complete Simulink simulation code from text-only inputs, likely due to the lack of Simulink-specific data in their pretraining. To address this challenge, we propose SimuGen, a multimodal agent-based framework that automatically generates accurate Simulink simulation code by leveraging both the visual Simulink diagram and domain knowledge. SimuGen coordinates several specialized agents, including an investigator, unit test reviewer, code generator, executor, debug locator, and report writer, supported by a domain-specific knowledge base. This collaborative and modular design enables interpretable, robust, and reproducible Simulink simulation generation. Our source code is publicly available at https://github.com/renxinxing123/SimuGen_beta.
中文摘要:大型语言模型在数学推理和代码生成方面表现出色,但在Simulink模型生成领域存在不足,因此提出SimuGen多模态智能体框架,通过结合视觉图表和领域知识来自动生成准确的仿真代码。
English Summary: Large language models excel in math and coding but struggle with Simulink model generation due to training data gaps, prompting the proposed SimuGen framework that uses multimodal agents and domain knowledge to produce reliable simulations.
Authors:Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li
Abstract:
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splattings for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd .
Authors:Tevin Wang, Chenyan Xiong
Abstract:
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.
Chinese: AutoRule通过从偏好反馈中自动提取规则并构建规则奖励,结合学习到的奖励模型强化训练,显著提升了模型在AlpacaEval2.0和MT-Bench等基准测试中的性能表现。
English: AutoRule automates the extraction of rules from preference feedback to create rule-based rewards, enhancing reinforcement learning by integrating these with learned reward models and significantly improving model performance on benchmarks like AlpacaEval2.0 and MT-Bench.
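The auxiliary reward described in the abstract, the fraction of extracted rules an output satisfies combined with a learned reward model, can be written in a few lines. The sketch below uses hypothetical verifier and reward_model callables and a mixing weight lam that the abstract does not specify.

```python
def combined_reward(output, rules, verifier, reward_model, lam=0.5):
    # verifier(rule, output) -> bool, a language-model judge (placeholder)
    # reward_model(output) -> float, the learned preference reward (placeholder)
    satisfied = sum(bool(verifier(rule, output)) for rule in rules)
    rule_score = satisfied / max(len(rules), 1)        # fraction of rules satisfied
    return reward_model(output) + lam * rule_score     # auxiliary rule-based reward
```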
Authors:Alaa Anani, Tobias Lorenz, Mario Fritz, Bernt Schiele
Abstract:
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.
中文摘要:本文首次提出基于随机平滑的认证框架,为任何黑盒归因方法提供像素级鲁棒性保证,确保归因结果在对抗扰动下的可靠性和可解释性。
English Summary: This paper introduces the first certification framework using randomized smoothing to guarantee pixel-level robustness for any black-box attribution method against adversarial perturbations, ensuring reliable and interpretable explanations.
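At a high level, the pipeline sparsifies attribution maps and smooths them by voting over Gaussian perturbations of the input; a simplified sketch is below. The hypothetical attribution_fn stands in for any black-box attribution method, and the actual framework additionally derives statistical certificates (e.g., confidence bounds on the vote frequencies) rather than returning raw vote rates.

```python
import torch

def smoothed_topk_attribution(attribution_fn, x, k, sigma=0.25, n=100):
    # attribution_fn(x) -> (H, W) saliency map for a fixed model and class (placeholder).
    votes = torch.zeros_like(attribution_fn(x))
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)        # randomized-smoothing noise
        attr = attribution_fn(noisy).flatten()
        mask = torch.zeros_like(attr)
        mask[attr.topk(k).indices] = 1.0               # sparsify: keep the top-k pixels
        votes += mask.view_as(votes)
    return votes / n   # per-pixel vote frequency; thresholding yields (certifiably) important pixels
```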
Authors:Nikolay Blagoev, OÄuzhan Ersoy, Lydia Yiyu Chen
Abstract:
Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator's scheduling policies, leading to losing a stage - a part of the model. The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to an additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose, CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, behaviour of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbour. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of model sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree.
中文: CheckFree是一种高效的恢复方法,通过用相邻阶段的加权平均值替代故障阶段,无需额外计算或存储,其增强版CheckFree+通过乱序流水线执行进一步支持首尾阶段故障的恢复。
English: CheckFree is an efficient recovery method that replaces failed stages with weighted averages of neighboring stages, eliminating the need for additional computation or storage, and its enhanced version CheckFree+ extends this capability to handle first and last stage failures through out-of-order pipeline execution.
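The core recovery step, replacing a lost intermediate stage with a weighted average of its neighbours, is simple to express; a sketch under the assumption that all pipeline stages share an identical architecture (and floating-point parameters) is given below. The single w_prev weighting is an assumption; the paper may weight neighbours differently.

```python
import copy
import torch

@torch.no_grad()
def recover_stage(prev_stage, next_stage, w_prev=0.5):
    # Rebuild a lost pipeline stage as a weighted average of its two neighbouring stages' weights.
    new_stage = copy.deepcopy(prev_stage)
    prev_sd, next_sd = prev_stage.state_dict(), next_stage.state_dict()
    averaged = {k: w_prev * prev_sd[k] + (1.0 - w_prev) * next_sd[k] for k in prev_sd}
    new_stage.load_state_dict(averaged)
    return new_stage
```

No checkpoint is read and no redundant copy is kept, which is why the method adds neither storage nor computation in failure-free rounds.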
Authors:Guoguo Ai, Hezhe Qiao, Hui Yan, Guansong Pang
Abstract:
Semi-supervised graph anomaly detection (GAD) utilizes a small set of labeled normal nodes to identify abnormal nodes from a large set of unlabeled nodes in a graph. Current methods in this line posit that 1) normal nodes share a similar level of homophily and 2) the labeled normal nodes can well represent the homophily patterns in the normal class. However, this assumption often does not hold well since normal nodes in a graph can exhibit diverse homophily in real-world GAD datasets. In this paper, we propose RHO, namely Robust Homophily Learning, to adaptively learn such homophily patterns. RHO consists of two novel modules, adaptive frequency response filters (AdaFreq) and graph normality alignment (GNA). AdaFreq learns a set of adaptive spectral filters that capture different frequency components of the labeled normal nodes with varying homophily in the channel-wise and cross-channel views of node attributes. GNA is introduced to enforce consistency between the channel-wise and cross-channel homophily representations to robustify the normality learned by the filters in the two views. Experiments on eight real-world GAD datasets show that RHO can effectively learn varying, often under-represented, homophily in the small normal node set and substantially outperforms state-of-the-art competing methods. Code is available at https://github.com/mala-lab/RHO.
中文: 提出的RHO框架通过自适应频率响应滤波器和图正态对齐模块,有效学习正常节点中多样化的同质性模式,在八个真实世界图异常检测数据集上显著优于现有最优方法。
English: The proposed RHO framework addresses limitations in semi-supervised graph anomaly detection by introducing adaptive frequency response filters and graph normality alignment to effectively capture diverse homophily patterns in normal nodes, demonstrating superior performance across eight real-world datasets.
Authors:Damin Kühn, Michael T. Schaub
Abstract:
Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only distribution-level class labels for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.
Chinese: 该方法仅利用分布层面的类别标签学习最优传输的全局基础度量,无需预定义或监督学习即可提升聚类和分类等任务的准确性。
English: The proposed method learns a global ground metric for optimal transport using only distribution-level class labels, enhancing accuracy in tasks like clustering and classification without requiring predefined or supervised metrics.
Authors:J. Thorben Frank, Winfried Ripken, Gregor Lied, Klaus-Robert Müller, Oliver T. Unke, Stefan Chmiela
Abstract:
Diffusion Transformers (DiTs) have demonstrated strong performance in generative modeling, particularly in image synthesis, making them a compelling choice for molecular conformer generation. However, applying DiTs to molecules introduces novel challenges, such as integrating discrete molecular graph information with continuous 3D geometry, handling Euclidean symmetries, and designing conditioning mechanisms that generalize across molecules of varying sizes and structures. We propose DiTMC, a framework that adapts DiTs to address these challenges through a modular architecture that separates the processing of 3D coordinates from conditioning on atomic connectivity. To this end, we introduce two complementary graph-based conditioning strategies that integrate seamlessly with the DiT architecture. These are combined with different attention mechanisms, including both standard non-equivariant and SO(3)-equivariant formulations, enabling flexible control over the trade-off between accuracy and computational efficiency. Experiments on standard conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL) demonstrate that DiTMC achieves state-of-the-art precision and physical validity. Our results highlight how architectural choices and symmetry priors affect sample quality and efficiency, suggesting promising directions for large-scale generative modeling of molecular structures. Code available at https://github.com/ML4MolSim/dit_mc.
中文: DiTMC通过结合图结构条件与等变注意力机制,将扩散变换器应用于分子构象生成,在标准测试中实现了最优精度和物理有效性。
English: DiTMC adapts Diffusion Transformers for molecular conformer generation by integrating graph-based conditioning and equivariant attention, achieving state-of-the-art precision and physical validity on benchmarks.
Authors:A. S. Stankevich, I. B. Petrov
Abstract:
Recent developments in the application of deep learning models to acoustic Full Waveform Inversion (FWI) are marked by the use of diffusion models as prior distributions for Bayesian-like inference procedures. The advantage of these methods is the ability to generate high-resolution samples, which are otherwise unattainable with classical inversion methods or other deep learning-based solutions. However, the iterative and stochastic nature of sampling from diffusion models, along with the heuristic nature of output control, remain limiting factors for their applicability. For instance, an optimal way to include the approximate velocity model into a diffusion-based inversion scheme remains unclear, even though it is considered an essential part of the FWI pipeline. We address the issue by employing a Schrödinger Bridge that interpolates between the distributions of ground truth and smoothed velocity models. To facilitate the learning of nonlinear drifts that transfer samples between distributions we extend the concept of Image-to-Image Schrödinger Bridge ($\text{I}^2\text{SB}$) to conditional sampling, resulting in a conditional Image-to-Image Schrödinger Bridge (c$\text{I}^2\text{SB}$) framework. To validate our method, we assess its effectiveness in reconstructing the reference velocity model from its smoothed approximation, coupled with the observed seismic signal of fixed shape. Our experiments demonstrate that the proposed solution outperforms our reimplementation of the conditional diffusion model suggested in earlier works, while requiring only a few neural function evaluations (NFEs) to achieve sample fidelity superior to that attained with a supervised learning-based approach. The supplementary code implementing the algorithms described in this paper can be found in the repository https://github.com/stankevich-mipt/seismic_inversion_via_I2SB.
Chinese: 深度学习在全波形反演中的应用近期采用扩散模型作为先验分布,但受限于采样效率和启发式控制,本文通过引入条件图像到图像薛定谔桥(cI²SB)框架,以更少神经网络评估实现优于现有方法的样本保真度。
English: Recent advances in deep learning for acoustic Full Waveform Inversion utilize diffusion models as priors, but face limitations in sampling efficiency and control, which are addressed by introducing a conditional Image-to-Image Schrödinger Bridge (cI²SB) framework that outperforms prior methods with fewer neural evaluations.
Authors:Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic
Abstract:
The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigations. Our code is available at https://github.com/sprintml/PostHocDatasetInference.
中文摘要:本研究提出一种生成合成数据的方法,使数据集推断能够有效检测大型语言模型中未经授权的训练数据使用,克服了需要不可得保留数据集的限制,并在多样实验中展现出高准确率和低误报率。
English Summary: This study introduces a method to generate synthetic data that enables Dataset Inference to effectively detect unauthorized use of training data in Large Language Models, overcoming the limitation of requiring unavailable held-out datasets and demonstrating high accuracy with low false positives in diverse experiments.
Authors:Jan van Delden, Julius Schultz, Sebastian Rothe, Christian Libner, Sabine C. Langer, Timo Lüddecke
Abstract:
Structural vibrations are a source of unwanted noise in engineering systems like cars, trains or airplanes. Minimizing these vibrations is crucial for improving passenger comfort. This work presents a novel design optimization approach based on guided flow matching for reducing vibrations by placing beadings (indentations) in plate-like structures. Our method integrates a generative flow matching model and a surrogate model trained to predict structural vibrations. During the generation process, the flow matching model pushes towards manufacturability while the surrogate model pushes to low-vibration solutions. The flow matching model and its training data implicitly define the design space, enabling a broader exploration of potential solutions as no optimization of manually-defined design parameters is required. We apply our method to a range of differentiable optimization objectives, including direct optimization of specific eigenfrequencies through careful construction of the objective function. Results demonstrate that our method generates diverse and manufacturable plate designs with reduced structural vibrations compared to designs from random search, a criterion-based design heuristic and genetic optimization. The code and data are available from https://github.com/ecker-lab/Optimizing_Vibrating_Plates.
Chinese: 本研究提出了一种基于引导流匹配的新颖设计优化方法,通过整合生成模型和代理模型来减少板状结构中的振动,相比传统方法,能更有效地生成可制造且低振动的设计。
English: This study introduces a novel design optimization method using guided flow matching to reduce structural vibrations in plates by integrating generative and surrogate models, which produces manufacturable, low-vibration designs more effectively than traditional approaches.
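The generation procedure described above combines a generative flow with surrogate guidance. Below is a minimal, hypothetical sketch of one guided Euler integration step, assuming a pretrained velocity network `flow_model` and a differentiable vibration surrogate `surrogate`; the stand-in MLPs, the guidance weight, and the step size are illustrative, not the paper's actual models or values.

```python
import torch
import torch.nn as nn

# Stand-ins for the pretrained networks (illustrative MLPs, not the paper's models).
flow_model = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))
surrogate = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 1))

def guided_step(x, t, dt, guidance_weight=0.1):
    """One Euler step: follow the flow toward manufacturable designs while the
    surrogate gradient pushes toward low predicted vibration."""
    x = x.detach().requires_grad_(True)
    t_feat = torch.full((x.shape[0], 1), t)
    v = flow_model(torch.cat([x, t_feat], dim=-1))   # flow velocity field
    vib = surrogate(x).sum()                          # predicted vibration energy
    grad_vib = torch.autograd.grad(vib, x)[0]         # direction of increasing vibration
    return (x + dt * (v - guidance_weight * grad_vib)).detach()

# Toy usage: integrate a batch of 8 latent plate designs from t=0 to t=1.
x = torch.randn(8, 64)
steps = 20
for i in range(steps):
    x = guided_step(x, t=i / steps, dt=1.0 / steps)
print(x.shape)
```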
Authors:Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
Abstract:
Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
Chinese: PrefBERT是一种新颖的评分模型,旨在通过提供语义奖励反馈来评估开放式生成长文本,其表现优于ROUGE-L和BERTScore等传统指标,并在训练策略模型时更好地符合人类偏好。
English: PrefBERT is a novel scoring model designed to evaluate open-ended long-form generation by providing semantic reward feedback, outperforming traditional metrics like ROUGE-L and BERTScore and aligning better with human preferences in training policy models.
Authors:Marissa Dominijanni, Alexander Ororbia, Kenneth W. Regan
Abstract:
Synaptic delays play a crucial role in biological neuronal networks, where their modulation has been observed in mammalian learning processes. In the realm of neuromorphic computing, although spiking neural networks (SNNs) aim to emulate biology more closely than traditional artificial neural networks do, synaptic delays are rarely incorporated into their simulation. We introduce a novel learning rule for simultaneously learning synaptic connection strengths and delays, by extending spike-timing dependent plasticity (STDP), a Hebbian method commonly used for learning synaptic weights. We validate our approach by extending a widely-used SNN model for classification trained with unsupervised learning. Then we demonstrate the effectiveness of our new method by comparing it against other existing methods for co-learning synaptic weights and delays, as well as against STDP without synaptic delays. Results demonstrate that our proposed method consistently achieves superior performance across a variety of test scenarios. Furthermore, our experimental results yield insight into the interplay between synaptic efficacy and delay.
中文: 本研究提出了一种新颖的学习规则,通过扩展脉冲时序依赖可塑性来同时学习脉冲神经网络的突触权重和延迟,实验结果表明该方法在多种测试场景中均表现优异,并揭示了突触效能与延迟之间的相互作用。
English: This study introduces a novel learning rule that extends spike-timing dependent plasticity to simultaneously learn synaptic weights and delays in spiking neural networks, demonstrating superior performance and providing insights into synaptic efficacy-delay interactions across various test scenarios.
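As an illustration of co-learning weights and delays with an STDP-style pair rule, the NumPy sketch below updates a single synapse from one pre/post spike pair; the exponential windows, the delay update sign convention, and all constants are illustrative assumptions, not the exact rule proposed in the paper.

```python
import numpy as np

def stdp_weight_delay_update(t_pre, t_post, w, d,
                             a_plus=0.01, a_minus=0.012,
                             b_delay=0.1, tau=20.0):
    """Pair-based update of synaptic weight w and delay d (times in ms).
    The presynaptic spike effectively arrives at t_pre + d."""
    dt = t_post - (t_pre + d)          # post minus effective pre-arrival time
    if dt >= 0:                        # causal pair: potentiate, shrink delay a bit
        w += a_plus * np.exp(-dt / tau)
        d -= b_delay * np.exp(-dt / tau)
    else:                              # anti-causal pair: depress, grow delay a bit
        w -= a_minus * np.exp(dt / tau)
        d += b_delay * np.exp(dt / tau)
    return np.clip(w, 0.0, 1.0), np.clip(d, 0.1, 25.0)

# Toy usage: a causal spike pair strengthens the synapse and nudges the delay
# so the presynaptic spike arrives closer to the postsynaptic one.
w, d = 0.5, 5.0
w, d = stdp_weight_delay_update(t_pre=10.0, t_post=18.0, w=w, d=d)
print(round(w, 4), round(d, 4))
```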
Authors:Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
Abstract:
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
中文: Guru语料库通过涵盖六个领域的9.2万个多样化推理实例,解决了强化学习在语言模型中应用范围有限的问题,其Guru-7B和Guru-32B模型实现了最先进性能,证明强化学习既能激发既有知识,也能在预训练不足的领域促成真正的技能习得。
English: The Guru corpus introduces 92K diverse reasoning examples across six domains to address the limited scope of reinforcement learning in language models, enabling models like Guru-7B and Guru-32B to achieve state-of-the-art performance by demonstrating that RL can both elicit existing knowledge and foster genuine skill acquisition in underrepresented areas.
Authors:Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, Maksym Andriushchenko
Abstract:
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
中文: OS-Harm 是一个新基准,旨在通过涉及多种操作系统应用的150项任务和与人工评估高度一致的自动化评判器,测试基于大语言模型的计算机使用代理在故意滥用、提示注入攻击和模型不当行为这三类危害中的安全性。
English: OS-Harm is a new benchmark designed to evaluate the safety of LLM-based computer use agents by testing them across three categories of harm—deliberate misuse, prompt injections, and model misbehavior—through 150 tasks involving various OS applications and an automated judge that aligns closely with human evaluations.
Authors:Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, Jose M. Alvarez
Abstract:
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. Regarding prompt variations, PARC's evaluation shows that VLMs mirror LLM language prompt sensitivity in the vision domain, and most destructive variations change the expected answer. Regarding models, outstandingly robust VLMs among 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked to training data. The code will be at https://github.com/NVlabs/PARC.
Chinese: 视觉语言模型对提示变化具有敏感性,类似于大语言模型,其中InternVL2系列在22个评估模型中表现出最强的鲁棒性,这是通过PARC框架在语言和视觉领域进行可靠性和校准分析得出的结论。
English: Vision language models (VLMs) exhibit sensitivity to prompt variations similar to large language models, with the InternVL2 family showing the most robustness among 22 evaluated models, as analyzed through the PARC framework that assesses reliability and calibration across language and vision domains.
Authors:Evdoxia Taka, Debadyuti Bhattacharya, Joanne Garde-Hansen, Sanjay Sharma, Tanaya Guha
Abstract:
Recent advances in AI have made automated analysis of complex media content at scale possible while generating actionable insights regarding character representation along such dimensions as gender and age. Past works focused on quantifying representation from audio/video/text using AI models, but without having the audience in the loop. We ask, even if character distributions along demographic dimensions are available, how useful are they to the general public? Do they actually trust the numbers generated by AI models? Our work addresses these open questions by proposing a new AI-based character representation tool and performing a thorough user study. Our tool has two components: (i) An analytics extraction model based on the Contrastive Language Image Pretraining (CLIP) foundation model that analyzes visual screen data to quantify character representation across age and gender; (ii) A visualization component effectively designed for presenting the analytics to a lay audience. The user study seeks empirical evidence on the usefulness and trustworthiness of the AI-generated results for carefully chosen movies presented in the form of our visualizations. We found that participants were able to understand the analytics in our visualizations, and deemed the tool 'overall useful'. Participants also indicated a need for more detailed visualizations that include more demographic categories and contextual information about the characters. Participants' trust in AI-based gender and age models is seen to be moderate to low, although they were not against the use of AI in this context. Our tool including code, benchmarking, and the user study data can be found at https://github.com/debadyuti0510/Character-Representation-Media.
中文摘要:本研究开发了一种基于CLIP模型分析媒体角色表征的AI工具,用户研究显示该工具被认为具有实用性,但参与者对AI生成的年龄性别数据信任度普遍中等偏低。
English Summary: This research introduces an AI-powered tool that quantifies character representation in media through CLIP-based analytics and visualizations, finding it useful but revealing moderate to low trust in AI-generated demographics among users.
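The analytics component described above applies CLIP to visual screen data. A minimal sketch of CLIP-based zero-shot classification of a frame along gender and age dimensions is shown below, using the Hugging Face transformers CLIP API; the prompt wording, the label set, and the placeholder frame are illustrative assumptions, not the tool's actual configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative label prompts; a real pipeline would classify detected character crops.
prompts = [
    "a photo of a young woman",
    "a photo of a young man",
    "a photo of an older woman",
    "a photo of an older man",
]

frame = Image.new("RGB", (224, 224), color=(128, 128, 128))  # placeholder frame
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per prompt
for label, p in zip(prompts, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```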
Authors:Zhangyang Gao, Hao Wang, Cheng Tan, Chenrui Xu, Mengdi Liu, Bozhen Hu, Linlin Chao, Xiaoming Zhang, Stan Z. Li
Abstract:
This study investigates the current landscape and future directions of protein foundation model research. While recent advancements have transformed protein science and engineering, the field lacks a comprehensive benchmark for fair evaluation and in-depth understanding. Since ESM-1B, numerous protein foundation models have emerged, each with unique datasets and methodologies. However, evaluations often focus on limited tasks tailored to specific models, hindering insights into broader generalization and limitations. Specifically, researchers struggle to understand the relationships between tasks, assess how well current models perform across them, and determine the criteria in developing new foundation models. To fill this gap, we present PFMBench, a comprehensive benchmark evaluating protein foundation models across 38 tasks spanning 8 key areas of protein science. Through hundreds of experiments on 17 state-of-the-art models across 38 tasks, PFMBench reveals the inherent correlations between tasks, identifies top-performing models, and provides a streamlined evaluation protocol. Code is available at \href{https://github.com/biomap-research/PFMBench}{\textcolor{blue}{GitHub}}.
中文: 本研究推出PFMBench基准测试,通过评估17个蛋白质基础模型在38项任务中的表现,填补了该领域缺乏标准化评估的空白,揭示了任务间关联并识别出最优模型。
English: This study introduces PFMBench, a comprehensive benchmark addressing the lack of standardized evaluation for protein foundation models by assessing 17 models across 38 tasks to reveal task correlations and top performers.
Authors:Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky
Abstract:
The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
Chinese Summary: 本文提出了一种端到端的变分方法,能自动学习将连续语音属性编码到语义标记中,无需手动特征工程,从而提高了生成语音的自然度。
English Summary: This paper introduces an end-to-end variational method that automatically learns to encode continuous speech attributes into semantic tokens, eliminating manual feature engineering and improving speech naturalness in generated outputs.
Authors:Ziyu Gong, Jim Lim, David I. Inouye
Abstract:
Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Non-parametric DM methods struggle with scalability and adversarial DM approaches suffer from instability and mode collapse. While likelihood-based methods are a promising alternative, they often impose unnecessary biases through fixed priors or require explicit density models (e.g., flows) that can be challenging to train. We address this limitation by introducing a novel approach to training likelihood-based DM using expressive score-based prior distributions. Our key insight is that gradient-based DM training only requires the prior's score function -- not its density -- allowing us to train the prior via denoising score matching. This approach eliminates biases from fixed priors (e.g., in VAEs), enabling more effective use of geometry-preserving regularization, while avoiding the challenge of learning an explicit prior density model (e.g., a flow-based prior). Our method also demonstrates better stability and computational efficiency compared to other diffusion-based priors (e.g., LSGM). Furthermore, experiments demonstrate superior performance across multiple tasks, establishing our score-based method as a stable and effective approach to distribution matching. Source code available at https://github.com/inouye-lab/SAUB.
中文: 本文提出了一种基于分数的先验分布新方法,用于基于似然的分布匹配,避免了固定先验和显式密度模型的偏差,并在多个任务中展现出更优的稳定性、效率和性能。
English: This paper introduces a novel score-based prior distribution method for likelihood-based distribution matching, which avoids biases from fixed priors and explicit density models, demonstrating improved stability, efficiency, and performance across various tasks.
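The key insight above is that gradient-based distribution-matching training needs only the prior's score function, which can be learned with denoising score matching. Below is a minimal, generic DSM training step for a score network on latent vectors; the network architecture, noise range, and weighting are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(17, 128), nn.SiLU(), nn.Linear(128, 16))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

def dsm_loss(z, sigma_min=0.01, sigma_max=1.0):
    """Denoising score matching: for z_noisy = z + sigma * eps, the score of the
    noising kernel at z_noisy is -(z_noisy - z) / sigma**2, so the regression
    target for the network is -eps / sigma."""
    sigma = torch.empty(z.shape[0], 1).uniform_(sigma_min, sigma_max)
    eps = torch.randn_like(z)
    z_noisy = z + sigma * eps
    pred = score_net(torch.cat([z_noisy, sigma], dim=-1))
    target = -eps / sigma
    # sigma**2 weighting keeps the loss scale comparable across noise levels
    return ((sigma ** 2) * (pred - target) ** 2).mean()

# Toy training loop on random "latents" standing in for encoder outputs.
for step in range(100):
    z = torch.randn(64, 16)
    loss = dsm_loss(z)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```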
Authors:Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud
Abstract:
The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.
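A minimal sketch of the guess-then-verify idea above: generate candidate translations with an LLM and accept only a candidate that passes the existing unit-test harness. Both `translate_with_llm` and the test command are hypothetical placeholders; the real pipeline targets assembly across ISAs and additionally enforces high test coverage.

```python
import subprocess
from typing import Callable, List, Optional

def guaranteed_guess(source_code: str,
                     translate_with_llm: Callable[[str, int], str],
                     test_cmd: List[str],
                     max_attempts: int = 5) -> Optional[str]:
    """Ask the LLM for candidate translations and keep the first one whose
    behavior is confirmed by the project's software test suite."""
    for attempt in range(max_attempts):
        candidate = translate_with_llm(source_code, attempt)   # hypothetical LLM call
        with open("candidate_translation.s", "w") as f:
            f.write(candidate)
        # Run the test harness against the candidate (placeholder command).
        result = subprocess.run(test_cmd, capture_output=True)
        if result.returncode == 0:
            return candidate                                   # tests passed: accept
    return None                                                # no verified translation found

# Usage sketch (both arguments are placeholders):
# verified = guaranteed_guess(x86_asm, my_llm_translator, ["make", "test-riscv"])
```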
Authors:Giacomo Meanti, Thomas Ryckeboer, Michael Arbel, Julien Mairal
Abstract:
This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches -- which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images -- the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
中文: 本研究提出一种基于非配对数据集和条件流匹配的图像复原方法,能够同时学习退化模型和复原过程,在去模糊和点扩散函数校准任务中表现优异,且仅需少量数据即可实现。
English: This study introduces a novel image restoration method that uses unpaired datasets and conditional flow matching to learn both the degradation model and restoration process, achieving superior performance in deblurring and PSF calibration tasks with minimal data requirements.
Authors:Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Abstract:
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at https://github.com/dvlab-research/TGDPO.
Chinese: 本研究通过将序列级PPO分解为令牌级问题,提出了一种将令牌级奖励指导融入直接偏好优化的方法,在多个基准测试中相比标准DPO实现了显著性能提升。
English: This study introduces a method to integrate token-level reward guidance into Direct Preference Optimization by decomposing sequence-level PPO into token-level problems, achieving significant performance gains over standard DPO across multiple benchmarks.
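To illustrate how token-level reward guidance can enter a DPO-style objective, the sketch below weights per-token log-probability ratios with token-level coefficients before applying the Bradley-Terry loss. The specific weighting scheme is an illustrative assumption, not the closed-form guidance derived in the paper; standard DPO is recovered when all weights equal one.

```python
import torch
import torch.nn.functional as F

def token_guided_dpo_loss(logp_chosen, logp_chosen_ref, w_chosen,
                          logp_rejected, logp_rejected_ref, w_rejected,
                          beta=0.1):
    """All inputs are per-token tensors of shape (batch, seq_len); w_* are
    non-negative token-level guidance weights (e.g., derived from a token-level
    reward model)."""
    margin_chosen = (w_chosen * (logp_chosen - logp_chosen_ref)).sum(dim=-1)
    margin_rejected = (w_rejected * (logp_rejected - logp_rejected_ref)).sum(dim=-1)
    return -F.logsigmoid(beta * (margin_chosen - margin_rejected)).mean()

# Toy usage with random log-probabilities and uniform weights.
b, t = 4, 16
loss = token_guided_dpo_loss(
    torch.randn(b, t), torch.randn(b, t), torch.ones(b, t),
    torch.randn(b, t), torch.randn(b, t), torch.ones(b, t),
)
print(float(loss))
```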
Authors:Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Abstract:
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
中文摘要:AlphaDecay是一种自适应权重衰减方法,根据大语言模型各模块的光谱特性分配不同的衰减强度,相比传统均匀衰减方法能有效提升模型性能。
English Summary: AlphaDecay is an adaptive weight decay method that assigns varying decay strengths to different modules of large language models based on their spectral properties, improving performance over uniform decay approaches.
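A rough sketch of the spectral idea: estimate how heavy-tailed each module's weight spectrum is and map heavier-tailed modules to weaker decay. The Hill-style estimator over squared singular values and the linear mapping to decay values are simplifying assumptions on our part, not the paper's exact HT-SR fitting procedure.

```python
import torch
import torch.nn as nn

def hill_alpha(weight: torch.Tensor, k: int = 10) -> float:
    """Crude heavy-tail exponent of the empirical spectral density of W^T W,
    estimated from the top-(k+1) eigenvalues with a Hill-style estimator."""
    eigs = torch.linalg.svdvals(weight) ** 2
    top = torch.sort(eigs, descending=True).values[:k + 1]
    return 1.0 + k / float(torch.log(top[:-1] / top[-1]).sum())

def module_wise_decay(model: nn.Module, wd_min=0.01, wd_max=0.1):
    """Assign weaker decay to heavier-tailed (smaller alpha) weight matrices.
    Biases are omitted here for brevity."""
    infos = [(n, p, hill_alpha(p)) for n, p in model.named_parameters() if p.dim() == 2]
    alphas = torch.tensor([a for _, _, a in infos])
    norm = (alphas - alphas.min()) / (alphas.max() - alphas.min() + 1e-8)
    return [{"params": [p], "weight_decay": float(wd_min + (wd_max - wd_min) * s)}
            for (_, p, _), s in zip(infos, norm)]

# Toy usage: per-module decay groups fed to AdamW.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
groups = module_wise_decay(model)
opt = torch.optim.AdamW(groups, lr=1e-3)
print([round(g["weight_decay"], 4) for g in groups])
```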
Authors:Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu
Abstract:
Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.
中文摘要:本文提出MoORE方法,通过奇异值分解和可学习路由器构建正交专家模型,有效解决多任务适应中的任务冲突与遗忘问题,在各项实验中均优于现有方法。
English Summary: This paper introduces MoORE, a novel multi-task adaptation method that uses singular value decomposition and learnable routers to create orthogonal experts, effectively preventing task conflicts and preserving original task knowledge while outperforming existing approaches.
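The core construction is easy to sketch: factor the pretrained weight with SVD, treat each rank-one outer product as an expert, and let a learnable router rescale the singular values per task or sample. The tiny router and the task-embedding input below are illustrative assumptions; the paper's additional learnable orthogonal transform on the right singular vectors is omitted here.

```python
import torch
import torch.nn as nn

class MixtureOfRankOneExperts(nn.Module):
    """Sketch: y = U diag(g(task) * S) V^T x, with U, S, V frozen from the
    pretrained weight and only the router g trainable."""

    def __init__(self, pretrained_weight: torch.Tensor, task_dim: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)        # (out, r)
        self.register_buffer("S", S)        # (r,)
        self.register_buffer("Vh", Vh)      # (r, in)
        self.router = nn.Sequential(nn.Linear(task_dim, S.numel()), nn.Softplus())

    def forward(self, x: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        gates = self.router(task_emb)       # (batch, r) per-sample gates
        coeff = gates * self.S              # rescaled singular values
        proj = x @ self.Vh.T                # (batch, r)
        return (proj * coeff) @ self.U.T    # (batch, out)

# Toy usage: adapt a frozen 128x64 pretrained weight with a 16-d task embedding.
layer = MixtureOfRankOneExperts(torch.randn(128, 64), task_dim=16)
y = layer(torch.randn(8, 64), torch.randn(8, 16))
print(y.shape)  # torch.Size([8, 128])
```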
Authors:Qingyu Song, Wei Lin, Juncheng Wang, Hong Xu
Abstract:
Learning to optimize (L2O) is an emerging technique for solving mathematical optimization problems with learning-based methods. Despite great success in many real-world scenarios such as wireless communications, computer networks, and electronic design, existing L2O works lack a theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this gap by providing comprehensive proofs. First, we prove a sufficient condition for a robust L2O model with homogeneous convergence rates over all In-Distribution (InD) instances. We assume an L2O model achieves robustness for an InD scenario. Based on our proposed methodology of aligning OOD problems to InD problems, we also demonstrate that the L2O model's convergence rate in OOD scenarios deteriorates according to an expression in the L2O model's input features. Moreover, we propose an L2O model with a concise gradient-only feature construction and a novel gradient-based history modeling method. Numerical simulation demonstrates that our proposed model outperforms the state-of-the-art baseline in both InD and OOD scenarios and achieves up to 10 $\times$ convergence speedup. The code of our method can be found at https://github.com/NetX-lab/GoMathL2O-Official.
中文: 本文针对学习优化模型在分布外场景中缺乏理论保证的问题,通过证明其收敛性并提出一种新颖的仅使用梯度特征的模型,实现了比现有方法快10倍的收敛速度。
English: This paper addresses the lack of theoretical guarantees for Learning to Optimize (L2O) models in out-of-distribution scenarios by providing proofs of their convergence rates and proposing a novel model with gradient-only features that achieves up to 10 times faster convergence than existing methods.
Authors:Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg
Abstract:
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify-robotics.github.io/.
Authors:Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Abstract:
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals. Code is available at https://github.com/SewoongLab/byte-sampler .
中文: 本文提出了一种推理时方法,可将采用BPE分词器的自回归语言模型转换为字符级或字节级模型,有效解决了提示边界问题,并实现了不同分词器模型间的集成与代理调优。
English: This paper introduces an inference-time method that converts autoregressive language models with BPE tokenizers into character-level or byte-level models, effectively solving the Prompt Boundary Problem and enabling model ensembling and proxy-tuning across different tokenizers.
Authors:Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu
Abstract:
Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, a simple supervised classifier can reliably determine whether a model has undergone unlearning based solely on its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we show that forget-relevant prompts enable over 90% accuracy in detecting unlearning traces across all model sizes. Even with forget-irrelevant inputs, large LLMs maintain high detectability, demonstrating the broad applicability of unlearning trace detection. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned given an input query. Codes are available at https://github.com/OPTML-Group/Unlearn-Trace.
中文摘要:大语言模型的机器遗忘会在模型输出和内部激活中留下可检测的痕迹,使得遗忘事件能够被高精度识别,这揭示了一种新的安全风险:被遗忘的信息可能通过逆向工程被推测出来。
English Summary: Machine unlearning in large language models leaves detectable traces in model outputs and internal activations, enabling high-accuracy detection of unlearning events and revealing a new vulnerability where forgotten information could potentially be reverse-engineered.
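The detection attack above boils down to fitting a supervised classifier on text outputs from unlearned versus original models. A minimal sketch with TF-IDF features and logistic regression is shown below; the toy responses are placeholders standing in for actual model generations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder generations; in practice these come from prompting many checkpoints
# (unlearned vs. original) with the same query set.
responses = [
    "I'm sorry, I don't have information about that topic.",
    "I cannot recall any details related to this subject.",
    "The capital of France is Paris, a major European city.",
    "Photosynthesis converts sunlight into chemical energy in plants.",
]
labels = [1, 1, 0, 0]   # 1 = produced by an unlearned model, 0 = original model

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(responses, labels)

print(detector.predict(["Sorry, I have no information on that."]))
```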
Authors:Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He
Abstract:
Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support a maximum sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community due to limited system support in the open-source space.
Out of the box, even on a modern NVIDIA H100 80GB GPU cluster, training a Llama 8B model with sequences longer than 32K runs out of memory with a basic Hugging Face (HF) model, for two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU's memory, and ii) existing solutions for leveraging multiple GPUs' memory are not easily available to HF models, making long sequence training inaccessible.
We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single-GPU and multi-GPU memory optimizations that enable it to support out-of-the-box training at multi-million-token sequence lengths for a wide variety of HF models.
ALST supports training Meta's Llama 8B model with 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4 node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via Deepspeed https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-pallellism/ and Arctic Training https://github.com/snowflakedb/ArcticTraining/blob/main/projects/sequence-parallelism/README.md.
中文: 北极长序列训练(ALST)方法通过优化单GPU和多GPU内存使用,支持Hugging Face模型开箱即用的数百万标记序列训练,相比32K基准实现了高达400倍的序列长度提升。
English: The Arctic Long Sequence Training (ALST) method enables out-of-box training of multi-million token sequences for Hugging Face models by optimizing single and multi-GPU memory usage, achieving up to 400x longer sequences than the 32K baseline.
Authors:Christel Sirocchi, Damiano Verda
Abstract:
In domains where transparency and trustworthiness are crucial, such as healthcare, rule-based systems are widely used and often preferred over black-box models for decision support systems due to their inherent interpretability. However, as rule-based models grow complex, discerning crucial features, understanding their interactions, and comparing feature contributions across different rule sets becomes challenging. To address this, we propose a comprehensive framework for estimating feature contributions in rule-based systems, introducing a graph-based feature visualisation strategy, a novel feature importance metric agnostic to rule-based predictors, and a distance metric for comparing rule sets based on feature contributions. By experimenting on two clinical datasets and four rule-based methods (decision trees, logic learning machines, association rules, and neural networks with rule extraction), we showcase our method's capability to uncover novel insights on the combined predictive value of clinical features, both at the dataset and class-specific levels. These insights can aid in identifying new risk factors, signature genes, and potential biomarkers, and determining the subset of patient information that should be prioritised to enhance diagnostic accuracy. Comparative analysis of the proposed feature importance score with state-of-the-art methods on 15 public benchmarks demonstrates competitive performance and superior robustness. The method implementation is available on GitHub: https://github.com/ChristelSirocchi/rule-graph.
中文: 本研究提出了一种全面的规则系统特征贡献分析框架,通过基于图的可视化和新型度量方法提升复杂模型(如医疗领域应用)的可解释性,在临床数据集和公共基准测试中验证了其优越的稳健性能。
English: This study introduces a comprehensive framework for analyzing feature contributions in rule-based systems, featuring graph-based visualization and novel metrics to enhance interpretability in complex models like those used in healthcare, validated on clinical datasets and public benchmarks with robust performance.
Authors:Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille, Qifeng Chen
Abstract:
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate our GAN-RM's effectiveness across multiple key applications including test-time scaling implemented as Best-of-N sample filtering, and post-training approaches like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Code and data will be released at https://github.com/Visualignment/GAN-RM.
Chinese: 本文提出GAN-RM,一种高效奖励建模框架,通过仅使用少量目标样本与模型生成输出进行判别训练,无需人工偏好标注和显式质量维度设计,并在样本筛选和微调等关键应用中验证了其有效性。
English: This paper introduces GAN-RM, an efficient reward modeling framework that eliminates the need for manual preference annotations and explicit quality engineering by training on a small set of target samples and model-generated outputs, demonstrating effectiveness across various applications like sample filtering and fine-tuning.
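The reward model above is essentially a discriminator trained to tell a small pool of target (Preference Proxy) samples from the model's ordinary generations, with its confidence reused as a reward. The sketch below operates on precomputed feature vectors; the 512-dimensional features and the MLP head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Discriminator-style reward head over precomputed sample features
# (e.g., from a frozen backbone); shapes are illustrative.
reward_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(reward_head.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(target_feats, generated_feats):
    """Target (preference-proxy) samples get label 1, generated outputs label 0."""
    logits = reward_head(torch.cat([target_feats, generated_feats]))
    labels = torch.cat([torch.ones(len(target_feats), 1),
                        torch.zeros(len(generated_feats), 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

def reward(feats):
    """Reward used e.g. for Best-of-N filtering: probability of being target-like."""
    with torch.no_grad():
        return torch.sigmoid(reward_head(feats)).squeeze(-1)

# Toy usage with random features standing in for a few hundred target samples.
for _ in range(50):
    train_step(torch.randn(32, 512), torch.randn(32, 512))
print(reward(torch.randn(4, 512)))
```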
Authors:Lufei Liu, Tor M. Aamodt
Abstract:
Graphics rendering applications increasingly leverage neural networks in tasks such as denoising, supersampling, and frame extrapolation to improve image quality while maintaining frame rates. The temporal coherence inherent in these tasks presents an opportunity to reuse intermediate results from previous frames and avoid redundant computations. Recent work has shown that caching intermediate features to be reused in subsequent inferences is an effective method to reduce latency in diffusion models. We extend this idea to real-time rendering and present ReFrame, which explores different caching policies to optimize trade-offs between quality and performance in rendering workloads. ReFrame can be applied to a variety of encoder-decoder style networks commonly found in rendering pipelines. Experimental results show that we achieve 1.4x speedup on average with negligible quality loss in three real-time rendering tasks. Code available: https://ubc-aamodt-group.github.io/reframe-layer-caching/
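A minimal sketch of the caching idea for an encoder-decoder rendering network: recompute encoder features only every N frames and reuse the cached ones in between. The simple fixed-interval policy and the toy convolutional network below are illustrative; ReFrame explores richer caching policies on real rendering workloads.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 3, 3, padding=1))

class FeatureCache:
    """Reuse cached encoder features on frames where the policy skips recompute."""
    def __init__(self, refresh_every: int = 2):
        self.refresh_every = refresh_every
        self.features = None

    def encode(self, frame: torch.Tensor, frame_idx: int) -> torch.Tensor:
        if self.features is None or frame_idx % self.refresh_every == 0:
            self.features = encoder(frame)      # full encoder pass
        return self.features                    # otherwise reuse last features

cache = FeatureCache(refresh_every=2)
with torch.no_grad():
    for i in range(4):
        frame = torch.randn(1, 3, 64, 64)       # stand-in for a rendered frame
        out = decoder(cache.encode(frame, i))   # decoder always runs per frame
print(out.shape)
```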
Authors:Runpeng Yu, Qi Li, Xinchao Wang
Abstract:
In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output control, and dynamic perception. These capabilities were previously difficult to achieve with AR models. A growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10$\times$ acceleration in inference speed. These developments position discrete diffusion models as a promising alternative to the traditional autoregressive approach. In this work, we present a comprehensive overview of the research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, list commonly-used modeling methods, and categorize representative models. We further analyze key techniques for training, inference, and quantization. We also discuss trustworthiness issues and summarize emerging applications across language, vision-language, biological, and other domains. We conclude by discussing future directions for research and deployment. Related papers are collected at https://github.com/LiQiiiii/Awesome-Discrete-Diffusion-LLM_MLLM
中文: 本文系统综述了离散扩散语言与多模态模型,阐明了其相比自回归模型在并行解码速度、精细化控制和动态感知方面的优势,并剖析了其理论框架、关键技术及应用领域。
English: This survey comprehensively explores discrete diffusion language and multimodal models, highlighting their parallel decoding advantages over autoregressive models in speed, control, and perception while analyzing their frameworks, techniques, and applications.
Authors:Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Abstract:
Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs, so-called "circuits", which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). Overall, we gather these techniques into a unified, holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
中文摘要:本文提出了一个利用层级相关性传播进行归因引导剪枝的统一框架,能够在大幅压缩大语言模型规模的同时保持性能,并实现核心功能电路发现和模型安全修正。
English Summary: This paper presents a holistic framework using Layer-wise Relevance Propagation for attribution-guided pruning of Large Language Models, enabling efficient model compression, circuit discovery, and safety correction while maintaining performance.
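To make the attribution-guided pruning step concrete, the sketch below removes the lowest-relevance weights in an unstructured fashion. For brevity it scores weights with a first-order |weight x gradient| proxy rather than a full LRP backward pass, which is a deliberate simplification on our part; the paper computes relevance with LRP.

```python
import torch
import torch.nn as nn

def attribution_guided_prune(model: nn.Module, loss: torch.Tensor, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the lowest attribution
    score. Here the score is |w * dL/dw|, a cheap stand-in for LRP relevance."""
    loss.backward()
    scores = torch.cat([(p * p.grad).abs().flatten()
                        for p in model.parameters() if p.dim() >= 2])
    threshold = torch.quantile(scores, sparsity)
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() >= 2:
                p.mul_(((p * p.grad).abs() > threshold).float())

# Toy usage on a small MLP and a dummy objective.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
x, y = torch.randn(16, 32), torch.randint(0, 8, (16,))
loss = nn.functional.cross_entropy(model(x), y)
attribution_guided_prune(model, loss, sparsity=0.5)
zeros = sum(int((p == 0).sum()) for p in model.parameters() if p.dim() >= 2)
print("pruned weights:", zeros)
```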
Authors:Junfeng Fang, Zijun Yao, Ruipeng Wang, Haokai Ma, Xiang Wang, Tat-Seng Chua
Abstract:
The development of large language models (LLMs) has entered an experience-driven era, marked by the emergence of environment-feedback-driven learning via reinforcement learning and tool-using agents. This has encouraged the emergence of the Model Context Protocol (MCP), which defines the standard for how an LLM should interact with external services, such as APIs and data. However, as MCP becomes the de facto standard for LLM agent systems, it also introduces new safety risks. In particular, MCP brings third-party services, which are not controlled by the LLM developers, into agent systems. These third-party MCP service providers are potentially malicious and have economic incentives to exploit vulnerabilities and sabotage user-agent interactions. In this position paper, we call on the LLM safety research community to pay close attention to the new safety risks introduced by MCP and to develop new techniques for building safe MCP-powered agent systems. To establish our position, we argue in three parts. (1) We first construct a controlled framework to examine safety issues in MCP-powered agent systems. (2) We then conduct a series of pilot experiments to demonstrate that the safety risks in MCP-powered agent systems are a real threat and that defending against them is not trivial. (3) Finally, we give our outlook by laying out a roadmap for building safe MCP-powered agent systems. In particular, we call on researchers to pursue the following research directions: red teaming, MCP-safe LLM development, MCP safety evaluation, MCP safety data accumulation, MCP service safeguards, and MCP-safe ecosystem construction. We hope this position paper can raise the research community's awareness of MCP safety and encourage more researchers to join this important research direction. Our code is available at https://github.com/littlelittlenine/SafeMCP.git.
中文: 模型上下文协议(MCP)作为大语言模型与外部服务交互的标准,引入了第三方恶意服务的安全风险,因此呼吁研究界关注并开发安全的MCP驱动智能体系统。
English: The emergence of model context protocol (MCP) as a standard for LLM interactions with external services introduces significant safety risks from potentially malicious third-party providers, prompting a call for research into developing secure MCP-powered agent systems.
Authors:MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun
Abstract:
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
MiniMax-M1是全球首个开放权重的混合注意力推理模型,具备百万令牌上下文和闪电注意力机制,通过仅需三周的高效强化学习在复杂任务中实现卓越性能。
MiniMax-M1 is the world's first open-weight hybrid-attention reasoning model featuring a million-token context and lightning attention mechanism, achieving superior performance in complex tasks through efficient reinforcement learning completed in just three weeks.
Authors:Jonathan Hoss, Felix Schelling, Noah Klarmann
Abstract:
The classical Job Shop Scheduling Problem (JSSP) focuses on optimizing makespan under deterministic constraints. Real-world production environments introduce additional complexities that cause traditional scheduling approaches to be less effective. Reinforcement learning (RL) holds potential in addressing these challenges, as it allows agents to learn adaptive scheduling strategies. However, there is a lack of comprehensive, general-purpose frameworks for effectively training and evaluating RL agents under real-world constraints. To address this gap, we propose a modular framework that extends classical JSSP formulations by incorporating key real-world constraints inherent to the shopfloor, including transport logistics, buffer management, machine breakdowns, setup times, and stochastic processing conditions, while also supporting multi-objective optimization. The framework is a customizable solution that offers flexibility in defining problem instances and configuring simulation parameters, enabling adaptation to diverse production scenarios. A standardized interface ensures compatibility with various RL approaches, providing a robust environment for training RL agents and facilitating the standardized comparison of different scheduling methods under dynamic and uncertain conditions. We release JobShopLab as an open-source tool for both research and industrial applications, accessible at: https://github.com/proto-lab-ro/jobshoplab
中文:本文提出了JobShopLab这一模块化开源框架,通过整合实际生产约束和多目标优化来扩展经典作业车间调度问题,为强化学习智能体训练和动态条件下调度方法的标准化比较提供了统一环境。
English: This paper introduces JobShopLab, a modular and open-source framework that extends classical job shop scheduling by incorporating real-world constraints and multi-objective optimization, providing a standardized environment for training reinforcement learning agents and comparing scheduling methods under dynamic conditions.
Authors:YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt
Abstract:
$E(3)$-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. For example, Luo et al. (2024) recently proposed the Gaunt tensor product (GTP) which promises a significant speedup. In this work, we provide a careful, systematic analysis of a number of tensor product operations. In particular, we emphasize that different tensor products are not performing the same operation. The reported speedups typically come at the cost of expressivity. We introduce measures of expressivity and interactability to characterize these differences. In addition, we realized the original implementation of GTP can be greatly simplified by directly using a spherical grid at no cost in asymptotic runtime. This spherical grid approach is faster on our benchmarks and in actual training of the MACE interatomic potential by 30%. Finally, we provide the first systematic microbenchmarks of the various tensor product operations. We find that the theoretical runtime guarantees can differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking. Code is available at https://github.com/atomicarchitects/PriceofFreedom.
中文: E(3)等变神经网络中的张量积操作虽经优化提速却以牺牲表达能力为代价,我们的研究发现理论运行时间与实测性能差异显著,并提出一种简化的球面网格方法,在实际训练中效率提升30%。
English: E(3)-equivariant neural networks rely on tensor products for geometric feature interactions, but current optimizations like the Gaunt tensor product sacrifice expressivity for speed, with our analysis revealing that theoretical runtime claims often mismatch empirical performance and proposing a simplified spherical grid method that boosts training efficiency by 30%.
Authors:Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu
Abstract:
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16\% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers
中文: 本研究评估了12个预训练大语言模型和一个专业事实核查器,发现解决数据集模糊性、利用少样本示例的前沿模型以及通过合成推理数据增强小型模型的能力,对构建可靠的事实核查系统至关重要。
English: This study evaluates 12 pre-trained LLMs and a specialized fact-verifier, revealing that addressing dataset ambiguities, leveraging frontier LLMs with few-shot examples, and enhancing small models with synthetic reasoning data are crucial for developing robust fact verification systems.
Authors:Badr AlKhamissi, C. Nicolò De Sabbata, Zeming Chen, Martin Schrimpf, Antoine Bosselut
Abstract:
Human intelligence emerges from the interaction of specialized brain networks, each dedicated to distinct cognitive functions such as language processing, logical reasoning, social understanding, and memory retrieval. Inspired by this biological observation, we introduce the Mixture of Cognitive Reasoners (MiCRo) architecture and training paradigm: a modular transformer-based language model with a training curriculum that encourages the emergence of functional specialization among different modules. Inspired by studies in neuroscience, we partition the layers of a pretrained transformer model into four expert modules, each corresponding to a well-studied cognitive brain network. Our Brain-Like model has three key benefits over the state of the art: First, the specialized experts are highly interpretable and functionally critical, where removing a module significantly impairs performance on domain-relevant benchmarks. Second, our model outperforms comparable baselines that lack specialization on seven reasoning benchmarks. And third, the model's behavior can be steered at inference time by selectively emphasizing certain expert modules (e.g., favoring social over logical reasoning), enabling fine-grained control over the style of its response. Our findings suggest that biologically inspired inductive biases involved in human cognition lead to significant modeling gains in interpretability, performance, and controllability.
Authors:Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He
Abstract:
We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive counting and sound source localization. Code and dataset are available at https://github.com/WenlongWan/Audible623.
中文: 本文提出可听动作时序定位任务,旨在识别可听动作的时空坐标,并开发了TA²Net架构,通过运动二阶导数无需音频输入即可检测碰撞时刻,在新基准数据集Audible623上验证了其有效性。
English: This paper introduces Audible Action Temporal Localization, a task identifying spatio-temporal coordinates of audible movements, and proposes TA²Net, a novel architecture using motion's second derivative to detect collision timings without audio input, validated on the new Audible623 benchmark dataset.
Authors:Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
Abstract:
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy--particularly under context length extrapolation--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
中文摘要:SeqPE提出了一种统一且完全可学习的位置编码框架,通过符号序列表示位置索引并采用对比学习与知识蒸馏目标,显著提升了外推能力和多模态适应性。
English Summary: SeqPE introduces a unified, learnable position encoding framework that uses symbolic sequences and complementary objectives to enhance extrapolation and adaptability across various tasks and modalities.
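To make the symbolic-sequence idea concrete, here is a minimal PyTorch sketch (not the released SeqPE code): a position index such as (12, 7) is rendered as a padded digit sequence and fed to a small GRU that outputs its embedding. The digit vocabulary, zero-padding width, separator token, and GRU encoder are all illustrative assumptions; the paper's actual encoder and its contrastive and distillation objectives are not shown.

```python
import torch
import torch.nn as nn

def position_to_tokens(pos, width=3):
    """Render an n-dimensional position index as a symbolic digit sequence.

    E.g. (12, 7) with width=3 -> ['0','1','2','<sep>','0','0','7'].
    The zero-padding width and the separator token are illustrative choices.
    """
    tokens = []
    for i, p in enumerate(pos):
        if i > 0:
            tokens.append("<sep>")
        tokens.extend(str(p).zfill(width))
    return tokens

VOCAB = {tok: i for i, tok in enumerate(list("0123456789") + ["<sep>"])}

class SequentialPositionEncoder(nn.Module):
    """Tiny sequential encoder that maps a digit sequence to one embedding."""

    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, seq_len)
        h = self.embed(token_ids)                 # (batch, seq_len, dim)
        _, last = self.rnn(h)                     # final hidden state
        return last.squeeze(0)                    # (batch, dim)

encoder = SequentialPositionEncoder()
ids = torch.tensor([[VOCAB[t] for t in position_to_tokens((12, 7))]])
pe = encoder(ids)                                 # positional embedding for (12, 7)
print(pe.shape)                                   # torch.Size([1, 64])
```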
Authors:Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata
Abstract:
As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at https://github.com/ExplainableML/align-then-unlearn.
Chinese: Align-then-Unlearn框架通过在语义嵌入空间执行遗忘操作,有效移除大型语言模型中的特定知识,同时保持模型整体性能,为解决隐私问题提供了新途径。
English: The Align-then-Unlearn framework addresses privacy concerns in large language models by performing unlearning in the semantic embedding space, effectively removing targeted knowledge while preserving overall model utility.
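The core unlearning signal is easy to illustrate: fine-tune so that the embedding-prediction head's outputs become dissimilar from an embedding of the concept to be forgotten. The sketch below, which assumes simple (batch, dim) tensors and plain cosine similarity, illustrates that objective rather than the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(predicted_embeddings, target_concept_embedding):
    """Push predicted future-context embeddings away from a concept to forget.

    predicted_embeddings: (batch, dim) outputs of the embedding-prediction head.
    target_concept_embedding: (dim,) embedding representing the concept to remove.
    Minimising the mean cosine similarity reduces alignment with the concept;
    the exact loss used in the paper may differ -- this is only an illustration.
    """
    cos = F.cosine_similarity(
        predicted_embeddings, target_concept_embedding.unsqueeze(0), dim=-1
    )
    return cos.mean()

# toy usage with random tensors standing in for model outputs
pred = torch.randn(8, 256, requires_grad=True)
target = torch.randn(256)
loss = unlearning_loss(pred, target)
loss.backward()   # gradients push predictions away from the target concept
print(float(loss))
```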
Authors:Ting Qiao, Yiming Li, Jianbin Li, Yingjia Wang, Leyi Qi, Junfeng Guo, Ruili Feng, Dacheng Tao
Abstract:
Deep neural networks (DNNs) rely heavily on high-quality open-source datasets (e.g., ImageNet) for their success, making dataset ownership verification (DOV) crucial for protecting public dataset copyrights. In this paper, we find existing DOV methods (implicitly) assume that the verification process is faithful, where the suspicious model will directly verify ownership by using the verification samples as input and returning their results. However, this assumption may not necessarily hold in practice and their performance may degrade sharply when subjected to intentional or unintentional perturbations. To address this limitation, we propose the first certified dataset watermark (i.e., CertDW) and CertDW-based certified dataset ownership verification method that ensures reliable verification even under malicious attacks, under certain conditions (e.g., constrained pixel-level perturbation). Specifically, inspired by conformal prediction, we introduce two statistical measures, including principal probability (PP) and watermark robustness (WR), to assess model prediction stability on benign and watermarked samples under noise perturbations. We prove there exists a provable lower bound between PP and WR, enabling ownership verification when a suspicious model's WR value significantly exceeds the PP values of multiple benign models trained on watermark-free datasets. If the number of PP values smaller than WR exceeds a threshold, the suspicious model is regarded as having been trained on the protected dataset. Extensive experiments on benchmark datasets verify the effectiveness of our CertDW method and its resistance to potential adaptive attacks. Our codes are at \href{https://github.com/NcepuQiaoTing/CertDW}{GitHub}.
中文: 本文提出CertDW认证数据集水印方法,通过统计指标评估模型预测稳定性,在恶意攻击下仍能实现可靠的数据集所有权验证,并提供可证明的保障。
English: This paper introduces CertDW, a certified dataset watermarking method that ensures reliable dataset ownership verification under malicious attacks by leveraging statistical measures to assess model prediction stability and providing provable guarantees.
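The verification rule in the abstract reduces to a simple count: ownership is claimed when the suspicious model's WR exceeds the PP values of sufficiently many benign models. A toy version of that decision, with made-up scores and without the conformal-prediction machinery used to estimate PP and WR:

```python
import numpy as np

def certdw_verdict(wr_suspicious, pp_benign, threshold):
    """Toy version of the decision rule described in the abstract.

    wr_suspicious: watermark-robustness (WR) score of the suspicious model.
    pp_benign: principal-probability (PP) scores of benign models trained on
               watermark-free data.
    threshold: how many benign PP values must fall below WR to claim ownership.
    How PP/WR are estimated (noise distribution, sample sizes) is not shown here.
    """
    votes = int(np.sum(np.asarray(pp_benign) < wr_suspicious))
    return votes > threshold, votes

# illustrative numbers only
flag, votes = certdw_verdict(wr_suspicious=0.82,
                             pp_benign=[0.41, 0.55, 0.63, 0.47, 0.71],
                             threshold=3)
print(flag, votes)   # True, 5 -> treated as trained on the protected dataset
```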
Authors:Thomas Möllenhoff, Siddharth Swaroop, Finale Doshi-Velez, Mohammad Emtiyaz Khan
Abstract:
ADMM is a popular method for federated deep learning which originated in the 1970s and, even though many new variants of it have been proposed since then, its core algorithmic structure has remained unchanged. Here, we take a major departure from the old structure and present a fundamentally new way to derive and extend federated ADMM. We propose to use a structure called Bayesian Duality which exploits a duality of the posterior distributions obtained by solving a variational-Bayesian reformulation of the original problem. We show that this naturally recovers the original ADMM when isotropic Gaussian posteriors are used, and yields non-trivial extensions for other posterior forms. For instance, full-covariance Gaussians lead to Newton-like variants of ADMM, while diagonal covariances result in a cheap Adam-like variant. This is especially useful to handle heterogeneity in federated deep learning, giving up to 7% accuracy improvements over recent baselines. Our work opens a new Bayesian path to improve primal-dual methods.
Chinese: 本研究提出了一种基于贝叶斯对偶的新方法,从根本上重构了联邦ADMM,通过引入牛顿式和Adam式变体,在异构联邦学习环境中将准确率提升高达7%。
English: The study introduces a novel Bayesian duality approach to fundamentally reformulate federated ADMM, enabling extensions like Newton-like and Adam-like variants that improve accuracy by up to 7% in heterogeneous federated learning settings.
Authors:Qingfeng Chen, Shiyuan Li, Yixin Liu, Shirui Pan, Geoffrey I. Webb, Shichao Zhang
Abstract:
Graph neural networks (GNNs) excel in graph representation learning by integrating graph structure and node features. Existing GNNs, unfortunately, fail to account for the uncertainty of class probabilities that vary with the depth of the model, leading to unreliable and risky predictions in real-world scenarios. To bridge the gap, in this paper, we propose a novel Evidence Fusing Graph Neural Network (EFGNN for short) to achieve trustworthy prediction, enhance node classification accuracy, and make explicit the risk of wrong predictions. In particular, we integrate the evidence theory with multi-hop propagation-based GNN architecture to quantify the prediction uncertainty of each node with the consideration of multiple receptive fields. Moreover, a parameter-free cumulative belief fusion (CBF) mechanism is developed to leverage the changes in prediction uncertainty and fuse the evidence to improve the trustworthiness of the final prediction. To effectively optimize the EFGNN model, we carefully design a joint learning objective composed of evidence cross-entropy, dissonance coefficient, and false confident penalty. The experimental results on various datasets and theoretical analyses demonstrate the effectiveness of the proposed model in terms of accuracy and trustworthiness, as well as its robustness to potential attacks. The source code of EFGNN is available at https://github.com/Shiy-Li/EFGNN.
Chinese: 本文提出了一种新颖的证据融合图神经网络(EFGNN),通过将证据理论与多跳传播相结合,提高了节点分类的准确性并量化预测不确定性,从而增强了预测的可靠性和抗攻击鲁棒性。
English: This paper introduces a novel Evidence Fusing Graph Neural Network (EFGNN) that integrates evidence theory with multi-hop propagation to enhance node classification accuracy and quantify prediction uncertainty, achieving improved trustworthiness and robustness against attacks.
Authors:Kai Tang, Ji Zhang, Hua Meng, Minbo Ma, Qi Xiong, Fengmao Lv, Jie Xu, Tianrui Li
Abstract:
Multivariate time series forecasting (MTSF) is a critical task with broad applications in domains such as meteorology, transportation, and economics. Nevertheless, pervasive missing values caused by sensor failures or human errors significantly degrade forecasting accuracy. Prior efforts usually employ an impute-then-forecast paradigm, leading to suboptimal predictions due to error accumulation and misaligned objectives between the two stages. To address this challenge, we propose the Collaborative Imputation-Forecasting Network (CoIFNet), a novel framework that unifies imputation and forecasting to achieve robust MTSF in the presence of missing values. Specifically, CoIFNet takes the observed values, mask matrix and timestamp embeddings as input, processing them sequentially through the Cross-Timestep Fusion (CTF) and Cross-Variate Fusion (CVF) modules to capture temporal dependencies that are robust to missing values. We provide theoretical justifications on how our CoIFNet learning objective improves the performance bound of MTSF with missing values. Through extensive experiments on challenging MSTF benchmarks, we demonstrate the effectiveness and computational efficiency of our proposed approach across diverse missing-data scenarios, e.g., CoIFNet outperforms the state-of-the-art method by $\underline{\textbf{24.40}}$% ($\underline{\textbf{23.81}}$%) at a point (block) missing rate of 0.6, while improving memory and time efficiency by $\underline{\boldsymbol{4.3\times}}$ and $\underline{\boldsymbol{2.1\times}}$, respectively. Our code is available at: https://github.com/KaiTang-eng/CoIFNet.
Chinese: 提出的CoIFNet框架将填补与预测统一起来,在存在缺失值的多元时间序列预测中提升了准确性,相比现有方法实现了显著的性能提升和计算效率。
English: The proposed CoIFNet framework unifies imputation and forecasting to enhance multivariate time series forecasting accuracy in the presence of missing values, achieving significant performance improvements and computational efficiency over existing methods.
Authors:Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Abstract:
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.
Chinese: 提出的多极注意力方法通过仅对关键标记计算精确注意力并对其他标记进行聚类近似,加速大型推理模型的自回归推理过程,在保持精度的同时实现了长上下文应用中高达4.5倍的注意力加速。
English: The proposed Multipole Attention method accelerates autoregressive reasoning in Large Reasoning Models by computing exact attention only for key tokens and approximating others through clustering, maintaining accuracy while achieving up to 4.5× speedup in long-context applications.
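As a rough illustration of the centroid idea (not the paper's kernels or its fast cluster-update procedure), the NumPy sketch below clusters the keys, computes exact attention logits for the clusters closest to the query, and approximates the remaining clusters by their centroid logit and mean value. The k-means routine, cluster counts, and value averaging are all assumptions.

```python
import numpy as np

def kmeans(keys, k, iters=10, seed=0):
    """Plain k-means over key vectors (stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((keys[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = keys[assign == c].mean(0)
    return centroids, assign

def multipole_attention(q, K, V, k=8, exact_clusters=2):
    """Exact attention for keys in the clusters closest to the query,
    centroid-level approximation for everything else (single query, toy scale)."""
    centroids, assign = kmeans(K, k)
    order = np.argsort(-centroids @ q)           # most relevant clusters first
    exact = np.isin(assign, order[:exact_clusters])

    scores = [K[exact] @ q]                      # exact logits for important keys
    values = [V[exact]]
    for c in order[exact_clusters:]:             # approximate remaining clusters
        members = assign == c
        if not members.any():
            continue
        # one centroid logit per member, value approximated by the cluster mean
        scores.append(np.full(members.sum(), centroids[c] @ q))
        values.append(np.repeat(V[members].mean(0, keepdims=True), members.sum(), 0))
    logits = np.concatenate(scores)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ np.concatenate(values)

rng = np.random.default_rng(1)
q, K, V = rng.normal(size=64), rng.normal(size=(256, 64)), rng.normal(size=(256, 32))
print(multipole_attention(q, K, V).shape)        # (32,)
```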
Authors:Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma
Abstract:
Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and activating entirely absent reasoning capabilities, while conventional pipelines that initiate with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and face suboptimal convergence. In this work, we introduce \textbf{Metis-RISE} (\textbf{R}L \textbf{I}ncentivizes and \textbf{S}FT \textbf{E}nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE distinctively omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. Subsequently, the targeted SFT stage addresses two key challenges identified during RL: (1) \textit{inefficient trajectory sampling} for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) \textit{fundamental capability absence}, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall. Please refer to our project page for open-source information.
Chinese: 本文提出Metis-RISE多模态推理模型,该方法先通过强化学习激发模型推理能力,再采用监督微调解决采样效率低和知识缺失问题,在评测中取得了同类模型的最佳性能。
English: This paper introduces Metis-RISE, a novel multimodal reasoning model that starts with reinforcement learning to activate reasoning capabilities before applying supervised fine-tuning to address inefficiencies and knowledge gaps, achieving state-of-the-art performance on benchmarks.
Authors:Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
Abstract:
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision--language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
中文摘要:本研究提出一个多尺度多晶体数据集及两项基于物理原理的评估基准,通过空间排除和成分排除协议,系统测试多模态基础模型在晶体学推理中的泛化能力、物理一致性及可靠性。
English Summary: This study introduces a multiscale multicrystal dataset with two physics-grounded benchmarks to evaluate multimodal foundation models' crystallographic reasoning by testing their generalization, physical consistency, and reliability through structured exclusion protocols.
Authors:Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Abstract:
The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally constrained by \emph{catastrophic forgetting}, i.e., a persistent challenge where models experience performance degradation on previously learned tasks when adapting to new tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative AI in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative AI models, encompassing large language models, multimodal large language models, vision-language-action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, thereby providing deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.
中文: 本综述系统梳理了生成式AI模型的持续学习方法,通过借鉴人类记忆机制提出三种解决灾难性遗忘的范式,并深入分析了不同模型的训练设置与评估基准。
English: This survey systematically categorizes and analyzes continual learning methods for generative AI models to address catastrophic forgetting, drawing inspiration from human memory mechanisms and examining various training setups and benchmarks.
Authors:Siqi Liang, Yudi Zhang, Yubo Wang
Abstract:
Sequential recommender systems aim to model users' evolving preferences by capturing patterns in their historical interactions. Recent advances in this area have leveraged deep neural networks and attention mechanisms to effectively represent sequential behaviors and time-sensitive interests. In this work, we propose C-TLSAN (Content-Enhanced Time-Aware Long- and Short-Term Attention Network), an extension of the TLSAN architecture that jointly models long- and short-term user preferences while incorporating semantic content associated with items, such as product descriptions.
C-TLSAN enriches the recommendation pipeline by embedding textual content linked to users' historical interactions directly into both long-term and short-term attention layers. This allows the model to learn from both behavioral patterns and rich item content, enhancing user and item representations across temporal dimensions. By fusing sequential signals with textual semantics, our approach improves the expressiveness and personalization capacity of recommendation systems.
We conduct extensive experiments on large-scale Amazon datasets, benchmarking C-TLSAN against state-of-the-art baselines, including recent sequential recommenders based on Large Language Models (LLMs), which represent interaction history and predictions in text form. Empirical results demonstrate that C-TLSAN consistently outperforms strong baselines in next-item prediction tasks. Notably, it improves AUC by 1.66%, Recall@10 by 93.99%, and Precision@10 by 94.80% on average over the best-performing baseline (TLSAN) across 10 Amazon product categories. These results highlight the value of integrating content-aware enhancements into temporal modeling frameworks for sequential recommendation. Our code is available at https://github.com/booml247/cTLSAN.
中文: 本文提出的C-TLSAN模型通过将物品语义内容融入长短时注意力层来增强时序偏好建模,在亚马逊数据集上的实验表明其显著超越了现有最优基线模型。
English: This paper introduces C-TLSAN, a sequential recommendation model that enhances temporal preference modeling by integrating item content into both long- and short-term attention layers, achieving significant performance improvements over state-of-the-art baselines on Amazon datasets.
Authors:Christian Zhou-Zheng, Philippe Pasquier
Abstract:
Existing work in automatic music generation has primarily focused on end-to-end systems that produce complete compositions or continuations. However, because musical composition is typically an iterative process, such systems make it difficult to engage in the back-and-forth between human and machine that is essential to computer-assisted creativity. In this study, we address the task of personalizable, multi-track, long-context, and controllable symbolic music infilling to enhance the process of computer-assisted composition. We present MIDI-RWKV, a novel model based on the RWKV-7 linear architecture, to enable efficient and coherent musical cocreation on edge devices. We also demonstrate that MIDI-RWKV admits an effective method of finetuning its initial state for personalization in the very-low-sample regime. We evaluate MIDI-RWKV and its state tuning on several quantitative and qualitative metrics, and release model weights and code at https://github.com/christianazinn/MIDI-RWKV.
Chinese: 本研究提出了MIDI-RWKV这一新型模型,通过其线性架构和状态调优功能,在边缘设备上实现个性化、可控的符号音乐填充,从而有效促进计算机辅助音乐创作中的协同创作过程。
English: This study introduces MIDI-RWKV, a novel model for personalized and controllable symbolic music infilling that enhances computer-assisted composition by enabling efficient musical co-creation on edge devices through its linear architecture and state tuning capabilities.
Authors:Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao
Abstract:
The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in the practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the super large combinatorial space, we propose a novel update method by introducing a moving average tracker of loss residuals instead of vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
中文摘要:该摘要提出了一种名为MaskPro的概率框架,通过学习权重分布的稀疏模式并采用创新的梯度稳定方法,有效提升了大型语言模型的推理效率,展现出卓越的性能和可扩展性。
English Summary: The abstract introduces MaskPro, a probabilistic framework designed to enhance inference efficiency in large language models by learning sparsity patterns through categorical distributions and a novel gradient stabilization method, achieving superior performance and scalability.
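Two pieces of the method translate directly into short code: drawing an (N:M) mask by N-way sampling without replacement from a per-group categorical distribution, and tracking a moving average of loss residuals as a baseline in place of the raw loss. The sketch below uses torch.multinomial and a fixed decay as illustrative choices; the full policy-gradient update is omitted.

```python
import torch

def sample_nm_mask(logits, n=2):
    """Sample an (N:M) sparsity mask from per-group categorical logits.

    logits: (num_groups, M) learnable scores for the M weights in each group.
    Returns a binary mask keeping exactly n entries per group, drawn without
    replacement, mirroring the N-way sampling described in the abstract.
    """
    probs = torch.softmax(logits, dim=-1)
    kept = torch.multinomial(probs, n, replacement=False)   # (num_groups, n)
    mask = torch.zeros_like(probs)
    return mask.scatter_(1, kept, 1.0)

class ResidualBaseline:
    """Moving-average tracker of loss residuals used in place of the raw loss
    when forming the policy-gradient signal (illustrative; decay is assumed)."""

    def __init__(self, decay=0.99):
        self.decay, self.avg = decay, None

    def residual(self, loss):
        if self.avg is None:
            self.avg = loss
        r = loss - self.avg
        self.avg = self.decay * self.avg + (1 - self.decay) * loss
        return r

logits = torch.zeros(6, 4)                       # six groups of M=4 weights
mask = sample_nm_mask(logits, n=2)               # 2:4 semi-structured mask
print(mask.sum(dim=1))                           # exactly N=2 kept per group

tracker = ResidualBaseline()
print(tracker.residual(1.00), tracker.residual(0.90))  # 0.0, then a negative residual
```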
Authors:Yuxuan Wang, Ming Yang, Ziluo Ding, Yu Zhang, Weishuai Zeng, Xinrun Xu, Haobin Jiang, Zongqing Lu
Abstract:
Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world. The project webpage is available at https://beingbeyond.github.io/BumbleBee/.
Authors:Xiaoyan Kui, Canwei Liu, Qinsong Li, Zhipeng Hu, Yangyang Shi, Weixin Si, Beiji Zou
Abstract:
Kolmogorov-Arnold Networks (KANs) are highly effective in long-term time series forecasting due to their ability to efficiently represent nonlinear relationships and exhibit local plasticity. However, prior research on KANs has predominantly focused on the time domain, neglecting the potential of the frequency domain. The frequency domain of time series data reveals recurring patterns and periodic behaviors, which complement the temporal information captured in the time domain. To address this gap, we explore the application of KANs in the frequency domain for long-term time series forecasting. By leveraging KANs' adaptive activation functions and their comprehensive representation of signals in the frequency domain, we can more effectively learn global dependencies and periodic patterns. To integrate information from both time and frequency domains, we propose the $\textbf{T}$ime-$\textbf{F}$requency KAN (TFKAN). TFKAN employs a dual-branch architecture that independently processes features from each domain, ensuring that the distinct characteristics of each domain are fully utilized without interference. Additionally, to account for the heterogeneity between domains, we introduce a dimension-adjustment strategy that selectively upscales only in the frequency domain, enhancing efficiency while capturing richer frequency information. Experimental results demonstrate that TFKAN consistently outperforms state-of-the-art (SOTA) methods across multiple datasets. The code is available at https://github.com/LcWave/TFKAN.
中文: 提出的时频KAN(TFKAN)通过双分支架构和选择性维度调整整合时域与频域信息,有效提升了长期时间序列预测性能,在多个数据集上超越了现有最优方法。
English: The proposed Time-Frequency KAN (TFKAN) integrates both time and frequency domains using a dual-branch architecture and selective dimension adjustment to enhance long-term time series forecasting, outperforming existing methods across multiple datasets.
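A stripped-down dual-branch forecaster conveys the time/frequency split: one branch maps the raw window to the forecast horizon, the other maps the rFFT spectrum (widened, echoing the frequency-only dimension upscaling) to the horizon, and the two are fused. Plain Linear layers stand in for KAN layers here, so this is only a structural sketch, not TFKAN itself.

```python
import torch
import torch.nn as nn

class DualBranchForecaster(nn.Module):
    """Toy dual-branch model in the spirit of TFKAN's time/frequency split.

    Linear layers stand in for KAN layers; the real model, its dimension-
    adjustment strategy, and its fusion scheme differ from this sketch.
    """

    def __init__(self, lookback, horizon, freq_dim=256):
        super().__init__()
        n_freq = lookback // 2 + 1                      # rfft output length
        self.time_branch = nn.Linear(lookback, horizon)
        # "upscale only in the frequency domain": widen before projecting back
        self.freq_branch = nn.Sequential(
            nn.Linear(2 * n_freq, freq_dim), nn.GELU(), nn.Linear(freq_dim, horizon)
        )

    def forward(self, x):                               # x: (batch, lookback)
        t_out = self.time_branch(x)
        spec = torch.fft.rfft(x, dim=-1)                # complex spectrum
        f_in = torch.cat([spec.real, spec.imag], dim=-1)
        f_out = self.freq_branch(f_in)
        return t_out + f_out                            # simple additive fusion

model = DualBranchForecaster(lookback=96, horizon=24)
y = model(torch.randn(8, 96))
print(y.shape)                                          # torch.Size([8, 24])
```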
Authors:M. H. Maqbool, Moghis Fereidouni, Umar Farooq, A. B. Siddique, Hassan Foroosh
Abstract:
The mobile app market has expanded exponentially, offering millions of apps with diverse functionalities, yet research in mobile app recommendation remains limited. Traditional sequential recommender systems utilize the order of items in users' historical interactions to predict the next item for the users. Position embeddings, well-established in transformer-based architectures for natural language processing tasks, effectively distinguish token positions in sequences. In sequential recommendation systems, position embeddings can capture the order of items in a user's historical interaction sequence. Nevertheless, this ordering does not consider the time elapsed between two interactions of the same user (e.g., 1 day, 1 week, 1 month), referred to as "user rhythm". In mobile app recommendation datasets, the time between consecutive user interactions is notably longer compared to other domains like movies, posing significant challenges for sequential recommender systems. To address this phenomenon in the mobile app domain, we introduce INTERPOS, an Interaction Rhythm Guided Positional Morphing strategy for autoregressive mobile app recommender systems. INTERPOS incorporates rhythm-guided position embeddings, providing a more comprehensive representation that considers both the sequential order of interactions and the temporal gaps between them. This approach enables a deep understanding of users' rhythms at a fine-grained level, capturing the intricacies of their interaction patterns over time. We propose three strategies to incorporate the morphed positional embeddings in two transformer-based sequential recommendation system architectures. Our extensive evaluations show that INTERPOS outperforms state-of-the-art models using 7 mobile app recommendation datasets on NDCG@K and HIT@K metrics. The source code of INTERPOS is available at https://github.com/dlgrad/INTERPOS.
中文: 本研究提出INTERPOS策略,将节奏引导的位置嵌入整合到基于Transformer的序列推荐系统中,以更全面地捕捉用户交互的顺序和时间间隔,在移动应用推荐任务中展现出卓越性能。
English: The study introduces INTERPOS, a novel strategy that integrates rhythm-guided position embeddings into transformer-based sequential recommender systems to better capture both the order and temporal gaps in user interactions, demonstrating superior performance in mobile app recommendations.
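A minimal version of rhythm-guided positions adds a log-bucketed embedding of the time gap between consecutive interactions to the usual order embedding. The bucketing scheme, embedding sizes, and additive combination below are assumptions; INTERPOS's three morphing strategies are more elaborate.

```python
import torch
import torch.nn as nn

class RhythmGuidedPositions(nn.Module):
    """Position embedding that also reflects the elapsed time between interactions.

    Log-spaced gap buckets and the additive combination are illustrative choices;
    INTERPOS's actual positional morphing strategies differ in detail.
    """

    def __init__(self, max_len=200, n_buckets=32, dim=64):
        super().__init__()
        self.order_emb = nn.Embedding(max_len, dim)
        self.gap_emb = nn.Embedding(n_buckets, dim)
        self.n_buckets = n_buckets

    def forward(self, timestamps):                    # (batch, seq_len), in seconds
        positions = torch.arange(timestamps.size(1), device=timestamps.device)
        gaps = torch.diff(timestamps, dim=1, prepend=timestamps[:, :1])
        buckets = torch.log1p(gaps.clamp(min=0).float()).long()
        buckets = buckets.clamp(max=self.n_buckets - 1)
        return self.order_emb(positions) + self.gap_emb(buckets)

pe = RhythmGuidedPositions()
ts = torch.cumsum(torch.randint(60, 60 * 60 * 24 * 30, (4, 20)), dim=1)  # long app-usage gaps
print(pe(ts).shape)                                   # torch.Size([4, 20, 64])
```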
Authors:Zain Muhammad Mujahid, Dilshod Azizov, Maha Tufail Agro, Preslav Nakov
Abstract:
In an age characterized by the proliferation of mis- and disinformation online, it is critical to empower readers to understand the content they are reading. Important efforts in this direction rely on manual or automatic fact-checking, which can be challenging for emerging claims with limited information. Such scenarios can be handled by assessing the reliability and the political bias of the source of the claim, i.e., characterizing entire news outlets rather than individual claims or articles. This is an important but understudied research direction. While prior work has looked into linguistic and social contexts, we do not analyze individual articles or information in social media. Instead, we propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. Specifically, we design a variety of prompts based on these criteria and elicit responses from large language models (LLMs), which we aggregate to make predictions. In addition to demonstrating sizable improvements over strong baselines via extensive experiments with multiple LLMs, we provide an in-depth error analysis of the effect of media popularity and region on model performance. Further, we conduct an ablation study to highlight the key components of our dataset that contribute to these improvements. To facilitate future research, we released our dataset and code at https://github.com/mbzuai-nlp/llm-media-profiling.
中文: 本研究提出了一种利用大型语言模型评估整个新闻媒体可靠性和政治偏见的新方法,通过关注信息来源特征而非单个声明,改进了传统的核实方式。
English: This study introduces a novel method using large language models to assess the reliability and political bias of entire news outlets, improving upon traditional fact-checking by focusing on source characterization rather than individual claims.
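The pipeline's skeleton is prompt-per-criterion followed by aggregation. The sketch below uses a canned `ask_llm` stand-in and made-up criterion strings (both purely hypothetical) and aggregates with a majority vote, which may differ from the paper's aggregation scheme.

```python
from collections import Counter

# Hypothetical stand-in for an LLM call; the paper prompts real LLMs with
# criteria modelled on professional fact-checkers' checklists.
def ask_llm(prompt: str) -> str:
    canned = {"factuality": "mixed", "sourcing": "low", "corrections": "low"}
    return canned[prompt.split(":")[0]]

CRITERIA = {
    "factuality": "factuality: Does {outlet} publish verifiably accurate reporting?",
    "sourcing": "sourcing: Does {outlet} cite and link primary sources?",
    "corrections": "corrections: Does {outlet} issue transparent corrections?",
}

def profile_outlet(outlet: str) -> str:
    """Aggregate per-criterion LLM judgments into one outlet-level label
    (majority vote here; the paper's aggregation may be more elaborate)."""
    answers = [ask_llm(tmpl.format(outlet=outlet)) for tmpl in CRITERIA.values()]
    return Counter(answers).most_common(1)[0][0]

print(profile_outlet("example-news.com"))   # -> "low" under the canned answers
```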
Authors:Catalin E. Brita, Hieu Nguyen, Lohithsai Yadala Chanchu, Domonkos Nagy, Maksim Zhdanov
Abstract:
Self-attention scales quadratically with input size, limiting its use for large-scale physical systems. Although sparse attention mechanisms provide a viable alternative, they are primarily designed for regular structures such as text or images, making them inapplicable for irregular geometries. In this work, we present Ball Sparse Attention (BSA), which adapts Native Sparse Attention (NSA) (Yuan et al., 2025) to unordered point sets by imposing regularity using the Ball Tree structure from the Erwin Transformer (Zhdanov et al., 2025). We modify NSA's components to work with ball-based neighborhoods, yielding a global receptive field at sub-quadratic cost. On an airflow pressure prediction task, we achieve accuracy comparable to Full Attention while significantly reducing the theoretical computational complexity. Our implementation is available at https://github.com/britacatalin/bsa.
中文:球稀疏注意力通过球树结构将稀疏注意力应用于不规则几何形状,在物理系统预测任务中以次二次复杂度实现了与全注意力相当的精度。
English: Ball Sparse Attention adapts sparse attention to irregular geometries using ball tree structures, achieving full-attention accuracy with sub-quadratic complexity on physical systems.
Authors:Nuwan Bandara, Thivya Kandappu, Archan Misra
Abstract:
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments. Our code implementations can be found at https://github.com/eye-tracking-for-physiological-sensing/EyeLoRiN.
中文: 本文提出了一种与模型无关的优化框架,通过运动感知滤波和光流局部优化来改进基于事件的视线估计,显著提升了视线信号的稳定性以支持认知状态解码,并引入了评估时间平滑度的新型抖动指标。
English: This paper presents a model-agnostic framework that enhances event-based gaze estimation through motion-aware filtering and optical flow refinement, improving signal consistency for cognitive state inference while introducing a novel jitter metric for temporal smoothness evaluation.
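Both post-processing ideas are simple to prototype: a motion-aware pass that median-filters only high-velocity (blink-like) samples, and a jitter score based on velocity regularity. The speed threshold, kernel size, and the use of speed variability as the jitter proxy are illustrative assumptions, not the released EyeLoRiN implementation.

```python
import numpy as np
from scipy.signal import medfilt

def refine_gaze(gaze, kernel=5, spike_factor=4.0):
    """Suppress blink-like spikes: replace only high-velocity samples with a
    median-filtered value so natural gaze dynamics are left untouched.
    Thresholding at a multiple of the median speed is an illustrative choice."""
    speed = np.linalg.norm(np.diff(gaze, axis=0, prepend=gaze[:1]), axis=1)
    spikes = speed > spike_factor * (np.median(speed) + 1e-8)
    smoothed = np.stack([medfilt(gaze[:, d], kernel) for d in range(gaze.shape[1])], 1)
    refined = gaze.copy()
    refined[spikes] = smoothed[spikes]
    return refined

def jitter_score(gaze):
    """Crude temporal-smoothness proxy: variability of frame-to-frame velocity.
    The paper's Jitter Metric also folds in local signal complexity."""
    speed = np.linalg.norm(np.diff(gaze, axis=0), axis=1)
    return float(np.std(speed) / (np.mean(speed) + 1e-8))

rng = np.random.default_rng(0)
trace = np.cumsum(rng.normal(0, 0.5, size=(300, 2)), axis=0)
trace[150] += 40.0                          # one simulated blink-induced spike
print(jitter_score(trace), jitter_score(refine_gaze(trace)))  # jitter drops after refinement
```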
Authors:Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung
Abstract:
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Chinese: 本研究提出了一种音频辅助的视频测试时自适应方法,利用音频生成的伪标签和灵活的自适应周期,在多个受损数据集上显著提升了模型性能。
English: This study introduces an audio-assisted test-time adaptation method for video that leverages audio-generated pseudo-labels and a flexible adaptation cycle to enhance model performance across multiple corrupted datasets.
Authors:Suyeon Kim, SeongKu Kang, Dongwoo Kim, Jungseul Ok, Hwanjo Yu
Abstract:
Graph Neural Networks (GNNs) have achieved state-of-the-art performance in node classification tasks but struggle with label noise in real-world data. Existing studies on graph learning with label noise commonly rely on class-dependent label noise, overlooking the complexities of instance-dependent noise and falling short of capturing real-world corruption patterns. We introduce BeGIN (Benchmarking for Graphs with Instance-dependent Noise), a new benchmark that provides realistic graph datasets with various noise types and comprehensively evaluates noise-handling strategies across GNN architectures, noisy label detection, and noise-robust learning. To simulate instance-dependent corruptions, BeGIN introduces algorithmic methods and LLM-based simulations. Our experiments reveal the challenges of instance-dependent noise, particularly LLM-based corruption, and underscore the importance of node-specific parameterization to enhance GNN robustness. By comprehensively evaluating noise-handling strategies, BeGIN provides insights into their effectiveness, efficiency, and key performance factors. We expect that BeGIN will serve as a valuable resource for advancing research on label noise in graphs and fostering the development of robust GNN training methods. The code is available at https://github.com/kimsu55/BeGIN.
中文摘要:BeGIN基准通过引入实例相关噪声模拟和全面评估抗噪策略,弥补了现有图学习方法的不足,揭示了LLM模拟噪声的特殊挑战,并强调节点特定参数化对提升图神经网络鲁棒性的关键作用。
English Summary: The BeGIN benchmark addresses limitations in existing graph learning methods by introducing realistic instance-dependent noise simulations and comprehensively evaluating noise-handling strategies, revealing the particular challenges of LLM-based corruption while highlighting node-specific parameterization as key for GNN robustness.
Authors:Zichuan Fu, Xian Wu, Guojing Li, Yingying Zhang, Yefeng Zheng, Tianshi Ming, Yejing Wang, Wanyu Wang, Xiangyu Zhao
Abstract:
Large Language Models (LLMs) require continuous updates to maintain accurate and current knowledge as the world evolves. While existing knowledge editing approaches offer various solutions for knowledge updating, they often struggle with sequential editing scenarios and harm the general capabilities of the model, thereby significantly hampering their practical applicability. This paper proposes a two-stage framework combining robust supervised fine-tuning (R-SFT) with model merging for knowledge editing. Our method first fine-tunes the LLM to internalize new knowledge fully, then merges the fine-tuned model with the original foundation model to preserve newly acquired knowledge and general capabilities. Experimental results demonstrate that our approach significantly outperforms existing methods in sequential editing while better preserving the original performance of the model, all without requiring any architectural changes. Code is available at: https://github.com/Applied-Machine-Learning-Lab/MM4KE.
中文: 本文提出了一种结合鲁棒监督微调和模型合并的两阶段框架,能在不改变架构的情况下有效更新大语言模型的知识并保持其通用能力,在连续编辑任务中显著优于现有方法。
English: This paper introduces a two-stage framework that combines robust supervised fine-tuning with model merging to effectively update knowledge in Large Language Models while preserving their general capabilities, outperforming existing methods in sequential editing without architectural modifications.
Authors:Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao
Abstract:
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing (NLP) tasks. The release of open-source LLMs like LLaMA and Qwen has triggered the development of numerous fine-tuned models tailored for various tasks and languages. In this paper, we explore an important question: is it possible to combine these specialized models to create a unified model with multi-task capabilities? We introduce Hierarchical Iterative Merging (Hi-Merging), a training-free method for unifying different specialized LLMs into a single model. Specifically, Hi-Merging employs model-wise and layer-wise pruning and scaling, guided by contribution analysis, to mitigate parameter conflicts. Extensive experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging's ability for multi-task learning. The results demonstrate that Hi-Merging consistently outperforms existing merging techniques and surpasses the performance of models fine-tuned on combined datasets in most scenarios. Code is available at: https://github.com/Applied-Machine-Learning-Lab/Hi-Merging.
中文: 本文提出Hi-Merging方法,通过分层剪枝和缩放将专业大语言模型统一为单一模型,在跨语言多任务学习中优于现有技术且无需额外训练。
English: This paper introduces Hi-Merging, a training-free method that unifies specialized LLMs into a single model through hierarchical pruning and scaling, demonstrating superior multi-task performance across languages compared to existing techniques.
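In the spirit of the pruning-and-scaling description (though without the contribution analysis that guides Hi-Merging's per-model and per-layer choices), a generic delta-merge looks like the sketch below: keep each expert's largest-magnitude deltas from the base model, scale them, and add them back.

```python
import torch

def merge_specialists(base_state, expert_states, keep_ratio=0.3, scale=0.5):
    """Generic layer-wise delta pruning + scaling merge (illustration only).

    For each expert, keep the largest-magnitude fraction of its delta from the
    base model, scale it, and add all surviving deltas onto the base weights.
    Hi-Merging additionally uses contribution analysis to set these choices per
    model and per layer; the fixed keep_ratio/scale here are assumptions.
    """
    merged = {k: v.clone() for k, v in base_state.items()}
    for expert in expert_states:
        for name, w in expert.items():
            delta = w - base_state[name]
            k = max(1, int(keep_ratio * delta.numel()))
            thresh = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
            pruned = torch.where(delta.abs() >= thresh, delta, torch.zeros_like(delta))
            merged[name] += scale * pruned
    return merged

# toy usage with a one-layer "model" represented as state dicts
base = {"fc.weight": torch.zeros(4, 4), "fc.bias": torch.zeros(4)}
expert_a = {k: v + torch.randn_like(v) for k, v in base.items()}
expert_b = {k: v + torch.randn_like(v) for k, v in base.items()}
merged = merge_specialists(base, [expert_a, expert_b])
print(merged["fc.weight"].count_nonzero())   # only the largest deltas survive
```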
Authors:Yue Wan, Xiaowei Jia, Xiang Lorraine Li
Abstract:
Chain-of-thought (CoT) prompting has been widely adopted to enhance the reasoning capabilities of large language models (LLMs). However, the effectiveness of CoT reasoning is inconsistent across tasks with different reasoning types. This work presents a novel perspective to understand CoT behavior through the lens of \textit{confirmation bias} in cognitive psychology. Specifically, we examine how model internal beliefs, approximated by direct question-answering probabilities, affect both reasoning generation ($Q \to R$) and reasoning-guided answer prediction ($QR \to A$) in CoT. By decomposing CoT into a two-stage process, we conduct a thorough correlation analysis in model beliefs, rationale attributes, and stage-wise performance. Our results provide strong evidence of confirmation bias in LLMs, such that model beliefs not only skew the reasoning process but also influence how rationales are utilized for answer prediction. Furthermore, the interplay between task vulnerability to confirmation bias and the strength of beliefs also provides explanations for CoT effectiveness across reasoning tasks and models. Overall, this study provides a valuable insight for the needs of better prompting strategies that mitigate confirmation bias to enhance reasoning performance. Code is available at \textit{https://github.com/yuewan2/biasedcot}.
Chinese: 本研究揭示了大型语言模型中的确认偏见会扭曲思维链提示中的推理过程和答案预测,解释了其在不同任务中效果不一致的原因,并提出了需要减少偏见的策略。
English: This study reveals that confirmation bias in large language models skews both the reasoning process and answer prediction in chain-of-thought prompting, explaining its inconsistent effectiveness across tasks and suggesting the need for bias-mitigating strategies.
Authors:Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
Abstract:
A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks -- a method we introduce as GrokAlign -- which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).
中文: 本研究阐释了深度网络中的"顿悟"现象是通过网络雅可比矩阵与训练数据的对齐实现的,并提出了GrokAlign这一雅可比正则化方法,相比传统权重衰减等方法能显著加快顿悟过程。
English: This study explains that grokking in deep networks is achieved through the alignment of the network's Jacobians with training data, leading to the introduction of GrokAlign, a Jacobian regularization method that accelerates grokking compared to traditional approaches like weight decay.
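As a rough illustration of the Jacobian-alignment idea above, the following PyTorch sketch penalizes the misalignment (one minus cosine similarity) between each training sample and a vector-Jacobian product of the network at that sample. It is a hedged approximation of the concept under a low-cost proxy (summed outputs instead of the full Jacobian), not the authors' GrokAlign implementation, and the 0.1 weighting in the usage comment is an arbitrary placeholder.
```python
import torch
import torch.nn.functional as F

def jacobian_alignment_penalty(model, x):
    """Encourage the network's input-Jacobian (summarized here by a
    vector-Jacobian product over the summed outputs) to point along the data."""
    x = x.clone().detach().requires_grad_(True)
    out = model(x).sum()                                      # scalar => one backward pass
    jac = torch.autograd.grad(out, x, create_graph=True)[0]   # same shape as x
    cos = F.cosine_similarity(jac.flatten(1), x.flatten(1), dim=1)
    return (1.0 - cos).mean()

# Hypothetical training step:
# loss = task_loss + 0.1 * jacobian_alignment_penalty(model, x)
```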
Authors:Ella Miray Rajaonson, Mahyar Rajabi Kochi, Luis Martin Mejia Mendoza, Seyed Mohamad Moosavi, Benjamin Sanchez-Lengeling
Abstract:
Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product used results from a mixture of chemicals. While being a vital part of the industry pipeline, the chemical mixture space remains relatively unexplored by the Machine Learning community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures, covering a corpus of 11 chemical mixtures property prediction tasks, from drug delivery formulations to battery electrolytes, totalling approximately 500k data points gathered and curated from 7 publicly available datasets. CheMixHub introduces various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub
中文: 本文提出CheMixHub分子混合物基准平台,涵盖11项性质预测任务和50万数据点,旨在推动混合物建模发展并加速化学制剂的配方优化与发现进程。
English: This paper introduces CheMixHub, a comprehensive benchmark for molecular mixtures featuring 11 property prediction tasks and 500k data points, designed to advance predictive modeling and accelerate chemical mixture development.
Authors:Yuan-Sen Ting
Abstract:
This textbook provides a systematic treatment of statistical machine learning for astronomical research through the lens of Bayesian inference, developing a unified framework that reveals connections between modern data analysis techniques and traditional statistical methods. We show how these techniques emerge from familiar statistical foundations. The consistently Bayesian perspective prioritizes uncertainty quantification and statistical rigor essential for scientific inference in astronomy. The textbook progresses from probability theory and Bayesian inference through supervised learning including linear regression with measurement uncertainties, logistic regression, and classification. Unsupervised learning topics cover Principal Component Analysis and clustering methods. We then introduce computational techniques through sampling and Markov Chain Monte Carlo, followed by Gaussian Processes as probabilistic nonparametric methods and neural networks within the broader statistical context. Our theory-focused pedagogical approach derives each method from first principles with complete mathematical development, emphasizing statistical insight and complementing with astronomical applications. We prioritize understanding why algorithms work, when they are appropriate, and how they connect to broader statistical principles. The treatment builds toward modern techniques including neural networks through a solid foundation in classical methods and their theoretical underpinnings. This foundation enables thoughtful application of these methods to astronomical research, ensuring proper consideration of assumptions, limitations, and uncertainty propagation essential for advancing astronomical knowledge in the era of large astronomical surveys.
中文: 这本教材通过贝叶斯推断为天文研究构建了统计机器学习的统一框架,从概率论基础延伸到神经网络等现代技术,始终强调不确定性量化和理论严谨性。
English: This textbook establishes a unified Bayesian framework for statistical machine learning in astronomy, progressing from foundational probability theory to modern techniques like neural networks while emphasizing uncertainty quantification and theoretical rigor.
Authors:Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson
Abstract:
Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against SOTA methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9\% improvement on the AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1\% (mAP).
中文: 自监督音频模型因主要基于单声道音频训练而在处理现实多声道音频时表现不足,为此提出的SSLAM方法显著提升了多声道音频处理能力,同时保持了在标准基准测试中的优异性能。
English: Self-supervised audio models often underperform with polyphonic audio due to training on monophonic datasets, prompting the development of SSLAM, which enhances performance on complex audio while maintaining excellence on standard benchmarks.
Authors:Wenyue Hua, Dujian Ding, Yile Gu, Yujie Ren, Kai Mei, Minghua Ma, William Yang Wang
Abstract:
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often do not prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for requests to large language models (LLMs), where the semantics of the process guide the scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
中文: 本文提出了一种针对大语言模型请求的语义调度算法,该算法根据任务内容的重要性确定优先级,并在医疗急救等关键应用中证明了其能有效缩短等待时间。
English: This paper introduces a semantic scheduling algorithm for large language model requests that prioritizes tasks based on their content significance, demonstrating its efficiency in minimizing waiting times for critical applications like medical emergencies.
Authors:Jackson Eshbaugh
Abstract:
Neural networks excel as function approximators, but their complexity often obscures the types of functions they learn, making it difficult to explain their behavior. To address this, the linearity score $λ(f)$ is introduced, a simple and interpretable diagnostic that quantifies how well a regression network's output can be mimicked by a linear model. Defined as the $R^2$ value between the network's predictions and those of a trained linear surrogate, $λ(f)$ measures linear decodability: the extent to which the network's behavior aligns with a structurally simple model. This framework is evaluated on both synthetic and real-world datasets, using dataset-specific networks and surrogates. High $λ(f)$ scores reliably indicate alignment with the network's outputs; however, they do not guarantee accuracy with respect to the ground truth. These results highlight the risk of using surrogate fidelity as a proxy for model understanding, especially in high-stakes regression tasks.
中文: 线性度评分λ(f)作为一种可解释的诊断工具,用于量化神经网络输出与线性替代模型的近似程度,但高分值并不保证与真实数据的准确性,警示在高风险回归任务中过度依赖替代模型保真度的风险。
English: The linearity score λ(f) is introduced as an interpretable diagnostic to measure how closely a neural network's outputs align with a linear surrogate model, though high scores don't guarantee accuracy with ground truth, cautioning against over-reliance on surrogate fidelity for model understanding.
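Because λ(f) is defined as the R² between the network's predictions and those of a trained linear surrogate, it can be computed in a few lines. The sketch below uses scikit-learn with a toy stand-in for the network; the function `f` and the data shapes are illustrative assumptions, not the paper's experimental setup.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def linearity_score(network_predict, X):
    """lambda(f): R^2 between the network's predictions and a linear surrogate
    trained to mimic them (a sketch of the diagnostic described above)."""
    y_net = network_predict(X)                        # network outputs on the data
    surrogate = LinearRegression().fit(X, y_net)      # linear surrogate of the network
    return r2_score(y_net, surrogate.predict(X))

# Example with a mildly nonlinear "network" (hypothetical stand-in):
X = np.random.randn(500, 3)
f = lambda X: X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.sin(X[:, 0])
print(linearity_score(f, X))                          # close to 1.0: nearly linear
```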
Authors:Yewei Liu, Xiyuan Wang, Muhan Zhang
Abstract:
Network pruning, aimed at reducing network size while preserving accuracy, has attracted significant research interest. Numerous pruning techniques have been proposed over time. They are becoming increasingly effective, but more complex and harder to interpret as well. Given the inherent complexity of neural networks, we argue that manually designing pruning criteria has reached a bottleneck. To address this, we propose a novel approach in which we "use a neural network to prune neural networks". More specifically, we introduce the newly developed idea of metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically which can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning needs nothing more than a feedforward through the metanetwork and the standard finetuning to prune at state-of-the-art. Our method achieved outstanding results on many popular and representative pruning tasks (including ResNet56 on CIFAR10, VGG19 on CIFAR100, ResNet50 on ImageNet). Our code is available at https://github.com/Yewei-Liu/MetaPruning
Chinese: 我们提出了一种全新的元学习网络剪枝框架,通过元网络自动学习复杂剪枝规则,无需针对每个任务进行专门训练即可在各种网络上实现最先进的剪枝效果。
English: We introduce a novel meta-learning framework for network pruning that automatically learns complex pruning rules through a metanetwork, achieving state-of-the-art results across various networks without requiring special training for each task.
Authors:Hao Gu, Lujun Li, Zheyu Wang, Bei Liu, Qiyuan Zhu, Sirui Han, Yike Guo
Abstract:
Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to $\pm$1 for maximal memory and computational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask management, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages adaptive weight transformation and binary pattern clustering to overcome these limitations, delivering both superior accuracy and efficiency. Our approach incorporates two key innovations: (1) a Learnable Transformation that optimizes invertible scaling and rotation matrices to align binarized weights with full-precision distributions, enabling incoherence processing to enhance layer-wise representation quality; (2) a Flash and Accurate Binary Codebook that identifies recurring binary vector clusters, compressing them into compact indices with tailored distance metrics and sign-based centroid updates. This eliminates the need for sparse masks, enabling efficient inference on standard hardware. Our code is available at https://github.com/Chooovy/BTC-LLM.
中文:BTC-LLM是一种新颖的亚1比特量化框架,通过自适应权重变换和二进制模式聚类克服了性能和硬件限制,无需稀疏掩码即可实现卓越效率。
English: BTC-LLM is a novel sub-1-bit quantization framework that overcomes performance and hardware limitations through adaptive weight transformation and binary pattern clustering, achieving superior efficiency without sparse masks.
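To make the binary-codebook idea concrete, here is a minimal NumPy sketch that binarizes a weight matrix to ±1, splits it into short binary vectors, and clusters them with Hamming distance and majority-sign centroid updates. The vector length, codebook size, and iteration count are arbitrary assumptions, and the actual BTC-LLM pipeline (learnable transformations, tailored distance metrics) is considerably more involved.
```python
import numpy as np

def binary_codebook(W, vec_len=8, k=16, iters=10, seed=0):
    """Cluster +/-1 weight vectors into a compact codebook of binary centroids."""
    rng = np.random.default_rng(seed)
    B = np.where(W >= 0, 1, -1).astype(np.int32)
    vecs = B.reshape(-1, vec_len)                     # assumes W.size is divisible by vec_len
    centroids = vecs[rng.choice(len(vecs), size=k, replace=False)].copy()
    for _ in range(iters):
        # For +/-1 vectors, Hamming distance = (vec_len - dot product) / 2.
        dist = (vec_len - vecs @ centroids.T) // 2
        assign = dist.argmin(axis=1)
        for c in range(k):
            members = vecs[assign == c]
            if len(members):                          # sign-based (majority-vote) update
                centroids[c] = np.where(members.sum(axis=0) >= 0, 1, -1)
    return centroids, assign                          # store small indices, not raw weights

codebook, idx = binary_codebook(np.random.randn(64, 64))
print(codebook.shape, idx.shape)                      # (16, 8) (512,)
```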
Authors:Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang
Abstract:
Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific light-weight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning is then performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which thus can be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU, bringing efficient and practical model adaptation to individual users.
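中文: EMLoC是一种基于模拟器的内存高效微调框架,通过激活感知SVD在小型校准集上构建轻量级模拟器,并用补偿算法校正微调后的LoRA模块,使微调的内存开销与推理相当,可在单张24GB消费级GPU上微调38B模型。
English: EMLoC is an emulator-based memory-efficient fine-tuning framework that builds a lightweight emulator via activation-aware SVD on a small calibration set and corrects the fine-tuned LoRA module for emulator-induced misalignment, enabling fine-tuning within an inference-level memory budget, such as a 38B model on a single 24GB consumer GPU.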
Authors:Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, Dongping Chen
Abstract:
Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake\_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.
中文:大型语言模型正显著影响现实世界的代码风格,通过对数万个GitHub仓库的分析,发现命名规范等编程风格正呈现与AI生成代码一致的变化趋势。
English: Large Language Models are measurably influencing real-world coding styles, as evidenced by trends in naming conventions and other style elements across thousands of GitHub repositories.
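A style measurement like the snake_case trend above can be approximated with a short script. The sketch below counts assigned variable names in a Python source string and reports the snake_case share; it uses a simplified definition (lowercase with at least one underscore), which may differ from the paper's exact criteria and tooling.
```python
import ast

def snake_case_ratio(source: str) -> float:
    """Fraction of assigned variable names in a Python source that look snake_case."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)                         # variable being assigned to
    if not names:
        return 0.0
    is_snake = lambda n: n.islower() and "_" in n
    return sum(is_snake(n) for n in names) / len(names)

print(snake_case_ratio("my_var = 1\ncamelCase = 2\nanother_one = 3\n"))  # ~0.67
```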
Authors:Paul Setinek, Gianluca Galletti, Thomas Gross, Dominik Schnürer, Johannes Brandstetter, Werner Zellinger
Abstract:
Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions. Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established domain adaptation methods to state of the art neural surrogates and systematically evaluate them. These approaches use parametric descriptions and ground truth simulations from multiple source configurations, together with only parametric descriptions from target configurations. The goal is to accurately predict target simulations without access to ground truth simulation data. Extensive experiments on SIMSHIFT highlight the challenges of out of distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift
中文: 神经PDE代理模型在未见配置下表现不佳,而领域自适应技术可利用目标域的参数化数据(无需真实模拟数据)提升其泛化能力。
English: Neural PDE surrogates struggle with unseen configurations, but domain adaptation techniques can improve their generalization using parametric data from target domains without ground truth simulations.
Authors:Korbinian Pöppel, Richard Freinschlag, Thomas Schmied, Wei Lin, Sepp Hochreiter
Abstract:
Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a higher level structure, like 2D grids, trees, and directed acyclic graphs (DAGs). In this work, we extend the notion of multi-dimensionality to linear RNNs. We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates that act on the line graph of a general DAG. This enables parallelization in analogy to parallel associative scans and the chunkwise-recurrent form of sequential linear RNNs, but for DAGs. For regular grids (1D and 2D), like images, this scheme can be efficiently implemented using einsum operations, concatenations, and padding in logarithmic time. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes: a directed propagation mode (P-mode) and a diffusive distribution mode (D-mode). To showcase the long-range capabilities of pLSTM, we introduce arrow-pointing extrapolation as a synthetic computer vision task that contains long-distance directional information. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate. On established molecular graph and computer vision benchmarks, pLSTMs also show strong performance. Code and Datasets are available at: https://github.com/ml-jku/plstm_experiments.
Chinese Summary: 本研究提出了可并行化的线性源转换标记网络(pLSTMs),将多维线性RNNs扩展到有向无环图处理,通过并行计算解决长距离依赖问题,在需要方向信息的任务和泛化能力上优于Transformer模型。
English Summary: The study introduces parallelizable Linear Source Transition Mark networks (pLSTMs), which extend multi-dimensional linear RNNs to handle directed acyclic graphs with parallel processing and address long-range dependency issues, outperforming Transformers in tasks requiring directional information and generalization.
Authors:Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Abstract:
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
中文: 本研究提出了一种基于对比表征学习的大语言模型防御框架,通过三元组损失和对抗性负样本挖掘增强模型鲁棒性,在保持标准性能的同时有效抵御输入级和嵌入空间攻击。
English: This study introduces a contrastive representation learning framework for defending Large Language Models against adversarial attacks, utilizing triplet loss and adversarial mining to enhance robustness without sacrificing standard performance.
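The defense above can be pictured as a triplet objective over hidden representations. The following PyTorch sketch pairs benign representations as anchor/positive and mines the closest harmful representation as a hard negative; the tensor names, the pooling of hidden states, and the margin are assumptions for illustration, not the paper's exact recipe.
```python
import torch
import torch.nn.functional as F

def crl_triplet_loss(benign_reps, harmful_reps, margin=1.0):
    """Pull benign representations together and push the hardest harmful
    representation (a mined hard negative) away by at least `margin`."""
    # Anchor/positive: pairs of benign representations (shifted by one).
    anchor, positive = benign_reps, benign_reps.roll(1, dims=0)
    # Hard negative mining: for each anchor, the closest harmful representation.
    dists = torch.cdist(anchor, harmful_reps)            # (n_benign, n_harmful)
    negative = harmful_reps[dists.argmin(dim=1)]
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# Hypothetical usage with pooled hidden states from a chosen layer:
# loss = task_loss + lam * crl_triplet_loss(h_benign, h_harmful)
```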
Authors:Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong
Abstract:
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for a separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.
中文摘要:TreeRL提出了一种基于策略树搜索的强化学习框架,通过策略性分支和中间监督提升推理性能,无需单独训练奖励模型。
English Summary: TreeRL introduces an on-policy tree search reinforcement learning framework that enhances reasoning performance through strategic branching and intermediate supervision, eliminating the need for separate reward models.
Authors:Maximilian Kreutner, Marlene Lutz, Markus Strohmaier
Abstract:
Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.
中文摘要:研究表明,零样本角色提示能有效模拟个人投票决策并预测欧洲群体的政策立场,对欧洲议会议员投票行为的模拟加权F1分数达到约0.793。
English Summary: This study demonstrates that zero-shot persona prompting can effectively simulate individual voting decisions and predict policy positions of European groups, achieving a weighted F1 score of approximately 0.793 for simulating European Parliament members' voting behavior.
Authors:Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
Abstract:
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbf{long-short alignment} -- the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.
Chinese: 本研究提出了大语言模型中长度泛化的新视角,强调输出分布的长短对齐重要性,通过设计量化指标和正则化方法,有效提升了长上下文任务的性能表现。
English: This research introduces a novel perspective on length generalization in large language models by emphasizing the importance of long-short alignment in output distributions, proposing a metric and regularization method that significantly improves performance on long-context tasks.
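The abstract describes long-short alignment as consistency of output distributions across sequence lengths. As a loose illustration only (the paper's Long-Short Misalignment metric may be defined differently), the sketch below computes a symmetric KL divergence between next-token distributions obtained from a full context and a truncated one.
```python
import torch
import torch.nn.functional as F

def long_short_divergence(logits_long, logits_short):
    """Symmetric KL between next-token distributions under a long context and a
    truncated (short) context; logits are assumed to have shape (batch, vocab)."""
    p = F.log_softmax(logits_long, dim=-1)
    q = F.log_softmax(logits_short, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(P || Q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)
```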
Authors:Emre Kavak, Tom Nuno Wolf, Christian Wachinger
Abstract:
Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO$_m$ and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in black-box models. Across five diverse datasets, our methods consistently outperform or are competitive in existing bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction. Source Code: https://github.com/***.
中文: 本文提出因果框架和高效估计器来解决深度学习中的数据集偏差问题,在多种数据集上以最少超参数实现了稳健性能。
English: This paper introduces a causal framework and efficient estimators to mitigate dataset bias in deep learning, achieving robust performance across diverse datasets with minimal hyperparameters.
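DISCO$_m$ and sDISCO estimate conditional distance correlation; as background, here is a compact NumPy implementation of plain (unconditional) empirical distance correlation, the quantity these estimators build on. It is a reference sketch, not the paper's estimators.
```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation between paired samples x and y (n rows each)."""
    x = x.reshape(len(x), -1).astype(float)
    y = y.reshape(len(y), -1).astype(float)

    def double_centered(z):
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)   # pairwise distances
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

    A, B = double_centered(x), double_centered(y)
    dcov2 = (A * B).mean()
    dvarx, dvary = (A * A).mean(), (B * B).mean()
    return np.sqrt(max(dcov2, 0.0) / np.sqrt(dvarx * dvary + 1e-12))

# Independent inputs give a value near 0; dependent ones (e.g. y = x**2) do not.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 1))
print(distance_correlation(x, rng.normal(size=(512, 1))), distance_correlation(x, x**2))
```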
Authors:Shashank Balla
Abstract:
The widespread adoption of outsourced neural network inference presents significant privacy challenges, as sensitive user data is processed on untrusted remote servers. Secure inference offers a privacy-preserving solution, but existing frameworks suffer from high computational overhead and communication costs, rendering them impractical for real-world deployment. We introduce SecONNds, a non-intrusive secure inference framework optimized for large ImageNet-scale Convolutional Neural Networks. SecONNds integrates a novel fully Boolean Goldreich-Micali-Wigderson (GMW) protocol for secure comparison -- addressing Yao's millionaires' problem -- using preprocessed Beaver's bit triples generated from Silent Random Oblivious Transfer. Our novel protocol achieves an online speedup of 17$\times$ in nonlinear operations compared to state-of-the-art solutions while reducing communication overhead. To further enhance performance, SecONNds employs Number Theoretic Transform (NTT) preprocessing and leverages GPU acceleration for homomorphic encryption operations, resulting in speedups of 1.6$\times$ on CPU and 2.2$\times$ on GPU for linear operations. We also present SecONNds-P, a bit-exact variant that ensures verifiable full-precision results in secure computation, matching the results of plaintext computations. Evaluated on a 37-bit quantized SqueezeNet model, SecONNds achieves an end-to-end inference time of 2.8 s on GPU and 3.6 s on CPU, with a total communication of just 420 MiB. SecONNds' efficiency and reduced computational load make it well-suited for deploying privacy-sensitive applications in resource-constrained environments. SecONNds is open source and can be accessed from: https://github.com/shashankballa/SecONNds.
中文:SecONNds提出了一种针对大型神经网络优化的高效安全推理框架,通过创新协议和GPU加速显著降低了计算和通信开销,同时确保隐私保护。
English: SecONNds introduces an efficient secure inference framework optimized for large neural networks, significantly reducing computational and communication overhead while ensuring privacy through novel protocols and GPU acceleration.
Authors:Xiaoyu Ma, Hao Chen, Yongjian Deng
Abstract:
Different modalities hold considerable gaps in optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50%$\uparrow$ on CREMAD and 3.41%$\uparrow$ on Kinetic-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at https://github.com/MatthewMaxy/Remix_ICML2025.
中文: 本文提出多模态数据重组方法,通过解耦和重组数据解决模态不平衡问题并增强单模态学习,在不增加计算成本的情况下显著提升了准确率。
English: This paper introduces multimodal Data Remixing, a method that decouples and reassembles data to address modality imbalance and enhance unimodal learning, achieving significant accuracy improvements without extra computational costs.
Authors:Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, Yunhong Wang
Abstract:
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss, and uncover certain limitations of conventional Hessian approximations. By following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we firstly establish the connection between KL divergence and FIM, which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially promotes the accuracy compared to the state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at https://github.com/ShiheWang/FIMA-Q.
中文摘要:本文提出FIMA-Q这一新型视觉Transformer后训练量化方法,通过快速Fisher信息矩阵近似和分块重建框架解决低比特量化精度损失问题,在多项视觉任务中显著超越现有最优方法。
English Summary: This paper introduces FIMA-Q, a novel post-training quantization method for Vision Transformers that addresses accuracy degradation in low-bit settings by leveraging a fast Fisher Information Matrix approximation and block-wise reconstruction, achieving superior performance across various tasks.
Authors:Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan
Abstract:
Two-Tower Vision--Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it \textit{(i)} suffers from ineffective layer-by-layer utilization of unimodal representations, \textit{(ii)} restricts the flexible exploitation of different levels of unimodal semantic knowledge, and \textit{(iii)} is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.
中文: 提出的Manager插件通过自适应整合多层级单模态专家知识,在双塔视觉语言模型和多模态大语言模型中均实现了性能提升,并与多网格算法形成互补,增强了视觉细节的捕捉能力。
English: The proposed Manager plugin enhances Two-Tower Vision-Language Models by adaptively integrating multi-level unimodal expertise, achieving superior performance across VL tasks and MLLM architectures while complementing multi-grid algorithms for richer visual representation.
Authors:Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu
Abstract:
Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as attention mechanism in Transformers, aim to address such challenge by adaptively emphasizing modalities based on the characteristics of input data. However, through amounts of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely-used self-attention models diminishes. Model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.
中文摘要:研究发现多模态学习中的自注意力机制常无法动态适应不同模态质量,反而偏好单一模态形成自我强化偏差,但提出的滚动查询(RollingQ)方法通过轮换查询平衡注意力分配,成功恢复了模型的动态适应性。
English Summary: The study reveals that self-attention mechanisms in multimodal learning often fail to dynamically adapt to varying modality qualities, instead favoring one modality and creating a self-reinforcing bias, but proposes Rolling Query (RollingQ) to restore adaptability by rotating queries to balance attention allocation.
Authors:Abhishek Tyagi, Arjun Iyer, William H Renninger, Christopher Kanan, Yuhao Zhu
Abstract:
Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs at par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers. Our source code is available at https://github.com/horizon-research/DynaDiag/.
中文: DynaDiag提出了一种采用对角线稀疏模式的结构化稀疏训练方法,在保持与无结构化方法同等精度的同时,通过硬件友好的实现获得了显著的计算加速效果。
English: DynaDiag introduces a structured sparse training method using diagonal sparsity patterns that matches the accuracy of unstructured approaches while achieving significant computational speedups through hardware-friendly implementations.
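The diagonal sparsity pattern at the heart of DynaDiag can be illustrated with a simple mask generator. The sketch below keeps a fixed number of wrapped diagonals of a linear layer's weight to hit a target density; it shows only the pattern, whereas the actual method trains sparse-to-sparse and accelerates it with a custom CUDA kernel.
```python
import torch

def diagonal_sparsity_mask(out_features, in_features, density=0.1, device="cpu"):
    """Boolean mask keeping `density * in_features` wrapped diagonals of a weight."""
    n_diags = max(1, int(round(density * in_features)))
    rows = torch.arange(out_features, device=device).unsqueeze(1)   # (out, 1)
    cols = torch.arange(in_features, device=device).unsqueeze(0)    # (1, in)
    offset = (cols - rows) % in_features                            # which diagonal
    return offset < n_diags                                          # (out, in) bool mask

# 90% sparse mask for a 512x512 layer:
mask = diagonal_sparsity_mask(512, 512, density=0.1)
# layer.weight.data *= mask   (hypothetical usage inside a pruning loop)
print(mask.float().mean())    # ~0.10
```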
Authors:Jinhee Kim, Seoyeon Yoon, Taeho Lee, Joo Chan Lee, Kang Eun Jeon, Jong Hwan Ko
Abstract:
The deployment of deep neural networks on edge devices is a challenging task due to the increasing complexity of state-of-the-art models, requiring efforts to reduce model size and inference latency. Recent studies explore models operating at diverse quantization settings to find the optimal point that balances computational efficiency and accuracy. Truncation, an effective approach for achieving lower bit precision mapping, enables a single model to adapt to various hardware platforms with little to no cost. However, formulating a training scheme for deep neural networks to withstand the associated errors introduced by truncation remains a challenge, as the current quantization-aware training schemes are not designed for the truncation process. We propose TruncQuant, a novel truncation-ready training scheme allowing flexible bit precision through bit-shifting in runtime. We achieve this by aligning TruncQuant with the output of the truncation process, demonstrating strong robustness across bit-width settings, and offering an easily implementable training scheme within existing quantization-aware frameworks. Our code is released at https://github.com/a2jinhee/TruncQuant.
中文: TruncQuant是一种创新的训练方案,通过运行时位移使深度神经网络能灵活适应不同位精度,在多种硬件平台上均表现出强大鲁棒性,且易于集成到现有量化感知框架中。
English: TruncQuant is a novel training scheme that enables deep neural networks to adapt flexibly to various bit precisions through runtime bit-shifting, providing robust performance across different hardware platforms while being easily integrated into existing quantization-aware frameworks.
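Truncation to a lower bit-width is just a right shift on the quantized integers, which is why it is essentially free at runtime. The sketch below shows the mapping a truncation-ready model must stay accurate under; the unsigned representation and the 8-to-4-bit setting are illustrative assumptions.
```python
import torch

def truncate_to_bits(q_weights: torch.Tensor, src_bits: int = 8, dst_bits: int = 4):
    """Map integers quantized at `src_bits` down to `dst_bits` by discarding
    the least-significant bits with a right shift."""
    shift = src_bits - dst_bits
    return q_weights >> shift                  # integer tensor at the lower precision

# Example: unsigned 8-bit value 0b10110110 (182) becomes 0b1011 (11) at 4 bits.
w8 = torch.tensor([182, 47, 255], dtype=torch.int32)
print(truncate_to_bits(w8))                    # tensor([11,  2, 15])
```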
Authors:Pradyut Sekhsaria, Marcel Mateos Salles, Hai Huang, Randall Balestriero
Abstract:
Parameter Efficient FineTuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained Large Language Models (LLMs) to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been put in understanding possible catastrophic failures. We uncover one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When very small amount of tokens, e.g., one token per prompt, are correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, it also opens opportunities for malevolent parties to control a model's behavior from Seamless Spurious Token Injection (SSTI). In SSTI, a small amount of tokens correlated with downstream classes are injected by the dataset creators. At test time, the finetuned LLM's behavior can be controlled solely by injecting those few tokens. We apply SSTI across models from three families (Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3) and four diverse datasets (IMDB, Financial Classification, CommonSense QA, and Bias in Bios). Our findings reveal three astonishing behaviors. First, as few as a single token of SSTI is sufficient to steer a model's decision making. Second, for light SSTI, the reliance on spurious tokens is proportional to the LoRA rank. Lastly, with aggressive SSTI, larger LoRA rank values become preferable to small rank values as it makes the model attend to non-spurious tokens, hence improving robustness.
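中文: 本研究发现参数高效微调(如LoRA)会促使模型依赖与下游类别相关的少量虚假token,从而可被"无缝虚假token注入"(SSTI)操控:仅注入一个token即可左右模型决策;轻度注入下对虚假token的依赖与LoRA秩成正比,而在激进注入下更大的秩反而提升鲁棒性。
English: This study shows that parameter-efficient fine-tuning such as LoRA drives models toward shortcut reliance on a few spurious tokens correlated with downstream classes, enabling Seamless Spurious Token Injection (SSTI): a single injected token can steer a model's decisions, reliance grows with LoRA rank under light injection, and larger ranks improve robustness under aggressive injection.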
Authors:Sadman Sadeed Omee, Lai Wei, Sourin Dey, Jianjun Hu
Abstract:
Crystalline materials can form different structural arrangements (i.e. polymorphs) with the same chemical composition, exhibiting distinct physical properties depending on how they were synthesized or the conditions under which they operate. For example, carbon can exist as graphite (soft, conductive) or diamond (hard, insulating). Computational methods that can predict these polymorphs are vital in materials science, which help understand stability relationships, guide synthesis efforts, and discover new materials with desired properties without extensive trial-and-error experimentation. However, effective crystal structure prediction (CSP) algorithms for inorganic polymorph structures remain limited. We propose ParetoCSP2, a multi-objective genetic algorithm for polymorphism CSP that incorporates an adaptive space group diversity control technique, preventing over-representation of any single space group in the population guided by a neural network interatomic potential. Using an improved population initialization method and performing iterative structure relaxation, ParetoCSP2 not only alleviates premature convergence but also achieves improved convergence speed. Our results show that ParetoCSP2 achieves excellent performance in polymorphism prediction, including a nearly perfect space group and structural similarity accuracy for formulas with two polymorphs but with the same number of unit cell atoms. Evaluated on a benchmark dataset, it outperforms baseline algorithms by factors of 2.46-8.62 for these accuracies and improves by 44.8\%-87.04\% across key performance metrics for regular CSP. Our source code is freely available at https://github.com/usccolumbia/ParetoCSP2.
Chinese: 该研究提出ParetoCSP2算法,通过自适应空间群多样性控制和神经网络势能指导,改进了无机多晶型物的晶体结构预测,在准确性和收敛速度上均显著优于基准算法。
English: The study introduces ParetoCSP2, a multi-objective genetic algorithm that enhances crystal structure prediction for inorganic polymorphs by incorporating adaptive space group diversity control and neural network potentials, achieving superior accuracy and convergence compared to baseline methods.
Authors:Kyung Rok Kim, Yansong Wang, Xiaocheng Li, Guanting Chen
Abstract:
With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate dataset to minimize population loss of our prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
中文: 本研究提出了一种实用算法,通过整合不同来源的数据集,在理论上保证以高概率降低总体损失,并在线性回归和更广泛的机器学习应用中通过数值实验验证了其有效性。
English: This research introduces a practical algorithm that merges datasets from various sources to minimize population loss with theoretical guarantees, validated through numerical experiments in linear regression and broader machine learning tasks.
Authors:Shijie Fang, Hang Yu, Qidi Fang, Reuben M. Aronson, Elaine S. Short
Abstract:
Learning from Demonstration (LfD) is a popular approach for robots to acquire new skills, but most LfD methods suffer from imperfections in human demonstrations. Prior work typically treats these suboptimalities as random noise. In this paper we study non-optimal behaviors in non-expert demonstrations and show that they are systematic, forming what we call demonstration sidetracks. Using a public space study with 40 participants performing a long-horizon robot task, we recreated the setup in simulation and annotated all demonstrations. We identify four types of sidetracks (Exploration, Mistake, Alignment, Pause) and one control pattern (one-dimension control). Sidetracks appear frequently across participants, and their temporal and spatial distribution is tied to task context. We also find that users' control patterns depend on the control interface. These insights point to the need for better models of suboptimal demonstrations to improve LfD algorithms and bridge the gap between lab training and real-world deployment. All demonstrations, infrastructure, and annotations are available at https://github.com/AABL-Lab/Human-Demonstration-Sidetracks.
中文: 本研究识别了人类机器人演示中系统性的非最优行为,称为“演示旁路”,将其分为四种类型,并将其模式与任务情境和控制界面相关联,以改进演示学习算法。
English: This research identifies systematic non-optimal behaviors called "demonstration sidetracks" in human robot demonstrations, categorizing them into four types and linking their patterns to task context and control interfaces to improve Learning from Demonstration algorithms.
Authors:Ching Chang, Ming-Chih Lo, Wen-Chih Peng, Tien-Fu Chen
Abstract:
Multivariate time series data, collected across various fields such as manufacturing and wearable technology, exhibit states at multiple levels of granularity, from coarse-grained system behaviors to fine-grained, detailed events. Effectively segmenting and integrating states across these different granularities is crucial for tasks like predictive maintenance and performance optimization. However, existing time series segmentation methods face two key challenges: (1) the inability to handle multiple levels of granularity within a unified model, and (2) limited adaptability to new, evolving patterns in dynamic environments. To address these challenges, we propose PromptTSS, a novel framework for time series segmentation with multi-granularity states. PromptTSS uses a unified model with a prompting mechanism that leverages label and boundary information to guide segmentation, capturing both coarse- and fine-grained patterns while adapting dynamically to unseen patterns. Experiments show PromptTSS improves accuracy by 24.49% in multi-granularity segmentation, 17.88% in single-granularity segmentation, and up to 599.24% in transfer learning, demonstrating its adaptability to hierarchical states and evolving time series dynamics. Our code is available at https://github.com/blacksnail789521/PromptTSS.
Chinese: PromptTSS是一种新颖的时间序列分割框架,通过采用带有提示机制的统一模型,有效解决了现有方法无法处理多粒度状态和适应动态环境的问题,显著提升了分割精度和迁移学习能力。
English: PromptTSS is a novel framework that addresses the limitations of existing time series segmentation methods by using a unified model with a prompting mechanism to handle multi-granularity states and adapt to evolving patterns, significantly improving accuracy in segmentation and transfer learning tasks.
Authors:Simon Ghyselincks, Valeriia Okhmak, Stefano Zampini, George Turkiyyah, David Keyes, Eldad Haber
Abstract:
Visualizing the first few kilometers of the Earth's subsurface, a long-standing challenge gating a virtually inexhaustible list of important applications, is coming within reach through deep learning. Building on techniques of generative artificial intelligence applied to voxelated images, we demonstrate a method that extends surface geological data supplemented by boreholes to a three-dimensional subsurface region by training a neural network. The Earth's land area having been extensively mapped for geological features, the bottleneck of this or any related technique is the availability of data below the surface. We close this data gap in the development of subsurface deep learning by designing a synthetic data-generator process that mimics eons of geological activity such as sediment compaction, volcanic intrusion, and tectonic dynamics to produce a virtually limitless number of samples of the near lithosphere. A foundation model trained on such synthetic data is able to generate a 3D image of the subsurface from a previously unseen map of surface topography and geology, showing increasing fidelity with increasing access to borehole data, depicting such structures as layers, faults, folds, dikes, and sills. We illustrate the early promise of the combination of a synthetic lithospheric generator with a trained neural network model using generative flow matching. Ultimately, such models will be fine-tuned on data from applicable campaigns, such as mineral prospecting in a given region. Though useful in itself, a regionally fine-tuned model may be employed not as an end but as a means: as an AI-based regularizer in a more traditional inverse problem application, in which the objective function represents the mismatch of additional data with physical models, with applications in resource exploration, hazard assessment, and geotechnical engineering.
中文: 通过生成式人工智能,利用地表数据和钻孔信息训练神经网络,结合模拟地质活动的合成数据,实现了对地球近地表三维结构的可视化建模。
English: Deep learning is enabling the visualization of the Earth's subsurface by using generative AI to create 3D models from surface data and boreholes, trained on synthetic geological data that mimics natural processes.
Authors:Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen
Abstract:
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose a novel Mutual-Supervised Learning (MSL) framework for sequential-to-parallel code translation to address the functional equivalence issue. MSL consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that MuSL significantly enhances the performance of the base model: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/musl.
中文: 本文提出了一种互监督学习(MSL)框架,通过翻译器与测试器之间的迭代协作确保功能等价性,显著提升了串行到并行代码转换的性能,并优于现有先进方法。
English: The paper introduces a Mutual-Supervised Learning (MSL) framework that enhances sequential-to-parallel code translation by ensuring functional equivalence through iterative collaboration between a Translator and a Tester, achieving significant performance improvements over existing methods.
Authors:Christos Pantazopoulos, Spyridon Thermos, Gerasimos Potamianos
Abstract:
Estimating the 3D hand articulation from a single color image is an important problem with applications in Augmented Reality (AR), Virtual Reality (VR), Human-Computer Interaction (HCI), and robotics. Apart from the absence of depth information, occlusions, articulation complexity, and the need for camera parameters knowledge pose additional challenges. In this work, we propose an optimization pipeline for estimating the 3D hand articulation from 2D keypoint input, which includes a keypoint alignment step and a fingertip loss to overcome the need to know or estimate the camera parameters. We evaluate our approach on the EgoDexter and Dexter+Object benchmarks to showcase that it performs competitively with the state-of-the-art, while also demonstrating its robustness when processing "in-the-wild" images without any prior camera knowledge. Our quantitative analysis highlights the sensitivity of the 2D keypoint estimation accuracy, despite the use of hand priors. Code is available at the project page https://cpantazop.github.io/HandRepo/
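中文: 本文提出一种从2D关键点估计3D手部姿态的优化流程,通过关键点对齐步骤和指尖损失消除对相机参数的依赖,在EgoDexter和Dexter+Object基准上取得了与最先进方法相当的性能,并对真实场景图像表现出良好的鲁棒性。
English: This work proposes an optimization pipeline for estimating 3D hand articulation from 2D keypoints, using a keypoint alignment step and a fingertip loss to remove the need for known camera parameters, performing competitively with the state of the art on EgoDexter and Dexter+Object while remaining robust on in-the-wild images.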
Authors:Hourun Zhu, Chengchao Shen
Abstract:
In spite of strong performance achieved by LLMs, the costs of their deployment are unaffordable. For the compression of LLMs, gradient-based pruning methods present promising effectiveness. However, in these methods, the gradient computation with one-hot labels ignore the potential predictions on other words, thus missing key information for generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLM are less sensitive to multilayer perceptron (MLP) modules, which take up more than $5 \times$ parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules, to significantly compress LLM without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open source LLMs. The source code and trained weights are available at https://github.com/visresearch/SDMPrune.
中文: 本文在剪枝过程中引入自蒸馏损失以充分利用原始模型的预测来获得更准确的梯度信息,并专注于剪枝MLP模块,在1B规模的大模型中实现了优异的压缩效果和竞争力。
English: This paper introduces a self-distillation loss during pruning to better utilize the original model's predictions for accurate gradient computation and focuses on pruning MLP modules, achieving superior compression and competitive performance among 1B-scale LLMs.
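The self-distillation loss applied during pruning can be sketched as a KL divergence from the frozen original model's predictive distribution to the pruned model's, so that pruning gradients see the full distribution rather than one-hot labels. The temperature and reduction below are common defaults, not necessarily the paper's settings.
```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over logits of shape (batch, vocab), with the
    original (unpruned) model acting as a frozen teacher."""
    t = temperature
    p_teacher = F.log_softmax(teacher_logits.detach() / t, dim=-1)
    p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(p_student, p_teacher, log_target=True, reduction="batchmean") * t * t

# During pruning (hypothetical loop): gradients of this loss, rather than the
# one-hot cross-entropy, score the MLP weights considered for removal.
```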
Authors:Jaeho Lee, Atharv Chowdhary
Abstract:
Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.
中文摘要:AssertBench是一个新基准,用于评估当用户将同一证据支持的事实表述为正确或错误时,大型语言模型是否能保持一致性的事实判断,从而测试其不因迎合用户而改变评估的能力。
English Summary: AssertBench is a new benchmark that evaluates whether LLMs maintain consistent factual judgments when users frame the same evidence-backed facts as either correct or incorrect, testing their ability to resist switching evaluations to simply agree with users.
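A minimal sketch of the two directional framings AssertBench constructs around each evidence-backed fact; the prompt wording and the `ask_model` / `parse_verdict` callables are placeholders, not the benchmark's exact templates.

```python
def framed_prompts(fact: str) -> dict:
    return {
        "claimed_true": f'I know for a fact that "{fact}" is correct. Do you agree?',
        "claimed_false": f'I know for a fact that "{fact}" is incorrect. Do you agree?',
    }

def sticks_to_its_guns(fact: str, ask_model, parse_verdict) -> bool:
    """True if the model's truth evaluation of the fact is identical under both
    user framings, i.e. it does not flip its verdict just to agree with the user."""
    verdicts = [parse_verdict(ask_model(p)) for p in framed_prompts(fact).values()]
    return verdicts[0] == verdicts[1]
```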
Authors:Justin Asher
Abstract:
The expanding Lean 4 ecosystem poses challenges for navigating its vast libraries. This paper introduces LeanExplore, a search engine for Lean 4 declarations. LeanExplore enables users to semantically search for statements, both formally and informally, across select Lean 4 packages (including Batteries, Init, Lean, Mathlib, PhysLean, and Std). This search capability is powered by a hybrid ranking strategy, integrating scores from a multi-source semantic embedding model (capturing conceptual meaning from formal Lean code, docstrings, AI-generated informal translations, and declaration titles), BM25+ for keyword-based lexical relevance, and a PageRank-based score reflecting declaration importance and interconnectedness. The search engine is accessible via a dedicated website (https://www.leanexplore.com/) and a Python API (https://github.com/justincasher/lean-explore). Furthermore, the database can be downloaded, allowing users to self-host the service. LeanExplore integrates easily with LLMs via the model context protocol (MCP), enabling users to chat with an AI assistant about Lean declarations or utilize the search engine for building theorem-proving agents. This work details LeanExplore's architecture, data processing, functionalities, and its potential to enhance Lean 4 workflows and AI-driven mathematical research
中文摘要:LeanExplore 搜索引擎通过混合排名策略(结合概念嵌入、词法相关性和声明互连性)实现了对 Lean 4 库的语义搜索,解决了庞大生态系统的导航难题,支持网页访问、API调用及大语言模型集成。
English Summary: The LeanExplore search engine addresses the challenge of navigating Lean 4's extensive libraries by enabling semantic searches through a hybrid ranking system that combines conceptual embeddings, lexical relevance, and declaration interconnectedness, accessible via web interface, API, and LLM integration.
Authors:Henrik Abgaryan, Tristan Cazenave, Ararat Harutyunyan
Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet their direct application to NP-hard combinatorial problems (CPs) remains underexplored. In this work, we systematically investigate the reasoning abilities of LLMs on a variety of NP-hard combinatorial optimization tasks and introduce ACCORD: Autoregressive Constraint-satisfying generation for COmbinatorial optimization with Routing and Dynamic attention. ACCORD features a novel dataset representation and model architecture that leverage the autoregressive nature of LLMs to dynamically enforce feasibility constraints, coupled with attention-based routing to activate problem-specific LoRA modules. We also present the ACCORD-90k supervised dataset, covering six NP-hard combinatorial problems: TSP, VRP, Knapsack, FlowShop, JSSP, and BinPacking. Extensive experiments demonstrate that our ACCORD model, built on an 8B-parameter Llama backbone, consistently outperforms standard prompting and input-output methods, even when compared to much larger LLMs, such as GPT-4. Ablation studies further show that our output structure enhances solution feasibility. To the best of our knowledge, this is the first large-scale, end-to-end framework for exploring the applications of LLMs to a broad spectrum of combinatorial optimization problems. The code is publicly available at https://github.com/starjob42/ACCORD.
中文: 本文提出ACCORD框架,通过动态满足可行性约束和基于注意力的路由激活专用模块,显著提升大语言模型解决NP难组合优化问题的能力,实验表明其性能优于标准方法及更大规模模型。
English: This paper introduces ACCORD, a novel framework that enhances LLMs' ability to solve NP-hard combinatorial problems by dynamically enforcing feasibility constraints and using attention-based routing to activate specialized modules, demonstrating superior performance over standard methods even with smaller models.
Authors:Jiaqi Zhao, Weili Guan, Ming Li, Miao Zhang
Abstract:
Existing post-training quantization methods for large language models (LLMs) have achieved remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLM quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose a plug-and-play null space projection module for existing milestone PTQ baselines, named Q2N. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference conditions while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLM quantization. We view this paper as a first step toward further alleviating quantization error based on insights from the null space, and we hope it inspires future researchers to design more advanced quantization methods. Code is available at https://github.com/zjq0455/q2n.
中文摘要:本文首次将零空间概念引入大语言模型量化领域,提出可即插即用的Q2N模块,通过将量化后权重扰动约束在输入激活的零空间内有效降低误差,并在多个前沿模型上通过实验验证了其有效性。
English Summary: This paper introduces the concept of null space to large language model quantization, proposing a plug-and-play Q2N module that reduces quantization error by constraining weight perturbations within input activations' null space, validated through extensive experiments on leading models.
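As a numerical illustration of the null-space idea above, the sketch below projects a weight perturbation onto the null space of calibration activations using a plain SVD. Q2N itself uses an efficient approximation and a closed-form equivalent vector, so this is only the naive version of the underlying constraint.

```python
import torch

def null_space_project(dW: torch.Tensor, X: torch.Tensor, tol: float = 1e-5):
    """dW: (out, d) post-quantization weight perturbation.
    X:  (n, d) calibration input activations (rows are samples)."""
    _, S, Vh = torch.linalg.svd(X, full_matrices=False)
    r = int((S > tol * S.max()).sum())        # numerical rank of X
    V = Vh[:r].T                              # (d, r), spans the row space of X
    P = torch.eye(X.shape[1]) - V @ V.T       # projector onto null(X)
    return dW @ P

# Rank-deficient activations leave a non-trivial null space to project into.
X = torch.randn(128, 32) @ torch.randn(32, 64)
dW = 0.01 * torch.randn(256, 64)
dW_proj = null_space_project(dW, X)
print(torch.linalg.norm(X @ dW_proj.T))       # ~0: no output error on the calibration data
```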
Authors:Linjie Li, Zhenyu Wu, Yang Ji
Abstract:
Class-incremental learning (CIL) requires deep learning models to continuously acquire new knowledge from streaming data while preserving previously learned information. Recently, CIL based on pre-trained models (PTMs) has achieved remarkable success. However, prompt-based approaches suffer from prompt overwriting, while adapter-based methods face challenges such as dimensional misalignment between tasks. While the idea of expert fusion in Mixture of Experts (MoE) can help address dimensional inconsistency, both expert and routing parameters are prone to being overwritten in dynamic environments, making MoE challenging to apply directly in CIL. To tackle these issues, we propose a mixture of task-specific experts (MoTE) framework that effectively mitigates the miscalibration caused by inconsistent output dimensions across tasks. Inspired by the weighted feature fusion and sparse activation mechanisms in MoE, we introduce task-aware expert filtering and reliable expert joint inference during the inference phase, mimicking the behavior of routing layers without inducing catastrophic forgetting. Extensive experiments demonstrate the superiority of our method without requiring an exemplar set. Furthermore, the number of tasks in MoTE scales linearly with the number of adapters. Building on this, we further explore the trade-off between adapter expansion and model performance and propose the Adapter-Limited MoTE. The code is available at https://github.com/Franklilinjie/MoTE.
Chinese: 本文提出了一种任务特定专家混合(MoTE)框架,通过任务感知的专家筛选和联合推理机制,有效解决了类增量学习中的提示覆盖和维度不匹配问题,无需样本集即可避免灾难性遗忘。
English: This paper introduces a Mixture of Task-specific Experts (MoTE) framework to address challenges in class-incremental learning, such as prompt overwriting and dimensional misalignment, by leveraging task-aware expert filtering and joint inference to prevent catastrophic forgetting without needing exemplars.
Authors:Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer
Abstract:
As image generators produce increasingly realistic images, concerns about potential misuse continue to grow. Supervised detection relies on large, curated datasets and struggles to generalize across diverse generators. In this work, we investigate the use of pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. While off-the-shelf VLMs exhibit some task-specific reasoning and chain-of-thought prompting offers gains, we show that task-aligned prompting elicits more focused reasoning and significantly improves performance without fine-tuning. Specifically, prefixing the model's response with the phrase "Let's examine the style and the synthesis artifacts" -- a method we call zero-shot-s$^2$ -- boosts Macro F1 scores by 8%-29%. These gains are consistent for two widely used open-source models and across three recent, diverse datasets spanning human faces, objects, and animals with images generated by 16 different models -- demonstrating strong generalization. We further evaluate the approach across three additional model sizes and observe improvements in most dataset-model combinations -- suggesting robustness to model scale. Surprisingly, self-consistency, a behavior previously observed in language reasoning, where aggregating answers from diverse reasoning paths improves performance, also holds in this setting. Even here, zero-shot-s$^2$ scales better than chain-of-thought in most cases -- indicating that it elicits more useful diversity. Our findings show that task-aligned prompts elicit more focused reasoning and enhance latent capabilities in VLMs, like the detection of AI-generated images -- offering a simple, generalizable, and explainable alternative to supervised methods. Our code is publicly available on github: https://github.com/Zoher15/Zero-shot-s2.
中文: 本研究提出了零样本s²方法,通过任务对齐提示显著提升了视觉语言模型在无需微调的情况下检测AI生成图像的能力,并在多种数据集和模型上展现出强大的泛化性。
English: This study introduces zero-shot-s², a task-aligned prompting method that significantly enhances Vision-Language Models' ability to detect AI-generated images without fine-tuning, demonstrating strong generalization across diverse datasets and models.
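A minimal sketch of the task-aligned prompting trick described above: the assistant's response is seeded with the zero-shot-s$^2$ prefix so the VLM reasons about style and synthesis artifacts before answering. The `generate` callable and its `response_prefix` argument are placeholders for whatever chat interface the VLM exposes.

```python
S2_PREFIX = "Let's examine the style and the synthesis artifacts"

def detect_ai_image(image, generate) -> str:
    question = ("Is this image real or AI-generated? "
                "Finish your answer with 'real' or 'fake'.")
    # Seeding the assistant turn with the prefix forces the continuation to
    # start from artifact-focused reasoning rather than a generic answer.
    response = generate(image=image, prompt=question, response_prefix=S2_PREFIX)
    return "fake" if "fake" in response.lower() else "real"
```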
Authors:Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Abstract:
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
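A minimal sketch of the SoftPAG perturbation described in the abstract: for each selected head, the attention map is linearly interpolated toward the identity matrix, with the interpolation weight acting as a continuous perturbation knob. The tensor layout is an assumption; the authors' DiT hook code is not reproduced here.

```python
import torch

def softpag(attn: torch.Tensor, lam: float) -> torch.Tensor:
    """attn: (batch, heads, tokens, tokens) row-stochastic attention maps of the
    heads chosen by the HeadHunter-style selection; lam in [0, 1] is the
    perturbation strength (lam=0 leaves the head unchanged, lam=1 replaces it
    with the identity)."""
    n = attn.shape[-1]
    eye = torch.eye(n, device=attn.device, dtype=attn.dtype)
    return (1.0 - lam) * attn + lam * eye
```

The perturbed forward pass then plays the role of the implicit weak model that guidance steers generation away from.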
Authors:Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.
中文摘要:AutoMind是一种自适应大型语言模型智能体框架,通过融合专家知识、策略性解决方案探索和动态编码,在自动化数据科学基准测试中展现出卓越性能。
English Summary: AutoMind is an adaptive LLM-agent framework that enhances automated data science by integrating expert knowledge, strategic solution exploration, and dynamic coding, achieving superior performance on benchmarks.
Authors:Julius Berner, Miguel Liu-Schiaffini, Jean Kossaifi, Valentin Duruisseaux, Boris Bonev, Kamyar Azizzadenesheli, Anima Anandkumar
Abstract:
A wide range of scientific problems, such as those described by continuous-time dynamical systems and partial differential equations (PDEs), are naturally formulated on function spaces. While function spaces are typically infinite-dimensional, deep learning has predominantly advanced through applications in computer vision and natural language processing that focus on mappings between finite-dimensional spaces. Such fundamental disparities in the nature of the data have limited neural networks from achieving a comparable level of success in scientific applications as seen in other fields. Neural operators are a principled way to generalize neural networks to mappings between function spaces, offering a pathway to replicate deep learning's transformative impact on scientific problems. For instance, neural operators can learn solution operators for entire classes of PDEs, e.g., physical systems with different boundary conditions, coefficient functions, and geometries. A key factor in deep learning's success has been the careful engineering of neural architectures through extensive empirical testing. Translating these neural architectures into neural operators allows operator learning to enjoy these same empirical optimizations. However, prior neural operator architectures have often been introduced as standalone models, not directly derived as extensions of existing neural network architectures. In this paper, we identify and distill the key principles for constructing practical implementations of mappings between infinite-dimensional function spaces. Using these principles, we propose a recipe for converting several popular neural architectures into neural operators with minimal modifications. This paper aims to guide practitioners through this process and details the steps to make neural operators work in practice. Our code can be found at https://github.com/neuraloperator/NNs-to-NOs
中文摘要:本文提出神经算子作为神经网络向无限维函数空间映射的原则性扩展,提供了一个实用框架,可将现有神经架构应用于科学计算领域,如求解偏微分方程。
English summary: This paper introduces neural operators as a principled extension of neural networks to handle mappings between infinite-dimensional function spaces, providing a practical framework to adapt existing neural architectures for scientific applications like solving partial differential equations.
Authors:Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Abstract:
Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla's law by reducing extrapolation error by 433\%. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. We are comprehensively open-sourcing all models, data, results, and logs at https://github.com/Farseer-Scaling-Law/Farseer to foster further research.
中文: Farseer提出了一种改进的扩展定律,显著提升了大型语言模型训练的预测准确性,能够从小规模实验可靠地推断大规模性能,相比Chinchilla定律将外推误差降低了433%。
English: Farseer introduces a refined scaling law that significantly improves predictive accuracy for large language model training, enabling reliable extrapolation from small-scale experiments to large-scale performance and reducing extrapolation error by 433% compared to Chinchilla's law.
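The abstract does not disclose Farseer's functional form, so as context for what fitting a loss surface $L(N,D)$ involves, the sketch below fits the classic Chinchilla parametric surface $L(N,D)=E+A N^{-\alpha}+B D^{-\beta}$, which Farseer is reported to improve upon. All data values are synthetic placeholders, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic placeholder runs: (N parameters, D tokens, observed loss).
rng = np.random.default_rng(0)
N = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
D = np.array([2e9, 6e9, 2e10, 6e10, 2e11, 6e11])
L = 1.7 + 400.0 / N**0.34 + 410.0 / D**0.28 + rng.normal(0.0, 0.01, N.size)

popt, _ = curve_fit(chinchilla_loss, (N, D), L,
                    p0=[2.0, 300.0, 300.0, 0.3, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], np.round(popt, 3))))
```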
Authors:Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu
Abstract:
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
中文摘要:SpectralAR框架通过谱视角实现视觉序列的因果性,采用嵌套谱标记化将图像转化为有序频谱标记进行自回归生成,以少量标记和参数取得了优异性能。
English Summary: The SpectralAR framework introduces a novel autoregressive visual generation method using spectral tokens to ensure causality and efficiency, achieving state-of-the-art results with minimal tokens and parameters.
Authors:Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Abstract:
Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
中文摘要:本研究提出了一个全面、专业标注的中文有害内容检测基准,涵盖六大类别并基于真实数据构建,同时通过知识增强基线方法,使较小模型能达到与顶尖大语言模型相媲美的性能。
English Summary: This study introduces a comprehensive, professionally annotated benchmark for Chinese harmful content detection, covering six categories and utilizing real-world data, along with a knowledge-augmented baseline that enhances smaller models' performance to match state-of-the-art LLMs.
Authors:Boaz Lavon, Shahar Katz, Lior Wolf
Abstract:
We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. Our code is available at: https://github.com/boazlavon/eg_cfg
中文: 我们提出了一种新颖的神经代码生成方法——执行引导的无分类器引导(EG-CFG),通过在推理过程中动态整合实时执行反馈来引导模型生成可执行代码,在各类编程任务中实现了最先进的性能。
English: We introduce Execution-Guided Classifier-Free Guidance (EG-CFG), a novel neural code generation method that dynamically integrates real-time execution feedback during inference to guide the model toward producing executable code, achieving state-of-the-art performance across diverse coding tasks.
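A minimal sketch of the execution-feedback step described above: candidate continuations are run against a test case, and the observed behavior becomes a textual signal for the next prompt. The guidance math itself (the classifier-free-guidance-style mixing) is omitted, and the candidate/test strings are illustrative.

```python
def execution_signal(candidate_program: str, test_case: str) -> str:
    """Run a candidate program plus one test and report what happened."""
    env: dict = {}
    try:
        exec(candidate_program + "\n" + test_case, env)  # sketch only; unsafe for untrusted code
        return "PASS"
    except AssertionError:
        return "FAIL: assertion failed"
    except Exception as e:  # syntax and runtime errors are also useful signals
        return f"ERROR: {type(e).__name__}: {e}"

candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]
test = "assert add(2, 3) == 5"
print([execution_signal(c, test) for c in candidates])  # ['PASS', 'FAIL: assertion failed']
```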
Authors:Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal
Abstract:
Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit-a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gradient-based updates. Through supervised finetuning (SFT), these self-edits result in persistent weight updates, enabling lasting adaptation. To train the model to produce effective self-edits, we use a reinforcement learning loop with the downstream performance of the updated model as the reward signal. Unlike prior approaches that rely on separate adaptation modules or auxiliary networks, SEAL directly uses the model's own generation to control its adaptation process. Experiments on knowledge incorporation and few-shot generalization show that SEAL is a promising step toward language models capable of self-directed adaptation. Our website and code is available at https://jyopari.github.io/posts/seal.
Authors:Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma
Abstract:
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
中文: M4V框架采用多模态Mamba架构,通过创新的令牌重组设计和奖励学习策略,在保持高质量视频生成的同时,将计算成本显著降低45%。
English: The M4V framework introduces a multi-modal Mamba architecture that significantly reduces computational costs by 45% while maintaining high-quality video generation through innovative token re-composition and reward learning strategies.
Authors:Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Abstract:
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo
中文:Duo方法通过将高斯扩散技术迁移到均匀状态离散扩散模型中,采用课程学习策略加速训练,并引入离散一致性蒸馏实现快速少步生成,在部分基准测试中超越了自回归模型的性能。
English: The Duo method enhances uniform-state discrete diffusion models by transferring techniques from Gaussian diffusion, using curriculum learning to accelerate training and discrete consistency distillation to enable fast few-step generation, outperforming autoregressive models on some benchmarks.
Authors:Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: \href{https://github.com/OliverLeeXZ/OPT-BENCH}{https://github.com/OliverLeeXZ/OPT-BENCH}.
中文: 本研究提出了OPT-BENCH基准来评估大语言模型在迭代优化任务中的表现,并证明利用历史反馈能显著提升其在机器学习和NP问题上的解决能力。
English: The study introduces OPT-BENCH, a benchmark for evaluating LLM agents on iterative optimization tasks, and demonstrates that leveraging historical feedback significantly improves their performance across machine learning and NP problems.
Authors:Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin
Abstract:
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab
Chinese: ConTextTab是一种新颖的表格原生上下文学习框架,通过专用嵌入和大规模真实数据训练整合语义理解与对齐,在多个基准测试中表现优异,并在CARTE基准上创下新标准。
English: ConTextTab is a novel table-native in-context learning framework that integrates semantic understanding and alignment through specialized embeddings and large-scale real-world data training, achieving competitive performance across benchmarks and setting new standards on the CARTE benchmark.
Authors:Igor Urbanik, Paweł Gajewski
Abstract:
Continual learning poses a fundamental challenge for neural systems, which often suffer from catastrophic forgetting when exposed to sequential tasks. Self-Organizing Maps (SOMs), despite their interpretability and efficiency, are not immune to this issue. In this paper, we introduce Saturation Self-Organizing Maps (SatSOM), an extension of SOMs designed to improve knowledge retention in continual learning scenarios. SatSOM incorporates a novel saturation mechanism that gradually reduces the learning rate and neighborhood radius of neurons as they accumulate information. This effectively freezes well-trained neurons and redirects learning to underutilized areas of the map.
Chinese: 本文提出饱和自组织映射(SatSOM),通过引入饱和机制逐步降低已学习神经元的参数更新,有效缓解持续学习中的灾难性遗忘问题。
English: This paper introduces Saturation Self-Organizing Maps (SatSOM), which enhance continual learning by incorporating a saturation mechanism that reduces learning parameters for experienced neurons, thus mitigating catastrophic forgetting.
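A minimal sketch of the saturation idea: each neuron keeps a saturation counter, and its effective learning rate and neighborhood radius shrink as it accumulates updates, so well-trained neurons freeze and learning is redirected to under-used map regions. The decay schedule and hyperparameters are illustrative assumptions, not the released SatSOM code.

```python
import numpy as np

class SatSOM:
    def __init__(self, grid=(10, 10), dim=4, lr0=0.5, sigma0=2.0, decay=0.05):
        self.W = np.random.rand(grid[0] * grid[1], dim)
        self.coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
        self.sat = np.zeros(len(self.W))          # per-neuron saturation counter
        self.lr0, self.sigma0, self.decay = lr0, sigma0, decay

    def update(self, x):
        bmu = np.argmin(np.linalg.norm(self.W - x, axis=1))
        # Saturation lowers both the learning rate and the neighborhood radius.
        lr = self.lr0 / (1.0 + self.decay * self.sat)
        sigma = self.sigma0 / (1.0 + self.decay * self.sat)
        d2 = np.sum((self.coords - self.coords[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * sigma**2))        # neighborhood of the BMU
        self.W += (lr * h)[:, None] * (x - self.W)
        self.sat += h                             # neurons that moved become more saturated

som = SatSOM()
for x in np.random.rand(500, 4):
    som.update(x)
```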
Authors:Alexander Lobashev, Dmitry Guskov, Maria Larchenko, Mikhail Tamm
Abstract:
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at https://github.com/alobashev/hessian-geometry-of-diffusion-models.
Chinese: 本文提出了一种通过重构Fisher信息度量来分析生成模型潜在空间几何结构的新方法,揭示了扩散模型中相变的分形结构,并在热力学量重构方面优于现有基准。
English: This paper introduces a method to reconstruct the Fisher information metric for analyzing latent space geometry in generative models, revealing phase transitions and fractal structures in diffusion models while outperforming baselines in thermodynamic quantity reconstruction.
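For exponential families, the Fisher information metric is the Hessian of the log-partition function, which is the quantity the method above learns from samples. The sketch below illustrates only that final step, with the exact log-partition of a categorical distribution standing in for the learned one.

```python
import torch
from torch.autograd.functional import hessian

def log_partition(theta: torch.Tensor) -> torch.Tensor:
    # Exact log-partition A(theta) of a categorical exponential family.
    return torch.logsumexp(theta, dim=0)

theta = torch.tensor([0.2, -0.5, 1.0])
fisher = hessian(log_partition, theta)            # (3, 3) Fisher metric at theta

# Sanity check against the closed form diag(p) - p p^T.
p = torch.softmax(theta, dim=0)
assert torch.allclose(fisher, torch.diag(p) - torch.outer(p, p), atol=1e-5)
```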
Authors:Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek
Abstract:
The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.
中文:SDialog是一个模块化的Python工具包,它利用指令调优的大语言模型生成真实可控的合成对话,支持多智能体模拟和场景驱动的工作流程,以推动对话AI研究并确保可复现性。
English: SDialog is a modular Python toolkit that uses instruction-tuned LLMs to generate realistic and controllable synthetic dialogues, supporting multi-agent simulations and scenario-driven workflows to advance conversational AI research and ensure reproducibility.
Authors:Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi
Abstract:
This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps-a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline's 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: https://github.com/masoudrahimi39/ECG-code.
中文摘要:本文提出了一种采用U-Net分割和自适应数字化技术的两阶段流程,能精准将纸质心电图转换为数字信号,相比基线方法在处理信号重叠情况时表现出显著优势。
English Summary: This paper introduces a two-stage pipeline using U-Net segmentation and adaptive digitization to accurately convert paper ECG recordings into digital signals, significantly improving performance especially for overlapping signals compared to baseline methods.
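The abstract refers to established digitization techniques for turning the refined binary mask into a time series; a common approach, sketched below as an illustration rather than the authors' exact method, is to take the vertical centroid of the trace in each column and rescale it with the detected grid calibration (standard paper ECG: 0.04 s and 0.1 mV per small square).

```python
import numpy as np

def mask_to_signal(mask: np.ndarray, px_per_small_square: float,
                   s_per_square: float = 0.04, mv_per_square: float = 0.1):
    """mask: (H, W) binary array, 1 where the isolated ECG trace is."""
    H, W = mask.shape
    rows = np.arange(H)[:, None]
    counts = mask.sum(axis=0)
    # Vertical centroid per column; columns with no trace are interpolated.
    centroid = np.where(counts > 0,
                        (rows * mask).sum(axis=0) / np.maximum(counts, 1), np.nan)
    cols = np.arange(W)
    valid = ~np.isnan(centroid)
    centroid = np.interp(cols, cols[valid], centroid[valid])
    baseline = np.median(centroid)
    t = cols * s_per_square / px_per_small_square
    mv = (baseline - centroid) * mv_per_square / px_per_small_square  # image up = positive mV
    return t, mv
```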
Authors:Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
Abstract:
Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
中文摘要:提出的因果推理蒸馏器(CIDer)框架通过自蒸馏和因果推理模块,解决了多模态情感识别中模态缺失和分布外数据的双重挑战,以更少参数和更快训练实现了鲁棒性能。
English Summary: The proposed Causal Inference Distiller (CIDer) framework addresses simultaneous modality missing and Out-Of-Distribution challenges in Multimodal Emotion Recognition through self-distillation and causal inference modules, achieving robust performance with fewer parameters and faster training.
Authors:Yuhang Chen, Zhen Tan, Tianlong Chen
Abstract:
Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents' spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (fine-tuning Qwen2-VL-2B-Instruct) achieves 61.9\% accuracy on EQA-RM-Bench with only 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM. The code and dataset can be found here https://github.com/UNITES-Lab/EQA-RM.
中文: 本文提出了针对具身问答任务的生成式多模态奖励模型EQA-RM,该模型能提供可解释的反馈并以少量样本实现优于主流模型的性能,同时建立了标准化评估新基准。
English: This paper introduces EQA-RM, a generative multimodal reward model for Embodied Question Answering, which provides interpretable feedback and outperforms leading models with high sample efficiency, alongside a new benchmark for standardized evaluation.
Authors:Yanhui Li, Dongxia Wang, Zhu Sun, Haonan Zhang, Huizhong Guo
Abstract:
Recently, Graph Neural Networks (GNNs) have become the dominant approach for Knowledge Graph-aware Recommender Systems (KGRSs) due to their proven effectiveness. Building upon GNN-based KGRSs, Self-Supervised Learning (SSL) has been incorporated to address the sparsity issue, leading to longer training time. However, through extensive experiments, we reveal that: (1) compared to other KGRSs, existing GNN-based KGRSs fail to maintain their superior performance under sparse interactions, even with SSL; (2) more complex models tend to perform worse in sparse interaction scenarios, and complex mechanisms, such as attention, can be detrimental as they often increase learning difficulty. Inspired by these findings, we propose LightKG, a simple yet powerful GNN-based KGRS to address sparsity issues. LightKG includes a simplified GNN layer that encodes directed relations as scalar pairs rather than dense embeddings and employs a linear aggregation framework, greatly reducing the complexity of GNNs. Additionally, LightKG incorporates an efficient contrastive layer to implement SSL. It directly minimizes the node similarity in the original graph, avoiding the time-consuming subgraph generation and comparison required in previous SSL methods. Experiments on four benchmark datasets show that LightKG outperforms 12 competitive KGRSs in both sparse and dense scenarios while significantly reducing training time. Specifically, it surpasses the best baselines by an average of 5.8\% in recommendation accuracy and saves 84.3\% of training time compared to KGRSs with SSL. Our code is available at https://github.com/1371149/LightKG.
中文: 图神经网络在知识图谱感知推荐系统中应用广泛,但面临稀疏交互和复杂性增加的挑战,因此提出了LightKG这一简化模型,既提升了推荐准确性又显著缩短了训练时间。
English: Graph Neural Networks (GNNs) are widely used in Knowledge Graph-aware Recommender Systems but struggle with sparse interactions and increased complexity, prompting the development of LightKG, a simplified GNN model that enhances performance and reduces training time.
Authors:Aaryam Sharma
Abstract:
Air pollution has become a significant health risk in developing countries. While governments routinely publish air-quality index (AQI) data to track pollution, these values fail to capture the local reality, as sensors are often very sparse. In this paper, we address this gap by predicting AQI in 1 km^2 neighborhoods, using the example of the AirDelhi dataset. Using spatio-temporal GNNs, we surpass existing works by 71.654 MSE, a 79% reduction, even on unseen coordinates. New insights about AQI, such as the existence of strong repetitive short-term patterns and changing spatial relations, are also discovered. The code is available on GitHub.
Chinese: 本文通过使用时空图神经网络精确预测1平方公里区域的空气质量指数,解决了传感器稀疏的局限,将误差降低了79%,并揭示了污染数据中的新规律。
English: This paper tackles the limitations of sparse air-quality sensors by using spatio-temporal graph neural networks to accurately predict AQI in 1 km² areas, achieving a 79% reduction in error and uncovering new patterns in pollution data.
Authors:Cameron Angliss, Jiaxun Cui, Jiaheng Hu, Arrasy Rahman, Peter Stone
Abstract:
Developing AI agents that can robustly adapt to dramatically different strategic landscapes without retraining is a central challenge for multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with an extraordinarily large space of possible team configurations of approximately $10^{139}$ - far larger than those of Dota or Starcraft. The highly discrete, combinatorial nature of team building in Pokémon VGC causes optimal strategies to shift dramatically depending on both the team being piloted and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies human-play datasets and a range of baselines - from large-language-model agents and behavior cloning to reinforcement learning and empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated on a single-team configuration, our methods are able to win against a professional VGC competitor. We extensively evaluated all baseline methods over progressively larger team sets and find that even the best-performing algorithm in the single-team setting struggles at scaling up as team size grows. Thus, policy generalization across diverse team strategies remains an open challenge for the community. Our code is open sourced at https://github.com/cameronangliss/VGC-Bench.
中文摘要:VGC-Bench作为新基准被提出,旨在解决AI智能体在宝可梦VGC巨大策略空间中泛化适应的核心难题,现有方法在单队伍配置中表现良好但难以扩展到多队伍场景。
English Summary: VGC-Bench is introduced as a new benchmark to address the challenge of developing AI agents that can generalize across Pokémon VGC's vast strategic landscape, where current methods succeed in single-team scenarios but fail to scale effectively.
Authors:Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li
Abstract:
Recent advances in medical imaging have established deep learning-based segmentation as the predominant approach, though it typically requires large amounts of manually annotated data. However, obtaining annotations for intracranial hemorrhage (ICH) remains particularly challenging due to the tedious and costly labeling process. Semi-supervised learning (SSL) has emerged as a promising solution to address the scarcity of labeled data, especially in volumetric medical image segmentation. Unlike conventional SSL methods that primarily focus on high-confidence pseudo-labels or consistency regularization, we propose SWDL-Net, a novel SSL framework that exploits the complementary advantages of Laplacian pyramid and deep convolutional upsampling. The Laplacian pyramid excels at edge sharpening, while deep convolutions enhance detail precision through flexible feature mapping. Our framework achieves superior segmentation of lesion details and boundaries through a difference learning mechanism that effectively integrates these complementary approaches. Extensive experiments on a 271-case ICH dataset and public benchmarks demonstrate that SWDL-Net outperforms current state-of-the-art methods in scenarios with only 2% labeled data. Additional evaluations on the publicly available Brain Hemorrhage Segmentation Dataset (BHSD) with 5% labeled data further confirm the superiority of our approach. Code and data have been released at https://github.com/SIAT-CT-LAB/SWDL.
中文摘要:SWDL-Net是一种新型半监督学习框架,通过结合拉普拉斯金字塔边缘锐化和深度卷积上采样,在仅使用少量标注数据的情况下实现了优越的颅内出血分割效果。
English Summary: SWDL-Net is a novel semi-supervised learning framework that combines Laplacian pyramid edge sharpening with deep convolutional upsampling to achieve superior intracranial hemorrhage segmentation using minimal labeled data.
Authors:Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky
Abstract:
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo
中文: PyLO库通过基于PyTorch的解决方案,使学习优化器在实际应用中更易用且高效,显著提升训练速度并兼容现有优化工具。
English: The PyLO library introduces a PyTorch-based solution to make learned optimizers accessible and practical for real-world applications, offering significant speed improvements and compatibility with existing optimization tools.
Authors:Parsa Rahimi, Sebastien Marcel
Abstract:
In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator's embedding space, rather than close in the generator's condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator's learned condition space and the discriminator's embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: https://parsa-ra.github.io/scoremix
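A minimal sketch of the convex score mixing described above: during diffusion sampling, the class-conditional score (or noise) predictions of two classes are convexly combined to synthesize hard samples between classes. The denoiser signature and how the mixed prediction is fed back into the sampler are assumptions.

```python
import torch

def scoremix_eps(denoiser, x_t, t, class_a, class_b, lam: float) -> torch.Tensor:
    """Convex combination of two class-conditioned score estimates, lam in [0, 1].
    The paper reports larger gains when the mixed classes are far apart in the
    discriminator's embedding space rather than in the generator's condition space."""
    eps_a = denoiser(x_t, t, class_a)
    eps_b = denoiser(x_t, t, class_b)
    return lam * eps_a + (1.0 - lam) * eps_b
```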
Authors:Yuhui Ding, Thomas Hofmann
Abstract:
Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate Euclidean symmetries of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, specialized equivariant architectures limit the scalability and efficiency of diffusion models. In this paper, we propose an approach that relaxes such equivariance constraints. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained over the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency. Our code is available at https://github.com/skeletondyh/RADM
中文摘要:我们的方法通过放宽严格的等变性约束,利用样本相关的分子对齐构建潜在空间,并采用非等变扩散模型,在保持顶尖生成质量的同时显著提升了3D分子生成的训练与采样效率。
English Summary: Our method enhances 3D molecule generation by relaxing rigid equivariance constraints, achieving state-of-the-art quality with improved efficiency through sample-dependent molecular alignment and non-equivariant diffusion modeling.
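RADM learns the sample-dependent SO(3) transformation; the sketch below only illustrates the underlying idea of canonicalizing each molecule so a non-equivariant model sees aligned coordinates, using a simple PCA-based alignment as a stand-in for the learned one.

```python
import torch

def canonicalize(coords: torch.Tensor) -> torch.Tensor:
    """coords: (n_atoms, 3). Center the molecule, then rotate its principal axes
    onto the coordinate axes (sign ambiguities of the axes are not resolved here)."""
    centered = coords - coords.mean(dim=0, keepdim=True)
    cov = centered.T @ centered
    _, evecs = torch.linalg.eigh(cov)        # columns ordered by ascending eigenvalue
    R = evecs.flip(dims=[1])                 # largest-variance axis first
    if torch.det(R) < 0:                     # keep a proper rotation (det = +1)
        R[:, -1] = -R[:, -1]
    return centered @ R
```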
Authors:Mohammad Jalali, Haoyu Lei, Amin Gohari, Farzan Farnia
Abstract:
Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE
Authors:Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Abstract:
Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
中文摘要:Omni-DPO是一种新颖的双视角优化框架,通过基于数据固有质量和模型学习动态的自适应加权机制改进直接偏好优化方法,在多个基准测试中实现了卓越性能。
English Summary: Omni-DPO is a novel dual-perspective optimization framework that enhances Direct Preference Optimization by adaptively weighting preference pairs based on both inherent data quality and the model's learning dynamics, achieving superior performance across various benchmarks.
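As a rough illustration of the dual-perspective weighting described in the abstract, the sketch below scales a standard DPO loss per preference pair by a fixed data-quality weight and by a dynamic weight that decays as the model's implicit reward margin grows. The specific weighting functions are assumptions for exposition, not the exact Omni-DPO formulation.

```python
# Illustrative quality- and dynamics-weighted DPO-style loss (weighting scheme is assumed).
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      quality_scores, beta=0.1):
    # Standard DPO logits: implicit reward margin between chosen and rejected responses.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Perspective 1: inherent data quality (e.g., an external score gap), fixed per pair.
    w_quality = torch.sigmoid(quality_scores)
    # Perspective 2: learning dynamics -- down-weight pairs the model already fits well.
    with torch.no_grad():
        w_dynamics = 1.0 - torch.sigmoid(logits)
    per_pair = -F.logsigmoid(logits)
    return (w_quality * w_dynamics * per_pair).mean()
```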
Authors:Chieh Hubert Lin, Zhaoyang Lv, Songyin Wu, Zhen Xu, Thu Nguyen-Phuoc, Hung-Yu Tseng, Julian Straub, Numair Khan, Lei Xiao, Ming-Hsuan Yang, Yuheng Ren, Richard Newcombe, Zhao Dong, Zhengqin Li
Abstract:
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
Authors:Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
Abstract:
Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/
Authors:Sushant Gautam, Michael A. Riegler, Pål Halvorsen
Abstract:
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: https://github.com/Simula/Kvasir-VQA-x1 and https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1
中文: Kvasir-VQA-x1是一个大规模胃肠道内窥镜数据集,通过15.9万个问题-答案对和模拟影像伪影的视觉增强,提升了临床复杂性和视觉多样性,旨在推动临床多模态AI系统的发展。
English: Kvasir-VQA-x1 is a large-scale gastrointestinal endoscopy dataset that expands clinical complexity and visual diversity through 159,549 question-answer pairs and augmented imaging artifacts, aiming to advance multimodal AI for clinical decision support.
Authors:Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Abstract:
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.
中文:CoRT是一种后训练框架,通过提示工程集成代码解释器,有效提升大型推理模型在数学推理中的性能,显著减少计算资源消耗并提高准确率。
English: CoRT is a post-training framework that enhances Large Reasoning Models' efficiency in mathematical reasoning by integrating Code Interpreters through Hint-Engineering, achieving significant performance improvements and token reduction.
Authors:Maik Dannecker, Vasiliki Sideri-Lampretsa, Sophie Starck, Angeline Mihailov, Mathieu Milh, Nadine Girard, Guillaume Auzias, Daniel Rueckert
Abstract:
Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models, referred to as atlases, of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including GA, birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction, whereas its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at https://github.com/m-dannecker/CINeMA.
Chinese: CINeMA是一种新型框架,通过在潜在空间操作,能在低数据环境下快速构建高分辨率多模态脑图谱,支持对解剖特征的灵活条件设置,并实现分割和数据增强等下游任务。
English: CINeMA is a novel framework that rapidly constructs high-resolution, multimodal brain atlases in low-data settings by operating in latent space, enabling flexible conditioning on anatomical features and supporting downstream tasks like segmentation and data augmentation.
Authors:Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen
Abstract:
Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions across 33 hours of video data, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at https://github.com/KPeng9510/HopaDIFF.git.
中文: 本文开创性地提出文本参照引导的多人物动作分割方法,构建了首个RHAS133数据集,并设计了HopaDIFF框架——通过交叉输入门控注意力机制和傅里叶条件调控实现整体-局部感知的扩散模型,在多样化评估设置中取得了最优性能。
English: This paper introduces a novel textual reference-guided approach for multi-person action segmentation, presenting the RHAS133 dataset and proposing HopaDIFF—a holistic-partial aware diffusion framework that achieves state-of-the-art performance by enhancing long-range reasoning and fine-grained control through cross-input gate attention and Fourier conditioning.
Authors:Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang
Abstract:
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by $2.66\%-20.34\%$, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: https://github.com/tianyao-aka/RAPL.
中文: 大语言模型因知识过时和幻觉问题影响可靠性,而RAPL框架通过创新的图检索方法增强结构化推理和泛化能力,有效提升了知识图谱问答的性能。
English: Large Language Models face reliability issues due to outdated knowledge and hallucinations, which the proposed RAPL framework addresses through a novel graph retrieval approach that enhances structured reasoning and generalizability in knowledge graph question answering.
Authors:Songze Li, Mingxuan Zhang, Kang Wei, Shouling Ji
Abstract:
Deep reinforcement learning (DRL) has achieved remarkable success in a wide range of sequential decision-making domains, including robotics, healthcare, smart grids, and finance. Recent research demonstrates that attackers can efficiently exploit system vulnerabilities during the training phase to execute backdoor attacks, producing malicious actions when specific trigger patterns are present in the state observations. However, most existing backdoor attacks rely primarily on simplistic and heuristic trigger configurations, overlooking the potential efficacy of trigger optimization. To address this gap, we introduce TooBadRL (Trigger Optimization to Boost Effectiveness of Backdoor Attacks on DRL), the first framework to systematically optimize DRL backdoor triggers along three critical axes, i.e., temporal, spatial, and magnitude. Specifically, we first introduce a performance-aware adaptive freezing mechanism for injection timing. Then, we formulate dimension selection as a cooperative game, utilizing Shapley value analysis to identify the most influential state variable for the injection dimension. Furthermore, we propose a gradient-based adversarial procedure to optimize the injection magnitude under environment constraints. Evaluations on three mainstream DRL algorithms and nine benchmark tasks show that TooBadRL significantly improves attack success rates, while ensuring minimal degradation of normal task performance. These results highlight the previously underappreciated importance of principled trigger optimization in DRL backdoor attacks. The source code of TooBadRL can be found at https://github.com/S3IC-Lab/TooBadRL.
Chinese: 深度强化学习面临优化后门攻击的严重威胁,TooBadRL框架通过系统优化触发器的时间、空间和强度维度,显著提升攻击成功率,同时确保正常任务性能影响最小。
English: Deep reinforcement learning faces significant threats from optimized backdoor attacks, as demonstrated by the TooBadRL framework, which systematically enhances trigger effectiveness across temporal, spatial, and magnitude dimensions to achieve high attack success with minimal performance impact.
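The abstract formulates injection-dimension selection as a cooperative game solved with Shapley values. A standard Monte Carlo permutation estimator, sketched below with a placeholder value function, conveys the idea; the actual attack objective and environment interface used by TooBadRL are not shown here.

```python
# Hedged Monte-Carlo Shapley sketch for picking the injection dimension.
# `value_fn` is a stand-in for the attacker's payoff when perturbing a subset of state dimensions.
import random

def shapley_values(value_fn, n_dims, n_samples=200):
    """value_fn(subset) -> payoff of perturbing the state dimensions listed in `subset`.
    Returns an estimated Shapley value per dimension."""
    phi = [0.0] * n_dims
    for _ in range(n_samples):
        perm = list(range(n_dims))
        random.shuffle(perm)
        subset, prev = [], value_fn([])
        for d in perm:
            subset.append(d)
            cur = value_fn(subset)
            phi[d] += (cur - prev) / n_samples  # average marginal contribution
            prev = cur
    return phi  # argmax gives the most influential dimension for trigger injection
```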
Authors:Taku Okawara, Kenji Koide, Aoki Takanose, Shuji Oishi, Masashi Yokozuka, Kentaro Uno, Kazuya Yoshida
Abstract:
In this letter, we present tightly coupled LiDAR-IMU-leg odometry, which is robust to challenging conditions such as featureless environments and deformable terrains. We developed an online learning-based leg kinematics model named the neural leg kinematics model, which incorporates tactile information (foot reaction force) to implicitly express the nonlinear dynamics between robot feet and the ground. Online training of this model enhances its adaptability to weight load changes of a robot (e.g., assuming delivery or transportation tasks) and terrain conditions. According to the neural adaptive leg odometry factor and online uncertainty estimation of the leg kinematics model-based motion predictions, we jointly solve online training of this kinematics model and odometry estimation on a unified factor graph to retain the consistency of both. The proposed method was verified through real experiments using a quadruped robot in two challenging situations: 1) a sandy beach, representing an extremely featureless area with a deformable terrain, and 2) a campus, including multiple featureless areas and terrain types of asphalt, gravel (deformable terrain), and grass. Experimental results showed that our odometry estimation incorporating the neural leg kinematics model outperforms state-of-the-art works. Our project page is available for further details: https://takuokawara.github.io/RAL2025_project_page/
中文: 本研究提出了一种紧密耦合的LiDAR-IMU-腿足里程计系统,通过在线学习的神经腿足运动学模型,在特征缺失环境和可变形地形中实现了更强的适应性,经实验验证其性能优于现有先进方法。
English: This study introduces a tightly coupled LiDAR-IMU-leg odometry system enhanced by an online learning-based neural leg kinematics model, which improves robustness in featureless environments and on deformable terrains through adaptive training and integrated factor graph optimization.
Authors:Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon
Abstract:
This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we originally proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD, using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
中文: 本文介绍了BemaGANv2这一改进的GAN声码器,其生成器采用带Snake激活函数的AMP模块,判别器结合新型MED与MRD技术,能实现高质量音频生成,并通过完整评估和开源代码确保可复现性。
English: This paper introduces BemaGANv2, an enhanced GAN vocoder featuring AMP modules with Snake activation in the generator and a novel MED discriminator combined with MRD for superior high-fidelity audio generation, supported by comprehensive evaluations and open-source implementation.
Authors:Haiyang Yu, Yuchao Lin, Xuan Zhang, Xiaofeng Qian, Shuiwang Ji
Abstract:
We consider the task of predicting Hamiltonian matrices to accelerate electronic structure calculations, which plays an important role in physics, chemistry, and materials science. Motivated by the inherent relationship between the off-diagonal blocks of the Hamiltonian matrix and the SO(2) local frame, we propose a novel and efficient network, called QHNetV2, that achieves global SO(3) equivariance without the costly SO(3) Clebsch-Gordan tensor products. This is achieved by introducing a set of new efficient and powerful SO(2)-equivariant operations and performing all off-diagonal feature updates and message passing within SO(2) local frames, thereby eliminating the need of SO(3) tensor products. Moreover, a continuous SO(2) tensor product is performed within the SO(2) local frame at each node to fuse node features, mimicking the symmetric contraction operation. Extensive experiments on the large QH9 and MD17 datasets demonstrate that our model achieves superior performance across a wide range of molecular structures and trajectories, highlighting its strong generalization capability. The proposed SO(2) operations on SO(2) local frames offer a promising direction for scalable and symmetry-aware learning of electronic structures. Our code will be released as part of the AIRS library https://github.com/divelab/AIRS.
中文: QHNetV2通过引入高效的SO(2)等变操作在局部坐标系中实现全局SO(3)等变性,在预测哈密顿矩阵的任务中展现了卓越性能和泛化能力,同时避免了昂贵的SO(3)张量积计算。
English: QHNetV2 introduces efficient SO(2)-equivariant operations within local frames to achieve global SO(3) equivariance for predicting Hamiltonian matrices, demonstrating superior performance and generalization on molecular datasets while eliminating costly SO(3) tensor products.
Authors:Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen
Abstract:
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Besides, it further introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at https://github.com/AIDC-AI/LPO.
Chinese: 该研究提出位置偏好优化(LPO)方法,通过信息熵和动态奖励机制利用位置数据提升自主代理与图形用户界面的交互精度,在离线和在线评估中均取得了最先进的性能表现。
English: The study introduces Location Preference Optimization (LPO), a novel method that enhances autonomous agents' interaction precision with Graphical User Interfaces by utilizing locational data through information entropy and dynamic rewards, achieving state-of-the-art results in both offline and online evaluations.
Authors:Yuxuan Kuang, Haoran Geng, Amine Elhafsi, Tan-Dzung Do, Pieter Abbeel, Jitendra Malik, Marco Pavone, Yue Wang
Abstract:
Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: https://usc-gvl.github.io/SkillBlender-web/.
中文摘要:SkillBlender是一种分层强化学习框架,通过动态融合预训练的基础技能实现多功能人形机器人移动操作,在减少任务特定调整的同时显著优于现有方法,产生更自然可行的动作。
English Summary: SkillBlender is a hierarchical reinforcement learning framework that blends pretrained primitive skills to enable versatile humanoid loco-manipulation with minimal task-specific tuning, outperforming existing methods while producing more natural and feasible movements.
Authors:Xuemei Cao, Hanlin Gu, Xin Yang, Bingjun Wei, Haoyang Liang, Xiangkun Wang, Tianrui Li
Abstract:
Continual Learning (CL) primarily aims to retain knowledge to prevent catastrophic forgetting and transfer knowledge to facilitate learning new tasks. Unlike traditional methods, we propose a novel perspective: CL not only needs to prevent forgetting, but also requires intentional forgetting. This arises from existing CL methods ignoring biases in real-world data, leading the model to learn spurious correlations that transfer and amplify across tasks. From feature extraction and prediction results, we find that data biases simultaneously reduce CL's ability to retain and transfer knowledge. To address this, we propose ErrorEraser, a universal plugin that removes erroneous memories caused by biases in CL, enhancing performance in both new and old tasks. ErrorEraser consists of two modules: Error Identification and Error Erasure. The former learns the probability density distribution of task data in the feature space without prior knowledge, enabling accurate identification of potentially biased samples. The latter ensures only erroneous knowledge is erased by shifting the decision space of representative outlier samples. Additionally, an incremental feature distribution learning strategy is designed to reduce the resource overhead during error identification in downstream tasks. Extensive experimental results show that ErrorEraser significantly mitigates the negative impact of data biases, achieving higher accuracy and lower forgetting rates across three types of CL methods. The code is available at https://github.com/diadai/ErrorEraser.
中文摘要:本文提出ErrorEraser这一持续学习通用插件,通过误差识别和擦除双模块解决数据偏差问题,无需先验知识即可有效消除错误记忆,显著提升新旧任务性能。
English Summary: This paper introduces ErrorEraser, a universal plugin for continual learning that addresses data bias issues by identifying and erasing erroneous memories through two specialized modules, significantly improving performance across tasks.
Authors:Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu
Abstract:
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.
中文: RePO通过回放策略利用离策略样本进行更高效的策略优化,在适度增加计算成本的同时,相比GRPO实现了显著的性能提升。
English: RePO introduces replay strategies to utilize off-policy samples for more efficient policy optimization in LLMs, achieving significant performance gains over GRPO with a moderate increase in computational cost.
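A minimal sketch of the replay idea, under the assumption that off-policy responses for the same prompt are stored with their rewards and mixed into the on-policy group before GRPO-style advantage normalization. The buffer layout and sampling rule are illustrative, not RePO's exact replay strategies.

```python
# Hedged sketch: mixing on-policy rollouts with replayed off-policy samples
# before computing group-relative advantages.
import random

def build_group(prompt, on_policy_samples, replay_buffer, n_on=8, n_off=8):
    """Return a mixed group of responses for one prompt.

    on_policy_samples: fresh generations for `prompt`.
    replay_buffer: dict mapping prompt -> list of responses kept from earlier steps.
    """
    group = list(on_policy_samples[:n_on])
    past = replay_buffer.get(prompt, [])
    group += random.sample(past, min(n_off, len(past)))
    return group

def group_relative_advantages(rewards):
    # Advantage of each sample relative to the group statistics (GRPO-style normalization).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(var ** 0.5, 1e-8)
    return [(r - mean) / std for r in rewards]
```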
Authors:Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella
Abstract:
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3x faster inference than Llama2-7B and 3.0x faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA's dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions. Codes are available at https://github.com/utnslab/DSLA-Serve.
中文摘要:DSLA-Serve通过双状态线性注意力机制解决传统线性注意力对近期标记的过度关注问题,并采用自适应蒸馏框架在保持任务性能的同时,实现比同类模型快2.3-3倍的推理速度。
English Summary: DSLA-Serve introduces dual-state linear attention to mitigate linear attention's recency bias and an adaptive distillation framework that accelerates inference by 2.3-3x over comparable models while maintaining task performance.
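The following sketch illustrates one plausible form of a dual-state linear-attention recurrence: one state accumulates history without decay while a second, decayed state tracks recency, and their readouts are mixed per token. The recurrence, decay rate, and mixing weights are assumptions for exposition rather than the released DSLA kernels.

```python
# Minimal sketch of a dual-state linear-attention recurrence (assumed form, not the DSLA code).
import torch

def dual_state_linear_attention(q, k, v, decay=0.95):
    """q, k, v: (T, d) tensors. Returns (T, d) outputs.

    State A accumulates history without decay; state B applies exponential decay,
    emphasizing recent tokens. Mixing both readouts keeps long-range and local
    dependencies, which is the short-range bias the abstract attributes to
    single-state linear attention.
    """
    T, d = q.shape
    state_hist = torch.zeros(d, d)
    state_recent = torch.zeros(d, d)
    outputs = []
    for t in range(T):
        kv = torch.outer(k[t], v[t])
        state_hist = state_hist + kv              # unweighted historical state
        state_recent = decay * state_recent + kv  # recency-biased state
        out = q[t] @ (0.5 * state_hist + 0.5 * state_recent)
        outputs.append(out)
    return torch.stack(outputs)
```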
Authors:Mojtaba Nafez, Amirhossein Koochakian, Arad Maleki, Jafar Habibi, Mohammad Hossein Rohban
Abstract:
Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of $53.2\%$ in AD and $68.5\%$ in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at https://github.com/rohban-lab/PatchGuard .
Chinese: 本研究提出了PatchGuard,一种利用伪异常和视觉Transformer框架的鲁棒异常检测与定位方法,显著提升了对抗攻击的防御能力,在对抗性和标准环境下均实现了显著性能提升。
English: This study introduces PatchGuard, a robust method for anomaly detection and localization that uses pseudo anomalies and a Vision Transformer framework to significantly enhance resistance against adversarial attacks, achieving notable performance improvements in both adversarial and standard settings.
Authors:Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsuba
Abstract:
Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.
中文: FedRAG是一个框架,支持在集中式和联邦式架构下对检索增强生成系统进行微调,集成了先进方法与现代RAG工具,填补了现有工具的空白。
English: FedRAG is a framework that enables fine-tuning of retrieval-augmented generation systems across both centralized and federated architectures, integrating advanced methods and modern RAG tools to bridge existing gaps.
Authors:Yilin Zhuang, Karthik Duraisamy
Abstract:
Accurate probabilistic weather forecasting demands both high accuracy and efficient uncertainty quantification, challenges that overburden both ensemble numerical weather prediction (NWP) and recent machine-learning methods. We introduce LaDCast, the first global latent-diffusion framework for medium-range ensemble forecasting, which generates hourly ensemble forecasts entirely in a learned latent space. An autoencoder compresses high-dimensional ERA5 reanalysis fields into a compact representation, and a transformer-based diffusion model produces sequential latent updates with arbitrary hour initialization. The model incorporates Geometric Rotary Position Embedding (GeoRoPE) to account for the Earth's spherical geometry, a dual-stream attention mechanism for efficient conditioning, and sinusoidal temporal embeddings to capture seasonal patterns. LaDCast achieves deterministic and probabilistic skill close to that of the European Centre for Medium-Range Forecast IFS-ENS, without any explicit perturbations. Notably, LaDCast demonstrates superior performance in tracking rare extreme events such as cyclones, capturing their trajectories more accurately than established models. By operating in latent space, LaDCast reduces storage and compute by orders of magnitude, demonstrating a practical path toward forecasting at kilometer-scale resolution in real time. We open-source our code and models and provide the training and evaluation pipelines at: https://github.com/tonyzyl/ladcast.
中文摘要:LaDCast首次提出全球潜在扩散框架用于中期集合天气预报,在保持与主流模型相当准确度的同时大幅降低计算成本,并在追踪极端天气事件方面展现出卓越性能。
English Summary: LaDCast introduces the first global latent-diffusion framework for medium-range ensemble weather forecasting, achieving accuracy comparable to leading models while significantly reducing computational costs and demonstrating superior performance in tracking extreme weather events.
Authors:Haoyuan Cai, Zhenghao Peng, Bolei Zhou
Abstract:
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions, but current methods impose high cognitive demands on human supervisors. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations. AIM utilizes a proxy Q-function to mimic the human intervention rule and adjusts intervention requests based on the alignment between agent and human actions. By assigning high Q-values when the agent deviates from the expert and decreasing these values as the agent becomes proficient, the proxy Q-function enables the agent to assess the real-time alignment with the expert and request assistance when needed. Our expert-in-the-loop experiments reveal that AIM significantly reduces expert monitoring efforts in both continuous and discrete control tasks. Compared to the uncertainty-based baseline Thrifty-DAgger, our method achieves a 40% improvement in terms of human take-over cost and learning efficiency. Furthermore, AIM effectively identifies safety-critical states for expert assistance, thereby collecting higher-quality expert demonstrations and reducing overall expert data and environment interactions needed. Code and demo video are available at https://github.com/metadriverse/AIM.
中文: 提出的自适应干预机制(AIM)通过代理Q函数自适应地请求专家演示,显著减少了交互式模仿学习中的人工监督需求,与基线方法相比效率提升40%,并能更有效地识别安全关键状态。
English: The proposed Adaptive Intervention Mechanism (AIM) significantly reduces human supervision in Interactive Imitation Learning by using a proxy Q-function to adaptively request expert demonstrations, achieving 40% improvement in efficiency and better identification of safety-critical states compared to baseline methods.
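A small sketch of the robot-gated rule: a proxy Q-function scores how far the agent's action deviates from the expert's, and a demonstration is requested only when that score crosses a threshold. The fixed threshold is an illustrative simplification of AIM's adaptive criterion.

```python
# Sketch of a robot-gated intervention rule driven by a proxy Q-function (threshold rule is assumed).
import torch

def should_request_expert(proxy_q, state, agent_action, threshold=0.5):
    """proxy_q(state, action) is trained to be high where the agent deviates from the expert.
    Request a human demonstration only when the predicted deviation is large."""
    with torch.no_grad():
        deviation = proxy_q(state, agent_action)
    return deviation.item() > threshold
```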
Authors:Míriam Barrabés, Daniel Mas Montserrat, Kapal Dev, Alexander G. Ioannidis
Abstract:
Feature shifts between data sources are present in many applications involving healthcare, biomedical, socioeconomic, financial, survey, and multi-sensor data, among others, where unharmonized heterogeneous data sources, noisy data measurements, or inconsistent processing and standardization pipelines can lead to erroneous features. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features originating them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained with a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at https://github.com/AI-sandbox/FSL-Net.
Chinese: FSL-Net 是一种神经网络,能够无需重新训练即可快速准确地定位大型高维数据集中的特征偏移。
English: FSL-Net is a neural network that accurately and efficiently localizes feature shifts in large, high-dimensional datasets without requiring retraining for new data.
Authors:Chaoyang Zhou, Shunyu Liu, Zengmao Wang, Di Wang, Rong-Cheng Tu, Bo Du, Dacheng Tao
Abstract:
Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since the response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We apply the proposed regularization to the advanced outcome reward model, improving its performance on RewardBench. Besides, we show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results. Our code is provided in https://github.com/chaoyang101/ICRM.
中文: 本文提出轨迹内一致性正则化方法,利用生成概率在响应过程中建立奖励一致性,从而提升奖励模型在基准测试中的表现,并优化策略对齐和推理时验证效果。
English: This paper introduces intra-trajectory consistency regularization, which uses generation probabilities to align rewards across response processes, enhancing reward model performance on benchmarks and improving policy alignment and inference-time verification.
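One way to picture the regularization: adjacent processes in a response that are linked by high next-token generation probability are pulled toward similar rewards, so the response-level signal can propagate along the trajectory. The quadratic penalty below is an assumed form, not necessarily the exact regularizer used in the paper.

```python
# Sketch of an intra-trajectory consistency penalty (weighting form is assumed).
import torch

def consistency_regularizer(step_rewards, next_token_probs):
    """step_rewards: (T,) rewards assigned to successive processes in one response.
    next_token_probs: (T-1,) generation probabilities linking process t to t+1.

    Adjacent processes connected by high-probability continuations are encouraged
    to receive similar rewards, letting the response-level signal propagate.
    """
    diffs = (step_rewards[1:] - step_rewards[:-1]) ** 2
    return (next_token_probs * diffs).mean()
```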
Authors:Shayan Shekarforoush, David B. Lindell, Marcus A. Brubaker, David J. Fleet
Abstract:
Cryo-EM is a transformational paradigm in molecular biology where computational methods are used to infer 3D molecular structure at atomic resolution from extremely noisy 2D electron microscope images. At the forefront of research is how to model the structure when the imaged particles exhibit non-rigid conformational flexibility and compositional variation where parts are sometimes missing. We introduce a novel 3D reconstruction framework with a hierarchical Gaussian mixture model, inspired in part by Gaussian Splatting for 4D scene reconstruction. In particular, the structure of the model is grounded in an initial process that infers a part-based segmentation of the particle, providing essential inductive bias in order to handle both conformational and compositional variability. The framework, called CryoSPIRE, is shown to reveal biologically meaningful structures on complex experimental datasets, and establishes a new state-of-the-art on CryoBench, a benchmark for cryo-EM heterogeneity methods.
Authors:Haozhen Zhang, Tao Feng, Jiaxuan You
Abstract:
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
中文: 本文提出Router-R1强化学习框架,通过将多模型路由构建为序列决策过程,动态调用并整合不同大语言模型的优势,在多个基准测试中实现了性能与成本的最优平衡。
English: This paper introduces Router-R1, a reinforcement learning framework that enhances multi-LLM routing by dynamically selecting and aggregating models through sequential decision-making, achieving superior performance and cost efficiency across diverse benchmarks.
Authors:Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, Jing Qin, Liansheng Wang
Abstract:
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
中文: ALTA方法通过适配预训练的视觉模型,以极低的计算成本实现了医学视觉与语言的高效对齐,在检索和零样本分类任务中表现卓越。
English: The proposed ALTA method efficiently aligns medical vision and language by adapting a pretrained vision model from masked record modeling, achieving superior performance in retrieval and zero-shot classification with minimal computational resources.
Authors:Victoria Hankemeier, Malte Schilling
Abstract:
Developments in Deep Learning have significantly improved time series forecasting by enabling more accurate modeling of complex temporal dependencies inherent in sequential data. The effectiveness of such models is often demonstrated on limited sets of specific real-world data. Although this allows for comparative analysis, it still does not demonstrate how specific data characteristics align with the architectural strengths of individual models. Our research aims at uncovering clear connections between time series characteristics and particular models. We introduce a novel dataset generated using Gaussian Processes, specifically designed to display distinct, known characteristics for targeted evaluations of model adaptability to them. Furthermore, we present TimeFlex, a new model that incorporates a modular architecture tailored to handle diverse temporal dynamics, including trends and periodic patterns. This model is compared to current state-of-the-art models, offering a deeper understanding of how models perform under varied time series conditions.
Chinese: 深度学习进展通过建模复杂依赖提升了时间序列预测能力,但现有评估未能将数据特性与模型优势关联,为此引入高斯过程生成的数据集和TimeFlex模型,以针对性评估模型适应性。
English: Deep learning advancements have enhanced time series forecasting by modeling complex dependencies, yet current evaluations fail to link data traits with model strengths, prompting the introduction of a Gaussian Processes dataset and TimeFlex model for targeted adaptability assessments.
Authors:Hang Ye, Gaoxiang Duan, Haoran Zeng, Yangxin Zhu, Lingxue Meng, Xiaoying Zheng, Yongxin Zhu
Abstract:
Multivariate long-term and efficient time series forecasting is a key requirement for a variety of practical applications, and there are complex interleaving time dynamics in time series data that require decomposition modeling. Traditional time series decomposition methods rely on a single set of fixed rules, which is insufficient for mining the underlying information of a series and adapting to the dynamic characteristics of complex series. On the other hand, Transformer-based models for time series forecasting struggle to effectively model long sequences and intricate dynamic relationships due to their high computational complexity. To overcome these limitations, we introduce KARMA, with an Adaptive Time Channel Decomposition module (ATCD) to dynamically extract trend and seasonal components. It further integrates a Hybrid Frequency-Time Decomposition module (HFTD) to further decompose the series into frequency-domain and time-domain components. These components are coupled with a multi-scale Mamba-based KarmaBlock to efficiently process global and local information in a coordinated manner. Experiments on eight real-world datasets from diverse domains demonstrate that KARMA significantly outperforms mainstream baseline methods in both predictive accuracy and computational efficiency. Code and full results are available at this repository: https://github.com/yedadasd/KARMA
中文摘要:KARMA通过自适应时间通道分解和混合频时分解模块动态提取时序成分,结合多尺度Mamba模块协同处理信息,在多个真实数据集上显著超越了主流基线方法的预测精度与计算效率。
English Summary: KARMA introduces adaptive decomposition modules and Mamba-based blocks to dynamically extract time series components, significantly outperforming existing methods in both accuracy and efficiency across multiple real-world datasets.
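For intuition only, the sketch below performs a fixed-rule version of the two decompositions the abstract describes: a moving-average trend/seasonal split and an FFT-based frequency/time split. KARMA's ATCD and HFTD modules learn these decompositions adaptively, so this hand-crafted version is merely illustrative.

```python
# Illustrative fixed-rule decomposition in the spirit of KARMA's trend/seasonal and
# frequency/time splits (the learned ATCD/HFTD modules are not reproduced here).
import torch
import torch.nn.functional as F

def simple_decompose(x, kernel_size=25, freq_keep=0.1):
    """x: (T, C) multivariate series. Returns trend, seasonal, low-frequency, high-frequency parts."""
    pad = kernel_size // 2
    # Moving-average trend, with replicate padding so the output length matches T.
    padded = F.pad(x.t().unsqueeze(0), (pad, pad), mode="replicate")
    trend = F.avg_pool1d(padded, kernel_size, stride=1).squeeze(0).t()
    seasonal = x - trend
    # Frequency-domain split: keep a fraction of low-frequency components.
    spec = torch.fft.rfft(x, dim=0)
    cutoff = max(1, int(freq_keep * spec.shape[0]))
    low = spec.clone()
    low[cutoff:] = 0
    low_freq = torch.fft.irfft(low, n=x.shape[0], dim=0)
    high_freq = x - low_freq
    return trend, seasonal, low_freq, high_freq
```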
Authors:Andrew Shin
Abstract:
While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B-parameter mathematical reasoning model on an RTX 3080 Ti with 16GB of memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. https://github.com/shinandrew/YouronMath.
Chinese: 该研究通过强化学习和内存优化技术,在单个游戏显卡上成功训练出性能优异的15亿参数数学推理模型,打破了高性能AI研究必须依赖庞大计算资源的传统模式。
English: This research demonstrates that a single gaming GPU can train a competitive 1.5B-parameter mathematical reasoning model using reinforcement learning and memory optimization, challenging the need for massive computational infrastructure.
Authors:Chongyi Zheng, Seohong Park, Sergey Levine, Benjamin Eysenbach
Abstract:
Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Our experiments on $36$ state-based and $4$ image-based benchmark tasks show that, compared with alternative pre-training methods, the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom
中文: 大规模预训练通过基于意图的流匹配模型预测强化学习中具有长期依赖性的未来状态占用,相比现有方法显著提升了任务成功率与回报中位数。
English: Large-scale pre-training in reinforcement learning addresses long-term action dependencies by modeling future state occupancy with intention-aware flow matching, achieving significant performance gains over existing methods.
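The occupancy model is trained with flow matching; a textbook conditional flow-matching objective, conditioned here on the current state and an intention latent, is sketched below. The pairing of current and future states and the intention interface are simplifying assumptions, not the full InFOM training recipe.

```python
# Minimal conditional flow-matching objective sketch for a future-state occupancy model.
# `vector_field` is a user-supplied network; the conditioning signature is assumed.
import torch

def flow_matching_loss(vector_field, s_current, s_future, intention):
    """vector_field(x_t, t, s_current, intention) predicts a velocity; we regress it
    toward the straight-line interpolation between noise and a sampled future state."""
    noise = torch.randn_like(s_future)
    t = torch.rand(s_future.shape[0], 1)
    x_t = (1 - t) * noise + t * s_future          # linear interpolation path
    target_velocity = s_future - noise            # constant velocity along that path
    pred = vector_field(x_t, t, s_current, intention)
    return ((pred - target_velocity) ** 2).mean()
```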
Authors:Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
Abstract:
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained model without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with 4K token budget in AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
Chinese: SeerAttention-R 是一种专为推理模型长解码设计的稀疏注意力框架,在保持接近无损精度的同时,通过优化内核在 H100 GPU 上实现了高达 9 倍的加速。
English: SeerAttention-R is a sparse attention framework designed for efficient long decoding in reasoning models, maintaining near-lossless accuracy with high sparsity and achieving up to 9x speedup on H100 GPUs through optimized kernels.
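A rough sketch of gated block-sparse decoding: keys are pooled per block, a lightweight gate scores each block for the current query, and attention is computed only over the top-scoring blocks. The pooling, gate interface, and top-k rule are assumptions; SeerAttention-R learns its gate by self-distillation and relies on optimized TileLang kernels not reflected here.

```python
# Illustrative block-sparse decoding step with a learned gate (interface is assumed).
import torch
import torch.nn.functional as F

def sparse_decode_step(q, k_cache, v_cache, gate_proj, block_size=64, top_k=4):
    """q: (d,), k_cache/v_cache: (T, d). Select a few KV blocks via a gate and attend only to them."""
    T, d = k_cache.shape
    n_blocks = (T + block_size - 1) // block_size
    # Pool keys per block and score each block with a lightweight gate projection.
    pooled = torch.stack([k_cache[i * block_size:(i + 1) * block_size].mean(dim=0)
                          for i in range(n_blocks)])
    scores = pooled @ gate_proj(q)                      # one score per block
    keep = scores.topk(min(top_k, n_blocks)).indices
    idx = torch.cat([torch.arange(i * block_size, min((i + 1) * block_size, T))
                     for i in keep.tolist()])
    # Dense attention restricted to the selected blocks only.
    attn = F.softmax((k_cache[idx] @ q) / d ** 0.5, dim=0)
    return attn @ v_cache[idx]
```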
Authors:Shiqin Tang, Shujian Yu
Abstract:
Extracting meaningful latent representations from high-dimensional sequential data is a crucial challenge in machine learning, with applications spanning natural science and engineering. We introduce InfoDPCCA, a dynamic probabilistic Canonical Correlation Analysis (CCA) framework designed to model two interdependent sequences of observations. InfoDPCCA leverages a novel information-theoretic objective to extract a shared latent representation that captures the mutual structure between the data streams and balances representation compression and predictive sufficiency while also learning separate latent components that encode information specific to each sequence. Unlike prior dynamic CCA models, such as DPCCA, our approach explicitly enforces the shared latent space to encode only the mutual information between the sequences, improving interpretability and robustness. We further introduce a two-step training scheme to bridge the gap between information-theoretic representation learning and generative modeling, along with a residual connection mechanism to enhance training stability. Through experiments on synthetic and medical fMRI data, we demonstrate that InfoDPCCA excels as a tool for representation learning. Code of InfoDPCCA is available at https://github.com/marcusstang/InfoDPCCA.
中文: InfoDPCCA是一种动态概率典型相关分析框架,通过信息论目标提取共享和序列特定的潜在表示,在建模相互依赖的序列数据时提升了可解释性和鲁棒性。
English: InfoDPCCA is a dynamic probabilistic CCA framework that uses an information-theoretic objective to extract shared and sequence-specific latent representations, improving interpretability and robustness in modeling interdependent sequential data.
Authors:Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao
Abstract:
Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.
中文: StreamSplat是一种开创性的前馈框架,能将未校准视频流在线转换为动态3D高斯溅射表示,在实时重建、动态建模和计算效率方面表现卓越。
English: StreamSplat is a pioneering feed-forward framework that converts uncalibrated video streams into dynamic 3D Gaussian Splatting representations online, excelling in real-time reconstruction, dynamic modeling, and computational efficiency.
Authors:Yuni Susanti, Michael Färber
Abstract:
Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery -- which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) -- offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality
中文摘要:本研究提出了一种将知识图谱与大型语言模型相结合的新方法,通过元路径子图排序和零样本提示增强基于知识的因果发现,在多个数据集上相比基线方法F1分数最高提升44.4分。
English Summary: This study introduces a novel approach that integrates Knowledge Graphs with Large Language Models to enhance knowledge-based causal discovery, achieving up to 44.4-point F1 score improvements over baselines through metapath subgraph ranking and zero-shot prompting.
Authors:Michael Färber, David Lamprecht, Yuni Susanti
Abstract:
Graph Neural Networks (GNNs) have substantially advanced the field of recommender systems. However, despite the creation of more than a thousand knowledge graphs (KGs) under the W3C standard RDF, their rich semantic information has not yet been fully leveraged in GNN-based recommender systems. To address this gap, we propose a comprehensive integration of RDF KGs with GNNs that utilizes both the topological information from RDF object properties and the content information from RDF datatype properties. Our main focus is an in-depth evaluation of various GNNs, analyzing how different semantic feature initializations and types of graph structure heterogeneity influence their performance in recommendation tasks. Through experiments across multiple recommendation scenarios involving multi-million-node RDF graphs, we demonstrate that harnessing the semantic richness of RDF KGs significantly improves recommender systems and lays the groundwork for GNN-based recommender systems for the Linked Open Data cloud. The code and data are available on our GitHub repository: https://github.com/davidlamprecht/rdf-gnn-recommendation
中文: 本研究提出将RDF知识图谱与图神经网络相结合,充分利用其拓扑结构和内容信息,并通过大规模实验证明,利用RDF语义丰富性能显著提升推荐系统的性能。
English: This study proposes integrating RDF knowledge graphs with Graph Neural Networks to leverage both topological and content information, demonstrating through large-scale experiments that utilizing RDF semantic richness significantly enhances recommendation performance.
Authors:Mahesh Godavarti
Abstract:
Transformers have demonstrated remarkable success in sequence modeling, yet effectively incorporating positional information remains a challenging and active area of research. In this paper, we introduce JoFormer, a journey-based Transformer architecture grounded in a recently proposed non-commutative algebra for composing transformations across positions. JoFormer represents relative positions through learnable directional transforms that are sequentially composed along the input, thereby extending and generalizing existing approaches based on relative position representations. We derive the JoFormer attention mechanism from first principles and show that it subsumes standard methods such as rotary transformations as special cases. To evaluate its effectiveness, we compare JoFormer to the RoFormer baseline on the Tiny Shakespeare character-level language modeling task. Our results demonstrate that JoFormer consistently achieves lower perplexity and faster convergence, highlighting the advantages of its more expressive, journey-based treatment of position. Notably, the per-token JoFormer is still a primitive, conceptual variant with layer-independent angles, yet it already demonstrates strong performance, underscoring its promise as a proof of concept for more expressive architectures. We conclude by discussing how JoFormer offers a principled approach to integrating positional structure into Transformer architectures. The code used in this work is available at https://github.com/mahesh-godavarti/joformer.
中文: JoFormer是一种基于可学习方向变换的新型Transformer架构,通过相对位置建模在语言建模任务中实现了比RoFormer更低的困惑度和更快的收敛速度。
English: JoFormer is a novel Transformer architecture that uses learnable directional transforms to model relative positions, achieving lower perplexity and faster convergence than RoFormer in language modeling tasks.
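To make the journey-based positional idea more concrete, here is a minimal sketch in which each position carries a learnable per-step angle and queries/keys are rotated by the cumulative (composed) angle. This is a simplification for illustration only: it roughly recovers rotary embeddings when every step shares a fixed angle, and it omits the non-commutative algebra developed in the paper.

```python
# Minimal "journey"-style relative positions via composed learnable rotations
# (illustrative simplification, not the JoFormer reference implementation).
import torch

def journey_rotate(x, step_angles):
    """x: (seq, heads, dim) with even dim; step_angles: (seq, dim//2) learnable.
    Composing per-step 2x2 rotations equals rotating by the cumulative angle."""
    cum = torch.cumsum(step_angles, dim=0)                     # angle of the journey 0..p
    cos, sin = cum.cos()[:, None, :], cum.sin()[:, None, :]    # broadcast over heads
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

seq, heads, dim = 16, 4, 32
step_angles = torch.nn.Parameter(0.01 * torch.randn(seq, dim // 2))
q = journey_rotate(torch.randn(seq, heads, dim), step_angles)
k = journey_rotate(torch.randn(seq, heads, dim), step_angles)
scores = torch.einsum("qhd,khd->hqk", q, k) / dim**0.5         # attention logits
```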
Authors:Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang
Abstract:
Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in the table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer
中文: TableDreamer提出了一种渐进式、弱点引导的数据合成框架,通过生成多样化数据并迭代修正大语言模型在表格理解中的不足,以少量合成数据显著提升了模型性能。
English: TableDreamer introduces a progressive, weakness-guided framework to enhance table instruction tuning by synthesizing diverse data and iteratively addressing LLM weaknesses, significantly improving accuracy with minimal synthetic data.
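The iterative, weakness-guided loop described above can be summarized schematically as follows. The helpers `synthesize`, `expand_around`, `target_llm`, and `is_correct` are hypothetical placeholders standing in for the paper's LLM-driven components.

```python
# Schematic weakness-guided synthesis loop (placeholders, not the released code).
def weakness_guided_synthesis(synthesize, expand_around, target_llm, is_correct,
                              n_seed=100, n_rounds=3):
    training_data = []
    pool = synthesize(n_seed)                     # diverse seed tables + instructions
    for _ in range(n_rounds):
        weaknesses = [ex for ex in pool
                      if not is_correct(target_llm(ex["instruction"], ex["table"]),
                                        ex["answer"])]
        training_data.extend(weaknesses)          # keep examples the model fails on
        pool = expand_around(weaknesses)          # explore the input space near failures
    return training_data                          # used to fine-tune the target LLM
```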
Authors:Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata
Abstract:
Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at https://github.com/ExplainableML/TiViT.
中文: TiViT框架将时间序列转换为图像以利用预训练的视觉变换器,实现了最先进的分类性能,并揭示了在非视觉领域重用视觉表示的新方向。
English: The TiViT framework transforms time series into images to utilize pretrained Vision Transformers, achieving state-of-the-art classification performance and revealing a novel approach for reusing visual representations in non-visual domains.
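A minimal sketch of the time-series-to-image step is shown below, assuming a simple row-wise stacking of the series into a pseudo-image and an abstract frozen encoder. The released TiViT instead extracts hidden representations from large OpenCLIP models and probes intermediate layers; `vit_encoder` and `linear_head` here are assumed placeholders.

```python
# Illustrative conversion of a univariate series into a ViT-ready image.
import torch
import torch.nn.functional as F

def series_to_image(x, rows=16, size=224):
    """x: (T,) 1-D series -> (1, 3, size, size) pseudo-image."""
    x = (x - x.mean()) / (x.std() + 1e-8)                   # normalize the series
    T = x.numel()
    cols = -(-T // rows)                                     # ceil division
    grid = F.pad(x, (0, rows * cols - T)).view(rows, cols)   # stack segments row-wise
    img = grid[None, None]                                   # (1, 1, rows, cols)
    img = F.interpolate(img, size=(size, size), mode="bilinear", align_corners=False)
    return img.repeat(1, 3, 1, 1)                            # fake RGB channels for a ViT

def classify(series_batch, vit_encoder, linear_head):
    feats = torch.cat([vit_encoder(series_to_image(x)) for x in series_batch])
    return linear_head(feats)                                # linear probe on frozen features
```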
Authors:Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee
Abstract:
Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of $\textbf{spatial multigraphs}$: graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic $\textit{Hamiltonian spectral graphs}$ across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal's energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release $\texttt{Poly2Graph}$: a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities for data-driven scientific discovery in condensed matter physics and beyond.
中文摘要:本文提出了HSG-12M,这是首个大规模空间多重图数据集,通过保留节点间多条几何路径并基于光谱势数据构建,为几何感知的图学习及科学发现奠定了基础。
English Summary: This paper introduces HSG-12M, the first large-scale spatial multigraph dataset that preserves multiple geometrically distinct paths between nodes, derived from spectral potential data to enable geometry-aware graph learning and scientific discovery.
Authors:Kiran Purohit, V Venktesh, Sourangshu Bhattacharya, Avishek Anand
Abstract:
The in-context learning paradigm with LLMs has been instrumental in advancing a wide range of natural language processing tasks. The selection of few-shot examples (exemplars / demonstration samples) is essential for constructing effective prompts under context-length budget constraints. In this paper, we formulate the exemplar selection task as a top-m best arms identification problem. A key challenge in this setup is the exponentially large number of arms that need to be evaluated to identify the m-best arms. We propose CASE (Challenger Arm Sampling for Exemplar selection), a novel sample-efficient selective exploration strategy that maintains a shortlist of "challenger" arms, which are current candidates for the top-m arms. In each iteration, only one of the arms from this shortlist or the current top-m set is pulled, thereby reducing sample complexity and, consequently, the number of LLM evaluations. Furthermore, we model the scores of exemplar subsets (arms) using a parameterized linear scoring function, leading to a stochastic linear bandit setting. CASE achieves remarkable efficiency gains of up to 7x speedup in runtime while requiring 7x fewer LLM calls (87% reduction) without sacrificing performance compared to state-of-the-art exemplar selection methods. We release our code and data at https://github.com/kiranpurohit/CASE
中文: 本文提出CASE方法,将示例选择建模为最优臂识别问题,通过高效采样策略在保持性能的同时,将运行速度提升7倍并减少87%的大语言模型调用次数。
English: This paper introduces CASE, a sample-efficient strategy that formulates exemplar selection as a top-m best arms identification problem, achieving up to 7x faster runtime and 87% fewer LLM calls while maintaining performance compared to existing methods.
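The pull-one-arm-per-round idea can be illustrated with a simplified LUCB-style top-m loop: each iteration evaluates either the weakest current top-m member or the strongest outside challenger. The `pull` callback below stands in for an LLM evaluation of an exemplar subset, and the linear scoring model over subsets used in the paper is omitted from this sketch.

```python
# Simplified top-m identification with a single evaluation per round
# (illustrative LUCB-style variant, not the CASE reference implementation).
import math
import random

def top_m_challenger(n_arms, m, pull, budget=500, c=2.0):
    counts, means = [0] * n_arms, [0.0] * n_arms
    for arm in range(n_arms):                       # one initial pull per arm
        means[arm], counts[arm] = pull(arm), 1
    for t in range(n_arms, budget):
        bonus = [c * math.sqrt(math.log(t + 1) / counts[a]) for a in range(n_arms)]
        order = sorted(range(n_arms), key=lambda a: means[a], reverse=True)
        top, rest = order[:m], order[m:]
        weakest = min(top, key=lambda a: means[a] - bonus[a])        # uncertain member
        challenger = max(rest, key=lambda a: means[a] + bonus[a])    # strongest outsider
        if means[weakest] - bonus[weakest] >= means[challenger] + bonus[challenger]:
            return top                              # confidently separated: stop early
        arm = weakest if bonus[weakest] > bonus[challenger] else challenger
        r = pull(arm)                               # one evaluation (LLM call) per round
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return sorted(range(n_arms), key=lambda a: means[a], reverse=True)[:m]

# Toy usage with synthetic rewards standing in for LLM-scored exemplar subsets.
true = [random.random() for _ in range(20)]
best = top_m_challenger(20, 4, lambda a: true[a] + random.gauss(0, 0.1))
```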
Authors:Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
Abstract:
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with this fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
中文: 本研究针对文本编码器在识别细粒度实体和事件方面的局限性,通过引入CapRetrieval数据集并提出数据生成策略,使小型编码器性能超越大型模型,同时揭示了嵌入语义中的粒度困境。
English: This study addresses the limitation of text encoders in recognizing fine-grained entities and events by introducing the CapRetrieval dataset and proposing data generation strategies that enable a small encoder to outperform larger models, while also identifying the granularity dilemma in embedding semantics.
Authors:Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang
Abstract:
Transformer models exhibit an excellent scaling property, where performance improves as model capacity increases. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of the MLP hidden layer, while preserving weight diversity for better performance recovery during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06\% of the LAION-2B data (used for training large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0\% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5\% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
中文: 提出的多样性引导多层感知机缩减方法通过剪枝冗余神经元并保持权重多样性,以近乎无损的方式将大型视觉变换器的参数和计算量减少超过57%。
English: The proposed Diversity-Guided MLP Reduction (DGMR) method significantly compresses large vision transformers by pruning redundant MLP neurons while preserving weight diversity, achieving over 57% parameter and FLOPs reduction with negligible performance loss.
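A rough sketch of diversity-guided neuron selection via Gram-Schmidt is given below: hidden neurons are kept greedily according to how much of their weight vector remains after projecting out the neurons already kept. This is an illustration under simplifying assumptions, not the released pruning code, which additionally recovers performance through distillation.

```python
# Greedy Gram-Schmidt selection of diverse MLP hidden neurons (illustrative).
import torch

def select_diverse_neurons(W_in, keep_ratio=0.5):
    """W_in: (hidden, d_in) first MLP weight; returns indices of kept neurons."""
    hidden = W_in.shape[0]
    k = max(1, int(hidden * keep_ratio))
    residual = W_in.clone().float()
    kept = []
    for _ in range(k):
        norms = residual.norm(dim=1)
        idx = int(norms.argmax())                    # most "novel" remaining neuron
        kept.append(idx)
        q = residual[idx] / (norms[idx] + 1e-12)     # orthonormal direction
        residual = residual - residual @ q[:, None] * q[None, :]   # Gram-Schmidt step
    return sorted(kept)

W_in = torch.randn(3072, 768)                        # e.g. a ViT MLP up-projection
kept = select_diverse_neurons(W_in, keep_ratio=0.43) # prune ~57% of hidden units
W_in_pruned = W_in[kept]                             # the down-projection is sliced on its columns
```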
Authors:Sunny Gupta, Nikita Jangid, Shounak Das, Amit Sethi
Abstract:
Domain Generalization (DG) seeks to train models that perform reliably on unseen target domains without access to target data during training. While recent progress in smoothing the loss landscape has improved generalization, existing methods often falter under long-tailed class distributions and conflicting optimization objectives. We introduce FedTAIL, a federated domain generalization framework that explicitly addresses these challenges through sharpness-guided, gradient-aligned optimization. Our method incorporates a gradient coherence regularizer to mitigate conflicts between classification and adversarial objectives, leading to more stable convergence. To combat class imbalance, we perform class-wise sharpness minimization and propose a curvature-aware dynamic weighting scheme that adaptively emphasizes underrepresented tail classes. Furthermore, we enhance conditional distribution alignment by integrating sharpness-aware perturbations into entropy regularization, improving robustness under domain shift. FedTAIL unifies optimization harmonization, class-aware regularization, and conditional alignment into a scalable, federated-compatible framework. Extensive evaluations across standard domain generalization benchmarks demonstrate that FedTAIL achieves state-of-the-art performance, particularly in the presence of domain shifts and label imbalance, validating its effectiveness in both centralized and federated settings. Code: https://github.com/sunnyinAI/FedTail
中文摘要:FedTAIL作为一种联邦领域泛化框架,通过锐度感知优化解决了类别不平衡与目标冲突问题,在领域偏移和标签不平衡场景下实现了最先进的性能表现。
English Summary: FedTAIL is a federated domain generalization framework that tackles class imbalance and conflicting objectives through sharpness-aware optimization, achieving state-of-the-art performance under domain shifts and label imbalance.
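As a toy illustration of the gradient coherence idea, the snippet below penalizes the cosine misalignment between the gradients of two objectives on a small model. The stand-in `adv_loss` and the penalty weight are hypothetical, and the full method further adds class-wise sharpness minimization and curvature-aware weighting.

```python
# Toy gradient-coherence penalty between two objectives (illustrative only).
import torch

def coherence_penalty(model, loss_a, loss_b):
    """Returns 1 - cosine similarity between the gradients of two losses."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_a = torch.cat([g.flatten() for g in torch.autograd.grad(
        loss_a, params, retain_graph=True, create_graph=True)])
    g_b = torch.cat([g.flatten() for g in torch.autograd.grad(
        loss_b, params, retain_graph=True, create_graph=True)])
    return 1.0 - torch.nn.functional.cosine_similarity(g_a, g_b, dim=0)

model = torch.nn.Linear(10, 3)
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
cls_loss = torch.nn.functional.cross_entropy(model(x), y)
adv_loss = model(x).pow(2).mean()                    # stand-in for an adversarial objective
total = cls_loss + adv_loss + 0.1 * coherence_penalty(model, cls_loss, adv_loss)
total.backward()
```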
Authors:Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
Abstract:
Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to this alignment are strongly linked to harmful direction perturbations, rapidly degrading safety and framing the parameter space as a narrow safety basin. Based on this insight, we propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, ensuring that fine-tuning is constrained within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT
Chinese: 本研究提出AsFT安全微调方法,以对齐方向为锚点将更新约束在狭窄安全区域内,实验表明该方法能有效减少7.60%有害行为并提升3.44%模型性能。
English: The study introduces AsFT, a safety fine-tuning method that uses the alignment direction as an anchor to restrict updates within a narrow safety basin, effectively reducing harmful behavior by 7.60% and improving model performance by 3.44% in experiments.
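The anchoring idea can be sketched as a penalty on the component of the weight update that is orthogonal to the alignment direction (the difference between aligned and unaligned reference weights). The reference state dicts and the penalty weight below are illustrative assumptions, not the released AsFT implementation.

```python
# Schematic penalty suppressing updates orthogonal to the alignment direction.
import torch

def orthogonal_drift_penalty(model, aligned, unaligned):
    penalty = 0.0
    for name, p in model.named_parameters():
        d = (aligned[name] - unaligned[name]).flatten()         # alignment direction
        d = d / (d.norm() + 1e-12)
        delta = (p - aligned[name]).flatten()                    # update so far
        parallel = (delta @ d) * d
        penalty = penalty + (delta - parallel).pow(2).sum()      # suppress orthogonal part
    return penalty

# Toy usage: add the penalty to the fine-tuning loss of each step.
model = torch.nn.Linear(4, 4)
aligned = {k: v.detach().clone() for k, v in model.named_parameters()}
unaligned = {k: v + 0.1 * torch.randn_like(v) for k, v in aligned.items()}
loss = model(torch.randn(2, 4)).sum() + 0.5 * orthogonal_drift_penalty(model, aligned, unaligned)
loss.backward()
```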
Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Abstract:
Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of the loss landscape. However, this comes at the expense of a high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations of their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage. Code is available at https://github.com/hseung88/mac.
中文: 提出的MAC方法高效近似KFAC中Fisher信息矩阵的Kronecker因子,在多种神经网络上实现了更高的精度、更快的训练速度和更低的内存消耗,并且是首个将Kronecker分解应用于Transformer注意力层的算法。
English: The proposed MAC method efficiently approximates the Kronecker factors in KFAC's Fisher information matrix, achieving superior accuracy, faster training, and lower memory usage across various neural networks while being the first to apply Kronecker factorization to transformer attention layers.
Authors:Utkarsh Pratiush, Austin Houston, Kamyar Barakati, Aditya Raghavan, Dasol Yoon, Harikrishnan KP, Zhaslan Baraissov, Desheng Ma, Samuel S. Welborn, Mikolaj Jakowski, Shawn-Patrick Barhorst, Alexander J. Pattison, Panayotis Manganaris, Sita Sirisha Madugula, Sai Venkata Gayathri Ayyagari, Vishal Kennedy, Ralph Bulanadi, Michelle Wang, Kieran J. Pang, Ian Addison-Smith, Willy Menacho, Horacio V. Guzman, Alexander Kiefer, Nicholas Furth, Nikola L. Kolev, Mikhail Petrov, Viktoriia Liu, Sergey Ilyev, Srikar Rairao, Tommaso Rodani, Ivan Pinto-Huguet, Xuli Chen, Josep Cruañes, Marta Torrens, Jovan Pomar, Fanzhi Su, Pawan Vedanti, Zhiheng Lyu, Xingzhi Wang, Lehan Yao, Amir Taqieddin, Forrest Laskowski, Xiangyu Yin, Yu-Tsun Shao, Benjamin Fein-Ashley, Yi Jiang, Vineet Kumar, Himanshu Mishra, Yogesh Paul, Adib Bazgir, Rama chandra Praneeth Madugula, Yuwen Zhang, Pravan Omprakash, Jian Huang, Eric Montufar-Morales, Vivek Chawla, Harshit Sethi, Jie Huang, Lauri Kurki, Grace Guinan, Addison Salvador, Arman Ter-Petrosyan, Madeline Van Winkle, Steven R. Spurgeon, Ganesh Narasimha, Zijie Wu, Richard Liu, Yongtao Liu, Boris Slautin, Andrew R Lupini, Rama Vasudevan, Gerd Duscher, Sergei V. Kalinin
Abstract:
Microscopy is a primary source of information on materials structure and functionality at nanometer and atomic scales. The data generated is often well-structured, enriched with metadata and sample histories, though not always consistent in detail or format. The adoption of Data Management Plans (DMPs) by major funding agencies promotes preservation and access. However, deriving insights remains difficult due to the lack of standardized code ecosystems, benchmarks, and integration strategies. As a result, data usage is inefficient and analysis time is extensive. In addition to post-acquisition analysis, new APIs from major microscope manufacturers enable real-time, ML-based analytics for automated decision-making and ML-agent-controlled microscope operation. Yet, a gap remains between the ML and microscopy communities, limiting the impact of these methods on physics, materials discovery, and optimization. Hackathons help bridge this divide by fostering collaboration between ML researchers and microscopy experts. They encourage the development of novel solutions that apply ML to microscopy, while preparing a future workforce for instrumentation, materials science, and applied ML. This hackathon produced benchmark datasets and digital twins of microscopes to support community growth and standardized workflows. All related code is available at GitHub: https://github.com/KalininGroup/Mic-hackathon-2024-codes-publication/tree/1.0.0.1
中文: 显微镜技术产生丰富但格式不一的数据,缺乏标准化工具和集成策略阻碍了高效分析,而黑客松活动通过连接机器学习与显微学领域,推动了基准数据集和数字孪生等解决方案的开发。
English: Microscopy generates rich but inconsistently formatted data, where the lack of standardized tools and integration hinders efficient analysis, though hackathons bridge ML and microscopy communities to develop solutions like benchmarks and digital twins.
Authors:Edoardo Cetin, Tianyu Zhao, Yujin Tang
Abstract:
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
中文摘要:本文提出强化学习教师(RLT)框架,通过训练模型生成针对下游学生的详细解释来规避强化学习的探索难题,相比传统方法能以更小模型实现更优性能,并保持跨任务的有效性。
English Summary: The paper introduces Reinforcement-Learned Teachers (RLTs), a framework that trains reasoning models to generate detailed explanations for distillation, overcoming exploration challenges in RL while achieving superior performance with smaller models compared to traditional methods.
Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Abstract:
Adaptive gradient methods are computationally efficient and converge quickly, but they often suffer from poor generalization. In contrast, second-order methods enhance convergence and generalization but typically incur high computational and memory costs. In this work, we introduce NysAct, a scalable first-order gradient preconditioning method that strikes a balance between state-of-the-art first-order and second-order optimization methods. NysAct leverages an eigenvalue-shifted Nystrom method to approximate the activation covariance matrix, which is used as a preconditioning matrix, significantly reducing time and memory complexities with minimal impact on test accuracy. Our experiments show that NysAct not only achieves improved test accuracy compared to both first-order and second-order methods but also demands considerably less computational resources than existing second-order methods. Code is available at https://github.com/hseung88/nysact.
中文摘要:NysAct是一种可扩展的一阶梯度预处理方法,通过近似激活协方差矩阵平衡了计算效率与测试精度,在泛化性能上优于一阶和二阶方法,同时显著降低了资源消耗。
English Summary: NysAct is a scalable first-order gradient preconditioning method that balances computational efficiency with improved test accuracy by approximating the activation covariance matrix, outperforming both first-order and second-order methods in generalization while using fewer resources.
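A compact sketch of the core computation is shown below: the activation covariance is approximated with an eigenvalue-shifted Nystrom factorization built from a few landmark columns, and the layer gradient is preconditioned through the Woodbury identity so that only a small k x k system is solved. Landmark selection, the shift, and the damping value are illustrative choices, not the released NysAct code.

```python
# Nystrom-approximated activation-covariance preconditioning (illustrative).
import torch

def nystrom_precondition(grad, acts, k=32, rho=1e-2, shift=1e-3):
    """grad: (d_out, d_in) layer gradient; acts: (n, d_in) layer inputs."""
    n, d = acts.shape
    cov = acts.T @ acts / n                              # activation covariance (d, d)
    idx = torch.randperm(d)[:k]                          # random landmark columns
    C = cov[:, idx]                                      # (d, k)
    W = cov[idx][:, idx] + shift * torch.eye(k)          # eigenvalue-shifted core block
    # (rho*I + C W^{-1} C^T)^{-1} via the Woodbury identity: only a k x k solve.
    inner = W + C.T @ C / rho                            # (k, k)
    return grad / rho - (grad @ C) @ torch.linalg.solve(inner, C.T) / rho**2

grad = torch.randn(64, 128)
acts = torch.randn(256, 128)
preconditioned = nystrom_precondition(grad, acts)        # used in place of the raw gradient
```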
Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Abstract:
We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization -- a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct's competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at https://github.com/hseung88/adaact.
中文: AdaAct是一种新颖的优化算法,通过根据激活方差调整学习率来提高训练稳定性和泛化能力,有效弥合了Adam与SGD之间的性能差距,同时保持高效运行。
English: AdaAct is a novel optimization algorithm that enhances training stability and generalization by adapting learning rates based on activation variance, effectively bridging the performance gap between Adam and SGD while maintaining competitive efficiency.
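The neuron-wise adaptivity can be illustrated with a toy optimizer that tracks a running per-neuron activation variance through a forward hook and scales each output neuron's update by its inverse square root. This is a simplified sketch for a single linear layer, not the released AdaAct implementation.

```python
# Toy optimizer scaling updates by per-neuron activation variance (illustrative).
import torch

class ActVarScaledSGD:
    def __init__(self, layer, lr=0.1, beta=0.99, eps=1e-8):
        self.layer, self.lr, self.beta, self.eps = layer, lr, beta, eps
        self.var = torch.ones(layer.out_features)
        layer.register_forward_hook(self._track)

    def _track(self, module, inputs, output):
        batch_var = output.detach().var(dim=0)            # per-neuron activation variance
        self.var = self.beta * self.var + (1 - self.beta) * batch_var

    @torch.no_grad()
    def step(self):
        scale = (self.var + self.eps).rsqrt()[:, None]    # neuron-wise step size
        self.layer.weight -= self.lr * scale * self.layer.weight.grad
        self.layer.bias -= self.lr * scale.squeeze(1) * self.layer.bias.grad
        self.layer.weight.grad = None
        self.layer.bias.grad = None

layer = torch.nn.Linear(10, 5)
opt = ActVarScaledSGD(layer)
loss = layer(torch.randn(32, 10)).pow(2).mean()
loss.backward()
opt.step()
```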
Authors:Yinan Huang, Haoteng Yin, Eli Chien, Rongzhe Wei, Pan Li
Abstract:
Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach. Our code is available at https://github.com/Graph-COM/Node_DP.
中文摘要:本文提出了一个具有形式化实体级差分隐私保障的关系学习框架,通过自适应梯度裁剪和针对耦合采样的扩展分析,解决了敏感性控制和隐私放大难题。
English Summary: This paper introduces a principled framework for relational learning with formal entity-level differential privacy guarantees, addressing challenges in sensitivity control and privacy amplification through adaptive gradient clipping and extended analysis for coupled sampling.
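The adaptive clipping idea can be sketched as follows: each entity's gradient contribution is clipped with a threshold that shrinks as the entity's occurrence frequency grows, before Gaussian noise is added to the sum. The scaling rule and noise calibration here are illustrative placeholders rather than the paper's calibrated DP mechanism.

```python
# Schematic frequency-adaptive per-entity clipping before noise addition.
import torch

def clip_and_noise(per_entity_grads, entity_counts, base_clip=1.0, sigma=1.0):
    """per_entity_grads: dict entity -> flattened gradient contribution."""
    summed = None
    for e, g in per_entity_grads.items():
        c_e = base_clip / max(1.0, float(entity_counts[e])) ** 0.5   # tighter clip for frequent entities
        g = g * torch.clamp(c_e / (g.norm() + 1e-12), max=1.0)       # per-entity clipping
        summed = g if summed is None else summed + g
    noise = torch.randn_like(summed) * sigma * base_clip             # Gaussian mechanism
    return summed + noise

grads = {f"user_{i}": torch.randn(1000) for i in range(8)}
counts = {f"user_{i}": i + 1 for i in range(8)}
noisy = clip_and_noise(grads, counts)
```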
Authors:Alan N. Amin, Nate Gruver, Andrew Gordon Wilson
Abstract:
Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models - schedule-conditioned discrete diffusion (SCUD) - generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking.
中文摘要:该研究通过揭示离散马尔可夫过程中跳跃时间分布的已知特性解释了掩码扩散的优越性,并提出SCUD模型将这一机制融入各类离散扩散过程,在图像、文本和蛋白质数据上实现了性能突破。
English Summary: The study explains that masking diffusion excels in discrete diffusion models by leveraging the known distribution of jump times in discrete Markov processes, and introduces SCUD models that incorporate this insight to outperform existing methods across various data types.
Authors:Katherine Tieu, Dongqi Fu, Zihao Li, Ross Maciejewski, Jingrui He
Abstract:
Accurate predictions rely on the expressive power of graph deep learning frameworks like graph neural networks and graph transformers, where a positional encoding mechanism has become indispensable in recent state-of-the-art works to record canonical position information. However, current positional encoding is limited in three aspects: (1) most positional encoding methods use pre-defined and fixed functions, which are inadequate to adapt to complex attributed graphs; (2) a few pioneering works have proposed learnable positional encoding but are still limited to structural information, not considering the real-world time-evolving topological and feature information; (3) most positional encoding methods are equipped with transformers' attention mechanism to fully leverage their capabilities, where dense or relational attention is often unaffordable on large-scale structured data. Hence, we aim to develop Learnable Spatial-Temporal Positional Encoding in an effective and efficient manner and propose a simple temporal link prediction model named L-STEP. Briefly, for L-STEP, we (1) prove that the proposed positional learning scheme can preserve the graph property from the spatial-temporal spectral viewpoint, (2) verify that MLPs can fully exploit the expressiveness of that encoding and reach transformers' performance, (3) change different initial positional encoding inputs to show robustness, (4) analyze the theoretical complexity and obtain less empirical running time than SOTA, and (5) demonstrate its superior temporal link prediction performance against 10 algorithms on 13 classic datasets in both transductive and inductive settings using 3 different sampling strategies. Also, L-STEP obtains the leading performance in the newest large-scale TGB benchmark. Our code is available at https://github.com/kthrn22/L-STEP.
Chinese: 当前图学习中的位置编码方法在适应性、动态信息整合和计算效率方面存在局限,因此我们开发了L-STEP这一可学习的时空模型,该模型在多个数据集和设置下展现出卓越的性能、效率及鲁棒性。
English: Current positional encoding methods in graph learning face limitations in adaptability, dynamic information integration, and computational efficiency, prompting the development of L-STEP, a learnable spatial-temporal model that demonstrates superior performance, efficiency, and robustness across multiple datasets and settings.
Authors:Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han
Abstract:
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, where an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.
中文摘要:AR-Bench是一个专为评估大语言模型主动推理能力设计的新基准,揭示了模型在获取和利用外部信息方面相比被动推理存在显著困难,并强调了改进交互学习等方法的必要性。
English Summary: AR-Bench is a new benchmark designed to evaluate large language models' active reasoning skills, revealing their significant struggles in acquiring and using external information compared to passive reasoning, and highlighting the need for improved methodologies like interactive learning.
Authors:Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, Bo Han
Abstract:
Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON's ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON.
Chinese: ECON通过将多智能体协调建模为不完全信息博弈,提出了一种结合分布式推理与集中输出的分层强化学习范式,在显著降低通信成本的同时实现了更强的性能与理论保障。
English: ECON introduces a hierarchical reinforcement-learning framework that models multi-LLM coordination as an incomplete-information game, achieving superior performance with a tighter regret bound and eliminating costly inter-agent communication.
Authors:Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
Abstract:
Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, the degree to which representations derived from MLLMs predict neural activity recorded while participants were watching naturalistic movies (video along with audio). We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
中文: 指令调优的多模态大语言模型在自然观影场景中显著优于非指令调优模型,其任务特定表征能精确对应大脑层级处理机制,为大脑与计算系统的映射研究开辟了新途径。
English: Instruction-tuned multimodal large language models significantly outperform non-instruction-tuned models in aligning with brain activity during naturalistic movie viewing, showing hierarchical layer correspondence and task-specific representation disentanglement that advances brain-computation mapping.
Authors:Sunny Gupta, Nikita Jangid, Amit Sethi
Abstract:
Federated Learning (FL) often suffers from severe performance degradation when faced with non-IID data, largely due to local classifier bias. Traditional remedies such as global model regularization or layer freezing either incur high computational costs or struggle to adapt to feature shifts. In this work, we propose UniVarFL, a novel FL framework that emulates IID-like training dynamics directly at the client level, eliminating the need for global model dependency. UniVarFL leverages two complementary regularization strategies during local training: Classifier Variance Regularization, which aligns class-wise probability distributions with those expected under IID conditions, effectively mitigating local classifier bias; and Hyperspherical Uniformity Regularization, which encourages a uniform distribution of feature representations across the hypersphere, thereby enhancing the model's ability to generalize under diverse data distributions. Extensive experiments on multiple benchmark datasets demonstrate that UniVarFL outperforms existing methods in accuracy, highlighting its potential as a highly scalable and efficient solution for real-world FL deployments, especially in resource-constrained settings. Code: https://github.com/sunnyinAI/UniVarFL
Chinese: UniVarFL提出了一种新颖的联邦学习框架,通过双重正则化策略有效缓解本地分类器偏差并提升特征泛化能力,在非独立同分布数据场景下实现了更优的准确性与可扩展性。
English: UniVarFL introduces a novel federated learning framework that employs dual regularization strategies to mitigate local classifier bias and enhance feature generalization, achieving superior accuracy and scalability in non-IID settings.
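One of the two regularizers, hyperspherical uniformity, can be written compactly as the standard uniformity loss on L2-normalized features. The sketch below uses the common t=2 convention and omits the classifier-variance term; it is an illustration, not the released UniVarFL code.

```python
# Hyperspherical uniformity penalty on L2-normalized features (illustrative).
import torch
import torch.nn.functional as F

def uniformity_loss(features, t=2.0):
    z = F.normalize(features, dim=1)                       # project onto the unit hypersphere
    sq_dists = (2.0 - 2.0 * z @ z.T).clamp(min=0)          # ||z_i - z_j||^2 for unit vectors
    off_diag = ~torch.eye(len(z), dtype=torch.bool)
    return sq_dists[off_diag].mul(-t).exp().mean().log()   # low when points spread out

feats = torch.randn(128, 64, requires_grad=True)
loss = uniformity_loss(feats)                              # added to the local training loss
loss.backward()
```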
Authors:Hadi Reisizadeh, Jinghan Jia, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Sijia Liu, Mingyi Hong
Abstract:
Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there are growing interests in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain loss, but it often leads to performance degradation due to the inherent trade-off between forget and retain losses. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which \textit{unlearns} certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model's utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (\texttt{BLUR}), which not only possesses strong theoretical guarantees but more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that \texttt{BLUR} consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics. Codes are available at https://github.com/OptimAI-Lab/BLURLLMUnlearning.
中文: 本文针对大语言模型的知识消除问题提出分层双级优化框架,通过BLUR算法优先实现特定知识遗忘并保持模型性能,在多项任务中显著优于现有方法。
English: This paper introduces a hierarchical bi-level optimization approach for large language model unlearning, proposing the BLUR algorithm that prioritizes forgetting specific knowledge while maintaining model utility, which significantly outperforms existing methods across various tasks.
Authors:Ziheng Qin, Hailun Xu, Wei Chee Yew, Qi Jia, Yang Luo, Kanchan Sarkar, Danhui Guan, Kai Wang, Yang You
Abstract:
Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. A fundamental yet unsolved question is: given our current model and data, does a new data sample or batch need to be annotated or learned? Conventional approaches retain all available data, leading to suboptimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, but it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a novel framework that efficiently enables models and data to coevolve through online selective annotation with no bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32\% without performance loss. It automatically determines the saving ratio without manual tuning, and can further reduce the annotation ratio to 50\% with semi-supervised learning. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code is available at https://github.com/NUS-HPC-AI-Lab/Info-Coevolution/.
Chinese: 提出的Info-Coevolution框架通过选择性在线标注实现模型与数据的协同进化,在ImageNet-1K数据集上无需性能损失即可降低32%的标注成本,同时消除偏差且无需手动调整比例。
English: The proposed Info-Coevolution framework enables efficient model-data coevolution through selective online annotation, reducing ImageNet-1K annotation costs by 32% without performance loss while eliminating bias and manual ratio tuning.
Authors:Ye Zhu, Duo Xu, Zhiwei Deng, Jonathan C. Tan, Olga Russakovsky
Abstract:
We study Diffusion Schrödinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and in real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models' learning abilities beyond pure data statistics, paving a path for future physics-aware generative models which can align dynamics between machine learning and real (astro)physical systems.
中文: Astro-DSB模型作为针对天体物理系统定制的扩散薛定谔桥变体,在恒星形成研究中提升了可解释性与预测能力,并证明了生成模型在处理未知天体物理条件时的优越性。
English: The Astro-DSB model, a tailored Diffusion Schrödinger Bridge variant, enhances interpretability and prediction in star formation studies while demonstrating generative modeling's superiority in handling unseen astrophysical conditions.
Authors:Songqiao Hu, Zeyi Liu, Xiao He
Abstract:
The change in data distribution over time, also known as concept drift, poses a significant challenge to the reliability of online learning methods. Existing methods typically require model retraining or drift detection, both of which demand high computational costs and are often unsuitable for real-time applications. To address these limitations, a lightweight, fast and efficient random vector functional-link network termed Lite-RVFL is proposed, capable of adapting to concept drift without drift detection and retraining. Lite-RVFL introduces a novel objective function that assigns weights exponentially increasing to new samples, thereby emphasizing recent data and enabling timely adaptation. Theoretical analysis confirms the feasibility of this objective function for drift adaptation, and an efficient incremental update rule is derived. Experimental results on a real-world safety assessment task validate the efficiency, effectiveness in adapting to drift, and potential to capture temporal patterns of Lite-RVFL. The source code is available at https://github.com/songqiaohu/Lite-RVFL.
中文: Lite-RVFL是一种轻量级网络,通过指数加权函数强调新数据来适应概念漂移,无需重新训练或漂移检测,并在实际应用中验证了其高效性。
English: Lite-RVFL is a lightweight network that adapts to concept drift by emphasizing recent data through an exponential weighting function, eliminating the need for retraining or drift detection while proving efficient in real-world applications.
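The exponential weighting idea can be sketched with a small RVFL whose output weights are fit by weighted ridge regression, where each sample's weight grows geometrically with its recency. The paper derives an incremental update rule; this illustration simply recomputes the closed-form solution, and all sizes and constants are illustrative.

```python
# RVFL with exponentially increasing sample weights (illustrative numpy sketch).
import numpy as np

rng = np.random.default_rng(0)

def rvfl_fit(X, y, n_hidden=100, lam=1.02, ridge=1e-3):
    """lam > 1 gives exponentially larger weights to more recent samples."""
    W_r = rng.normal(size=(X.shape[1], n_hidden))           # fixed random hidden weights
    b_r = rng.normal(size=n_hidden)
    H = np.hstack([X, np.tanh(X @ W_r + b_r)])              # direct links + random features
    w = lam ** np.arange(len(X))                             # newest sample gets largest weight
    A = H.T @ (w[:, None] * H) + ridge * np.eye(H.shape[1])
    beta = np.linalg.solve(A, H.T @ (w * y))                 # weighted ridge solution
    return W_r, b_r, beta

def rvfl_predict(X, W_r, b_r, beta):
    H = np.hstack([X, np.tanh(X @ W_r + b_r)])
    return H @ beta

X, y = rng.normal(size=(500, 8)), rng.normal(size=500)
params = rvfl_fit(X, y)
y_hat = rvfl_predict(X[-10:], *params)
```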
Authors:Yiming Wang, Hao Peng, Senzhang Wang, Haohua Du, Chunyang Liu, Jia Wu, Guanlin Wu
Abstract:
Traffic data imputation is fundamentally important to support various applications in intelligent transportation systems such as traffic flow prediction. However, existing time-to-space sequential methods often fail to effectively extract features in block-wise missing data scenarios. Meanwhile, the static graph structure for spatial feature propagation significantly constrains the model's flexibility in handling the distribution shift issue for nonstationary traffic data. To address these issues, this paper proposes a SpatioTemporal Attention Mixture of experts network named STAMImputer for traffic data imputation. Specifically, we introduce a Mixture of Experts (MoE) framework to capture latent spatio-temporal features and their influence weights, effectively imputing block missing data. A novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism is designed to dynamically balance the local and global correlations across road networks. The sampled attention vectors are utilized to generate dynamic graphs that capture real-time spatial correlations. Extensive experiments are conducted on four traffic datasets for evaluation. The results show that STAMImputer achieves significant performance improvements compared with existing SOTA approaches. Our codes are available at https://github.com/RingBDStack/STAMImupter.
Chinese: 本文提出STAMImputer模型,通过动态图生成和专家混合框架有效处理交通数据中的块状缺失问题,在四个数据集上的实验表明其性能显著优于现有最优方法。
English: This paper introduces STAMImputer, a SpatioTemporal Attention Mixture of experts network that effectively handles block-wise missing traffic data through dynamic graph generation and achieves superior performance over existing methods.
Authors:Anh-Quan Cao, Ivan Lopes, Raoul de Charette
Abstract:
Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
中文摘要:StableMTL提出了一种基于扩散模型的零样本多任务学习方法,利用部分标注的合成数据集进行训练,通过统一潜在损失和任务注意力机制促进任务间协同,在多个基准测试中超越基线模型。
English Summary: StableMTL introduces a zero-shot multi-task learning method using diffusion models that trains on synthetic datasets with partial labels and employs a unified latent loss with task-attention for cross-task synergy, outperforming baselines across multiple benchmarks.
Authors:Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Abstract:
Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at https://github.com/boyazeng/weight_memorization.
中文: 本研究发现,现有生成模型在合成神经网络权重时主要依赖记忆或简单插值训练检查点,未能超越基础方法,这源于数据限制及结构先验利用不足。
English: This study finds that current generative models for synthesizing neural network weights primarily produce memorized or interpolated versions of training checkpoints, failing to outperform simple baselines due to data limitations and inadequate use of structural priors.
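A quick way to probe for this kind of memorization is to compare each generated weight vector's distance to its nearest training checkpoint against the typical distance between distinct checkpoints, as in the hypothetical check below (flattened weights assumed; not the paper's evaluation code).

```python
# Nearest-checkpoint distance check for weight-generation memorization (illustrative).
import torch

def novelty_report(generated, checkpoints):
    """generated: (g, d) flattened sampled weights; checkpoints: (c, d) training data."""
    d_gen = torch.cdist(generated, checkpoints).min(dim=1).values      # nearest-checkpoint gap
    d_ckpt = torch.cdist(checkpoints, checkpoints)
    d_ckpt = d_ckpt[~torch.eye(len(checkpoints), dtype=torch.bool)]    # off-diagonal only
    return {"mean_nn_dist": d_gen.mean().item(),
            "mean_ckpt_dist": d_ckpt.mean().item(),
            "ratio": (d_gen.mean() / d_ckpt.mean()).item()}            # near 0 => likely replicas

ckpts = torch.randn(50, 10_000)
samples = ckpts[torch.randint(0, 50, (20,))] + 0.01 * torch.randn(20, 10_000)
print(novelty_report(samples, ckpts))   # a small ratio flags memorized/interpolated samples
```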
Authors:Jacob Helwig, Sai Sreeharsha Adavi, Xuan Zhang, Yuchao Lin, Felix S. Chim, Luke Takeshi Vizzini, Haiyang Yu, Muhammad Hasnain, Saykat Kumar Biswas, John J. Holloway, Narendra Singh, N. K. Anand, Swagnik Guhathakurta, Shuiwang Ji
Abstract:
We consider the problem of modeling high-speed flows using machine learning methods. While most prior studies focus on low-speed fluid flows in which uniform time-stepping is practical, flows approaching and exceeding the speed of sound exhibit sudden changes such as shock waves. In such cases, it is essential to use adaptive time-stepping methods to allow a temporal resolution sufficient to resolve these phenomena while simultaneously balancing computational costs. Here, we propose a two-phase machine learning method, known as ShockCast, to model high-speed flows with adaptive time-stepping. In the first phase, we propose to employ a machine learning model to predict the timestep size. In the second phase, the predicted timestep is used as an input along with the current fluid fields to advance the system state by the predicted timestep. We explore several physically-motivated components for timestep prediction and introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts. As ShockCast is the first framework for learning high-speed flows, we evaluate our methods by generating two supersonic flow datasets, available at https://huggingface.co/datasets/divelab. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
中文: 本研究提出ShockCast,一种两阶段机器学习框架,通过自适应时间步长来高效模拟含激波的高速流动,利用时间步长预测和条件策略解决计算挑战。
English: This study introduces ShockCast, a two-phase machine learning framework that employs adaptive time-stepping to efficiently model high-speed flows with shock waves, addressing computational challenges through timestep prediction and conditioning strategies.
Authors:Christopher Subia-Waud
Abstract:
Current AutoML platforms leave substantial performance untapped. Testing 180 fine-tuning tasks across models from 70M to 70B parameters, we found that HuggingFace AutoTrain, TogetherAI, Databricks, and Google Cloud consistently produce suboptimal configurations. Gradients, built on the Bittensor network, attacks this problem through competition. Independent miners race to find optimal hyperparameters, earning rewards proportional to their models' performance. This tournament drives exploration of configuration spaces that single-strategy methods never examine. In our experiments, Gradients achieved a 100\% win rate against TogetherAI, Databricks, and Google Cloud, and beat HuggingFace AutoTrain in 82.8\% of experiments. Mean improvements reached 42.1\% against commercial platforms. Retrieval-augmented generation tasks saw 30-40\% gains; diffusion models improved 23.4\% on person-specific generation. When miners compete for rewards, they develop optimization strategies that centralized approaches overlook. These findings demonstrate that decentralized systems with economic incentives can systematically outperform traditional AutoML, suggesting market dynamics may be key to achieving superior fine-tuning results. Code is available at https://github.com/rayonlabs/G.O.D.
中文: 现有AutoML平台表现不佳,而基于Bittensor网络的去中心化系统Gradients通过竞争性矿工的经济激励,在超参数优化上表现卓越,平均提升达42.1%,显著优于传统方法。
English: Current AutoML platforms underperform; Gradients, a decentralized system of competing miners on the Bittensor network, uses economic incentives to drive hyperparameter search, achieving a 100% win rate against TogetherAI, Databricks, and Google Cloud and mean improvements of 42.1% over commercial platforms.
Authors:Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, Pan Lu
Abstract:
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
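中文: 本文将不等式证明重构为界估计和关系预测两个可自动验证的子任务,发布了奥赛级数据集IneqMath及LLM-as-judge评测框架,并发现即使顶尖大模型在逐步审查下整体准确率也不足10%,揭示了其在构建严格证明方面的重大缺口。
English: This paper recasts inequality proving into two automatically checkable subtasks, releases the Olympiad-level IneqMath dataset with an LLM-as-judge evaluation framework, and finds that even top LLMs achieve under 10% accuracy under step-wise scrutiny, exposing a gap between finding answers and constructing rigorous proofs.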
Authors:Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, Rahul G. Krishnan
Abstract:
Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out-of-the-box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model does not require any further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN).
中文: CausalPFN是一种基于Transformer的模型,通过模拟数据训练实现观测数据的自动因果效应估计,无需手动调整即可在基准测试中取得优异性能,并提供基于贝叶斯原则的校准不确定性评估以支持可靠决策。
English: CausalPFN is a transformer-based model trained on simulated data to provide automated causal effect estimation for observational datasets without requiring manual tuning, delivering superior performance on benchmarks and calibrated uncertainty estimates for reliable decision-making.
Authors:Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X. -F. Ye, Molei Tao
Abstract:
Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. On the contrary, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process heavily demands the high accuracy of encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Chinese: 该框架通过解耦噪声调度,使多模态扩散模型能够原生生成跨模态的耦合数据,无需依赖外部预处理,并在单一模型中同时支持无条件生成和条件生成。
English: The proposed framework enables multimodal diffusion models to generate coupled data natively across different modalities without relying on external preprocessing, using a decoupled noise schedule to support both unconditional and conditioned generation within a single model.
Authors:Sifan Wang, Zehao Dou, Tong-Rui Liu, Lu Lu
Abstract:
Recent advances in generative modeling -- particularly diffusion models and flow matching -- have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce $\textbf{FunDiff}$, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at https://github.com/sifanexisted/fundiff.
中文: FunDiff是一种创新的函数空间生成框架,它结合了潜在扩散与函数自编码器,能够生成连续且符合物理规律的函数,在多种物理应用中实现了理论最优性和鲁棒性。
English: FunDiff is a novel generative framework for function spaces that integrates latent diffusion with function autoencoders to generate continuous, physically consistent functions while ensuring theoretical optimality and robustness across various physical applications.
Authors:Jinxi Li, Ziyang Song, Siyuan Zhou, Bo Yang
Abstract:
In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. In this paper, we propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on the inefficient PINN losses. Extensive experiments on three public datasets and a newly collected challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels in training.
中文: 本文提出FreeGave方法,无需任何物体先验知识,仅通过多视角视频学习复杂动态3D场景的物理特性,采用物理编码和无散度模块估计速度场,在帧预测和运动分割任务中展现出优越性能。
English: This paper introduces FreeGave, a method that learns the physics of complex dynamic 3D scenes from multi-view videos without requiring object priors, using a physics code and divergence-free module to estimate velocity fields and demonstrating superior performance in future frame extrapolation and motion segmentation.
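As background for the divergence-free module mentioned above, one standard way to obtain a velocity field with zero divergence is to take the curl of a learned vector potential. The sketch below illustrates only that general construction; the network `potential` and all shapes are assumptions, not FreeGave's actual per-Gaussian module.

```python
# One way to parameterize a divergence-free velocity field: v = curl(A) for a
# learned vector potential A(x), which guarantees div(v) = 0 by construction.
import torch

potential = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, 3))

def divergence_free_velocity(x: torch.Tensor) -> torch.Tensor:
    """v = curl(A), computed with autograd; x has shape (N, 3)."""
    x = x.requires_grad_(True)
    A = potential(x)                                   # (N, 3) vector potential
    grads = [torch.autograd.grad(A[:, i].sum(), x, create_graph=True)[0]
             for i in range(3)]                        # grads[i][:, j] = dA_i/dx_j
    curl = torch.stack([
        grads[2][:, 1] - grads[1][:, 2],               # dAz/dy - dAy/dz
        grads[0][:, 2] - grads[2][:, 0],               # dAx/dz - dAz/dx
        grads[1][:, 0] - grads[0][:, 1],               # dAy/dx - dAx/dy
    ], dim=-1)
    return curl

v = divergence_free_velocity(torch.rand(5, 3))
print(v.shape)  # torch.Size([5, 3])
```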
Authors:Zihui Zhang, Weisheng Dai, Hongtao Wen, Bo Yang
Abstract:
We study the problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem into learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two indoor and an outdoor datasets show that our LogoSP surpasses all existing unsupervised methods by large margins, achieving the state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.
中文: 本文提出LogoSP方法,通过结合局部和全局点特征,在频域中依据全局模式对超点进行分组以生成精确的语义伪标签,在无需人工标注的情况下实现了无监督3D语义分割的最优性能。
English: This paper introduces LogoSP, an unsupervised 3D semantic segmentation method that learns from both local and global point features by grouping superpoints based on frequency-domain patterns to generate accurate pseudo-labels, achieving state-of-the-art performance on multiple datasets without human annotations.
Authors:Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
Abstract:
Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm
Chinese: 大语言模型受限于逐词预测,但新提出的概念感知微调方法通过多词学习增强了概念理解能力,在多项任务中表现优异。
English: Large language models are limited by next-token prediction, but the new Concept-Aware Fine-Tuning method enables multi-token learning for improved conceptual understanding across various tasks.
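To make the multi-token idea concrete, the sketch below shows a generic multi-token prediction loss in which auxiliary heads score tokens several positions ahead. It is an illustration of the general setting, not the authors' CAFT objective; `heads`, `horizon`, and the tensor shapes are assumptions.

```python
# Generic multi-token prediction loss sketch: head i predicts the token i+1 steps
# ahead, so training signals span multi-token sequences rather than single tokens.
import torch
import torch.nn.functional as F

def multi_token_loss(hidden, heads, labels, horizon=3):
    """hidden: (B, T, D) final hidden states; heads: list of nn.Linear(D, V);
    labels: (B, T) token ids."""
    total = 0.0
    for i, head in enumerate(heads[:horizon]):
        shift = i + 1
        logits = head(hidden[:, :-shift, :])           # (B, T-shift, V)
        target = labels[:, shift:]                     # (B, T-shift)
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    return total / min(horizon, len(heads))

B, T, D, V = 2, 16, 32, 100
hidden = torch.randn(B, T, D)
heads = [torch.nn.Linear(D, V) for _ in range(3)]
labels = torch.randint(0, V, (B, T))
print(multi_token_loss(hidden, heads, labels))
```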
Authors:Jie Bao, Chuangyin Dang, Rui Luo, Hanwei Zhang, Zhixin Zhou
Abstract:
As deep learning models are increasingly deployed in high-risk applications, robust defenses against adversarial attacks and reliable performance guarantees become paramount. Moreover, accuracy alone does not provide sufficient assurance or reliable uncertainty estimates for these models. This study advances adversarial training by leveraging principles from Conformal Prediction. Specifically, we develop an adversarial attack method, termed OPSA (OPtimal Size Attack), designed to reduce the efficiency of conformal prediction at any significance level by maximizing model uncertainty without requiring coverage guarantees. Correspondingly, we introduce OPSA-AT (Adversarial Training), a defense strategy that integrates OPSA within a novel conformal training paradigm. Experimental evaluations demonstrate that our OPSA attack method induces greater uncertainty compared to baseline approaches for various defenses. Conversely, our OPSA-AT defensive model significantly enhances robustness not only against OPSA but also other adversarial attacks, and maintains reliable prediction. Our findings highlight the effectiveness of this integrated approach for developing trustworthy and resilient deep learning models for safety-critical domains. Our code is available at https://github.com/bjbbbb/Enhancing-Adversarial-Robustness-with-Conformal-Prediction.
中文: 本研究提出了OPSA对抗性攻击方法以增加模型不确定性,并开发了OPSA-AT防御策略,通过整合对抗训练与保形预测来提升模型鲁棒性并保持可靠预测性能。
English: This study introduces OPSA, an adversarial attack that increases model uncertainty, and OPSA-AT, a defense strategy using conformal training to enhance robustness and maintain reliable predictions in deep learning models.
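For readers unfamiliar with the "efficiency" being attacked, the sketch below computes standard split conformal prediction sets from softmax scores; their average size is the quantity OPSA aims to inflate. This is textbook conformal prediction with synthetic scores, not the OPSA attack or the OPSA-AT defense.

```python
# Split conformal prediction sets: calibrate a score threshold on held-out data,
# then include every class whose nonconformity score falls below it.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs/test_probs: (n, K) softmax scores; cal_labels: (n,) int labels."""
    n = len(cal_labels)
    # nonconformity score: 1 - probability assigned to the true class
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return (1.0 - test_probs) <= q            # boolean (m, K) membership matrix

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = cal_probs.argmax(axis=1)
test_probs = rng.dirichlet(np.ones(5), size=10)
sets = conformal_sets(cal_probs, cal_labels, test_probs)
print("average prediction-set size:", sets.sum(axis=1).mean())
```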
Authors:Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
Abstract:
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at https://github.com/yannqi/RCTS-RAG.
Chinese: 本研究提出名为RCTS的多模态RAG框架,通过构建富含推理背景的知识库和树搜索重排序方法增强大视觉语言模型,在多项VQA任务中实现了最先进的性能表现。
English: This study introduces RCTS, a multimodal RAG framework that enhances Large Vision Language Models by enriching the knowledge base with reasoning contexts and employing a tree search re-ranking method, achieving state-of-the-art performance on VQA tasks.
Authors:Mohamed Djilani, Nassim Ali Ousalah, Nidhal Eddine Chenni
Abstract:
We introduce a trend-aware and visually-grounded fashion recommendation system that integrates deep visual representations, garment-aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non-garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet-50, DenseNet-121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user-specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet-50 achieving 64.95% category similarity and lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at https://github.com/meddjilani/FashionRecommender
中文: 该系统通过融合服装视觉分析与用户行为模拟,实现了兼顾个人风格与流行趋势的个性化时尚推荐。
English: This system combines visual garment analysis with user behavior simulation to deliver personalized fashion recommendations that balance individual style with current trends.
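A minimal sketch of the weighted scoring function described in the abstract is shown below; the weights, the popularity-alignment term, and all inputs are illustrative assumptions rather than the repository's actual implementation.

```python
# Fuse visual similarity, category coherence, and popularity alignment with
# tunable weights, then return the top-k items (all values are synthetic).
import numpy as np

def recommend(query_vec, item_vecs, same_category, popularity, user_trendiness,
              w_visual=0.6, w_category=0.25, w_popularity=0.15, top_k=5):
    # cosine similarity between the query garment embedding and catalog items
    visual = item_vecs @ query_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    # popularity alignment: trend-sensitive users are pushed toward popular items
    pop_align = 1.0 - np.abs(popularity - user_trendiness)
    score = w_visual * visual + w_category * same_category + w_popularity * pop_align
    return np.argsort(-score)[:top_k]

rng = np.random.default_rng(1)
items = rng.normal(size=(100, 128))
query = rng.normal(size=128)
print(recommend(query, items, same_category=rng.integers(0, 2, 100),
                popularity=rng.random(100), user_trendiness=0.8))
```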
Authors:Adam Breuer
Abstract:
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) -- exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Chinese: 本文首次提出具有可证明保证的LDA主题推断实用算法,通过创新组合方法实现指数级加速,同时保持可解释性以支持下游因果分析。
English: This paper introduces the first practical algorithms with provable guarantees for LDA topic inference, using a novel combinatorial approach that achieves exponential speedup and maintains interpretability for downstream causal analysis.
Authors:Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, Yusung Kim
Abstract:
Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS.
中文: 提出的图辅助拼接(GAS)框架通过将子目标选择构建为图搜索问题,利用时序距离表示和时序效率指标优化状态转换拼接,在多项任务中显著超越了现有离线分层强化学习方法。
English: The proposed Graph-Assisted Stitching (GAS) framework replaces explicit high-level policy learning with graph-based subgoal selection, utilizing Temporal Distance Representation and Temporal Efficiency metrics to enhance transition stitching and achieve superior performance across various tasks compared to prior offline hierarchical reinforcement learning methods.
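The graph-search formulation is straightforward to sketch: once states are clustered into nodes and observed transitions into weighted edges, subgoal selection reduces to a shortest-path query. The toy example below (using networkx; node ids, edges, and weights are made up) shows only that reduction, not the TDR embedding or the TE filtering.

```python
# Toy subgoal selection as graph search: nodes are state clusters, edge weights
# stand in for temporal distances, and the subgoal sequence is the shortest path.
import networkx as nx

edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 5, 2.5), (1, 3, 1.0), (3, 4, 1.0), (4, 5, 0.5)]
G = nx.Graph()
G.add_weighted_edges_from(edges)

current_node, goal_node = 0, 5
subgoal_sequence = nx.shortest_path(G, current_node, goal_node, weight="weight")
print(subgoal_sequence)  # [0, 1, 3, 4, 5]; a low-level policy then reaches each node in turn
```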
Authors:Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang
Abstract:
As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose $\underline{\textbf{F}}$inetuning-$\underline{\textbf{a}}$ligned $\underline{\textbf{S}}$equential $\underline{\textbf{T}}$raining ($\textit{FAST}$), a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, $\textit{FAST}$ achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features, for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT(P)}$ and $\textit{BT(F)}$. Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.
中文摘要:FAST方法通过将训练过程与指令模型的数据分布和激活模式对齐,显著提升了稀疏自编码器在指令模型上的重建质量和特征可解释性,同时揭示了通过特殊令牌干预来精细调控模型行为的新途径。
English Summary: The proposed FAST method significantly enhances sparse autoencoder performance for instruct models by aligning training with their data distribution and activation patterns, achieving superior reconstruction accuracy and feature interpretability while revealing new opportunities for model behavior control.
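For context, the sketch below is a plain sparse autoencoder of the kind trained on LLM activations: an overcomplete ReLU encoder, a linear decoder, and a reconstruction-plus-L1 objective. FAST changes how training data are sequenced and aligned for instruct models; that procedure is not reproduced here, and the dimensions and penalty weight are assumptions.

```python
# Generic sparse autoencoder on residual-stream activations (not the FAST recipe).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # sparse feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=512, d_features=4096)
acts = torch.randn(64, 512)                            # stand-in activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
print(float(loss))
```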
Authors:Yuchong Long, Wen Sun, Ningxiao Sun, Wenxiao Wang, Chao Li, Shan Yin
Abstract:
Automated pollen recognition is vital to paleoclimatology, biodiversity monitoring, and public health, yet conventional methods are hampered by inefficiency and subjectivity. Existing deep learning models often struggle to achieve the requisite localization accuracy for microscopic targets like pollen, which are characterized by their minute size, indistinct edges, and complex backgrounds. To overcome this limitation, we introduce HieraEdgeNet, a multi-scale edge-enhancement framework. The framework's core innovation is the introduction of three synergistic modules: the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale pyramid of edge features that corresponds to the semantic hierarchy at early network stages; the Synergistic Edge Fusion (SEF) module, for deeply fusing these edge priors with semantic information at each respective scale; and the Cross Stage Partial Omni-Kernel Module (CSPOKM), which maximally refines the most detail-rich feature layers using an Omni-Kernel operator - comprising anisotropic large-kernel convolutions and mixed-domain attention - all within a computationally efficient Cross-Stage Partial (CSP) framework. On a large-scale dataset comprising 120 pollen classes, HieraEdgeNet achieves a mean Average Precision (mAP@.5) of 0.9501, significantly outperforming state-of-the-art baseline models such as YOLOv12n and RT-DETR. Furthermore, qualitative analysis confirms that our approach generates feature representations that are more precisely focused on object boundaries. By systematically integrating edge information, HieraEdgeNet provides a robust and powerful solution for high-precision, high-efficiency automated detection of microscopic objects.
中文: HieraEdgeNet提出了一种多尺度边缘增强框架,通过三个协同模块解决深度学习在定位微观花粉时的局限性,在大规模数据集上实现了卓越的准确性和效率。
English: HieraEdgeNet introduces a multi-scale edge-enhancement framework with three synergistic modules to overcome deep learning challenges in localizing microscopic pollen, achieving superior accuracy and efficiency on a large-scale dataset.
Authors:Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, Dongzhan Zhou
Abstract:
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools, ranging from basic information retrieval to complex reaction prediction, with a dataset curation pipeline that generates ChemToolBench, a dataset facilitating both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent.
中文摘要:本研究提出了一种基于大语言模型的化学智能体,通过整合137种专业工具和新型分层进化蒙特卡洛树搜索框架,实现了化学问答与发现任务的性能突破,有效解决了专业工具与大模型融合的挑战。
English Summary: This study introduces a chemistry-focused LLM agent that integrates 137 specialized tools and a novel HE-MCTS framework to significantly enhance chemical QA and discovery tasks through optimized tool utilization and self-improving data generation.
Authors:Zhangchi Zhao, Jun Shu, Deyu Meng, Zongben Xu
Abstract:
Inspired by the Kolmogorov-Arnold representation theorem, KANs offer a novel framework for function approximation by replacing traditional neural network weights with learnable univariate functions. This design demonstrates significant potential as an efficient and interpretable alternative to traditional MLPs. However, KANs are characterized by a substantially larger number of trainable parameters, leading to challenges in memory efficiency and higher training costs compared to MLPs. To address this limitation, we propose to generate weights for KANs via a smaller meta-learner, called MetaKANs. By training KANs and MetaKANs in an end-to-end differentiable manner, MetaKANs achieve comparable or even superior performance while significantly reducing the number of trainable parameters and maintaining promising interpretability. Extensive experiments on diverse benchmark tasks, including symbolic regression, partial differential equation solving, and image classification, demonstrate the effectiveness of MetaKANs in improving parameter efficiency and memory usage. The proposed method provides an alternative technique for training KANs, that allows for greater scalability and extensibility, and narrows the training cost gap with MLPs stated in the original paper of KANs. Our code is available at https://github.com/Murphyzc/MetaKAN.
中文: MetaKANs通过引入元学习器为KANs生成权重,在多种任务中显著减少了可训练参数和训练成本,同时保持或提升了性能与可解释性。
English: MetaKANs introduce a meta-learner to generate weights for Kolmogorov-Arnold Networks (KANs), significantly reducing trainable parameters and training costs while maintaining or enhancing performance and interpretability across various tasks.
Authors:Shuqiang Zhang, Yuchao Zhang, Jinkun Chen, Haochen Sui
Abstract:
Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error imputation based, inverse propensity scoring, and doubly robust techniques have been developed. Despite the progress, from the structural causal model perspective, previous debiasing methods in RS assume the independence of the exogenous variables. In this paper, we release this assumption and propose a learning algorithm based on likelihood maximization to learn a prediction model. We first discuss the correlation and difference between unmeasured confounding and our scenario, then we propose a unified method that effectively handles latent exogenous variables. Specifically, our method models the data generation process with latent exogenous variables under mild normality assumptions. We then develop a Monte Carlo algorithm to numerically estimate the likelihood function. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of our proposed method. The code is at https://github.com/WallaceSUI/kdd25-background-variable.
中文: 本文针对推荐系统中的选择偏差问题,提出了一种基于似然最大化的新型学习算法,通过蒙特卡洛方法建模潜在外生变量,并在合成与真实数据集上验证了其有效性。
English: This paper addresses selection bias in recommendation systems by proposing a novel learning algorithm that models latent exogenous variables using likelihood maximization and a Monte Carlo estimation method, demonstrating effectiveness through synthetic and real-world datasets.
Authors:Libo Wang
Abstract:
In the chain-of-model (CoM) paradigm, each subchain relies only on information from the previous subchain, and the causal mask blocks global context flow between multi-level subchains, which can cause long-range dependencies to be lost. To address this, this work proposes a graph of causal evolution (GoCE). Its core principle is to map implicit token representations into a differentiable, sparse causal adjacency matrix and then propagate causal constraints through every layer of computation via causal-masked attention and a causal-MoE. By combining an intervention consistency loss with a self-evolution gate, GoCE achieves a dynamic balance between causal structure learning and adaptive updating of the transformer architecture. Experimental environments were built in sandboxes based on Claude Sonnet 4, o4-mini-high, and DeepSeek R1, each using the transformer-variant architecture introduced with GoCE, and evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA against baseline LLMs. The findings show that GoCE strengthens the transformer's ability to capture long-range causal dependencies while improving its capacity to self-evolve. It not only surpasses CoM in its design principles but also provides experience for future research on causal learning and continual adaptive improvement.
中文: 本研究提出因果演化图(GoCE)以解决链式模型中因因果掩码导致长程依赖丢失的问题,通过将隐式表征映射为稀疏因果邻接矩阵并结合因果注意力机制,增强了Transformer捕捉长程因果依赖与自我演化的能力,其性能优于基线模型。
English: This work introduces the Graph of Causal Evolution (GoCE) to address the loss of long-range dependencies in chain-of-model architectures by mapping token representations into a sparse causal adjacency matrix and integrating causal constraints through attention mechanisms, ultimately enhancing transformers' ability to capture causal relationships and self-evolve beyond baseline models.
Authors:Dasol Hong, Wooju Lee, Hyun Myung
Abstract:
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix.
Chinese: 提示调优通过仅优化提示来有效适配视觉语言模型,但面临特征错位和泛化难题;提出的CoCoA-Mix方法结合混淆感知损失和权重,在提升专业化和泛化能力方面均优于现有最优方法。
English: Prompt tuning effectively adapts vision-language models by optimizing only the prompt, but faces challenges with misaligned features and generalization; the proposed CoCoA-Mix method, incorporating confusion-aware loss and weights, enhances both specialization and generalization, outperforming state-of-the-art approaches.
Authors:Thomas Zhu, Joshua Clune, Jeremy Avigad, Albert Qiaochu Jiang, Sean Welleck
Abstract:
Neural methods are transforming automated reasoning for proof assistants, yet integrating these advances into practical verification workflows remains challenging. Hammers are tools that interface with external automatic theorem provers to automate tedious reasoning steps. They have dramatically improved productivity in proof assistants, but the Lean proof assistant still does not have a hammer despite its growing popularity. We present LeanHammer, the first end-to-end domain-general hammer for Lean, built on a novel neural premise selection system for a hammer in dependent type theory. Unlike existing Lean premise selectors, our approach dynamically adapts to user-specific contexts and combines with symbolic proof search and reconstruction to create a practical hammer. With comprehensive evaluations, we show that our premise selector enables LeanHammer to solve 21\% more goals relative to existing premise selectors, and generalize well to diverse domains. Our work bridges the gap between neural retrieval and symbolic reasoning, making formal verification more accessible to researchers and practitioners.
中文摘要:LeanHammer是首个面向Lean证明助手的端到端通用锤子系统,其采用新型神经前提选择技术,能动态适应用户上下文并与符号证明搜索相结合,显著提升了定理自动证明能力。
English Summary: LeanHammer is the first comprehensive hammer for the Lean proof assistant, featuring a novel neural premise selection system that adapts to user contexts and integrates with symbolic proof search to significantly enhance automated theorem proving capabilities.
Authors:Jie Peng, Hongwei Yang, Jing Zhao, Hengji Dong, Hui He, Weizhe Zhang, Haoyu He
Abstract:
Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Notably, recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities. Our code is available at https://github.com/JiePeng104/TSC.
中文: 提出的两阶段对称连通性(TSC)防御方法利用少量干净样本和理论原理,有效清除受后门攻击的模型,在监督学习和自监督学习框架中均展现出强大的防御性能。
English: The proposed Two-stage Symmetry Connectivity (TSC) defense effectively purifies backdoor-infected models using minimal clean samples and theoretical principles, demonstrating robust performance in both supervised and self-supervised learning frameworks.
Authors:Alexander Kolpakov, Igor Rivin
Abstract:
Computing classical centrality measures such as betweenness and closeness is computationally expensive on large-scale graphs. In this work, we introduce an efficient force layout algorithm that embeds a graph into a low-dimensional space, where the radial distance from the origin serves as a proxy for various centrality measures. We evaluate our method on multiple graph families and demonstrate strong correlations with degree, PageRank, and path-based centralities. As an application, the proposed embedding makes it possible to find high-influence nodes in a network and provides a fast and scalable alternative to the standard greedy algorithm.
中文: 本研究提出一种高效的力导向布局算法,将图嵌入低维空间,利用径向距离作为中心性度量的替代指标,为识别网络中有影响力的节点提供了一种快速且可扩展的解决方案。
English: The study presents an efficient force layout algorithm that embeds graphs into low-dimensional space, using radial distance as a proxy for centrality measures and offering a scalable alternative for identifying influential nodes.
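The claim that radial distance tracks centrality can be probed with off-the-shelf tools. The sketch below substitutes networkx's spring_layout for the paper's custom force layout (so the correlation it prints will differ in strength and possibly sign from the paper's results) and compares each node's distance from the origin with its degree centrality.

```python
# Quick empirical check of the radial-distance idea using a generic force layout.
import networkx as nx
import numpy as np

G = nx.barabasi_albert_graph(300, 3, seed=0)
pos = nx.spring_layout(G, seed=0)                      # force-directed 2D embedding
dc = nx.degree_centrality(G)
radial = np.array([np.linalg.norm(pos[v]) for v in G.nodes()])
degree = np.array([dc[v] for v in G.nodes()])
print("corr(radial distance, degree centrality):", np.corrcoef(radial, degree)[0, 1])
```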
Authors:Philip R. Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ramani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, Shu Hu
Abstract:
The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue-M2/MedChat.
中文总结:MedChat是一个多智能体诊断框架,通过结合专业视觉模型与角色定制的大语言模型,提高了青光眼检测的可靠性,减少幻觉现象,并提供交互式临床报告功能。
English Summary: MedChat is a multi-agent diagnostic framework that integrates specialized vision models with role-specific LLMs to enhance glaucoma detection reliability, reduce hallucinations, and provide interactive clinical reporting.
Authors:Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan
Abstract:
Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both $\textit{high-level, generalizable insights}$ that enable the system to leverage cross-trial knowledge, and $\textit{fine-grained, condensed interaction trajectories}$ that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\%$ and $10.12\%$, respectively, without any modifications to the original frameworks. Our codes are available at https://github.com/bingreeky/GMemory.
中文: G-Memory作为一种层次化、智能的记忆系统,通过三层图结构管理多智能体系统的交互,无需修改原有框架即可显著提升任务成功率和准确性。
English: G-Memory is a hierarchical, agentic memory system designed to enhance multi-agent systems by managing interactions through a three-tier graph structure, which significantly improves task success rates and accuracy without altering existing frameworks.
Authors:Chengchao Shen, Dawei Liu, Jianxin Wang
Abstract:
Contrastive learning for single-object-centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on the widespread images that contain multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single-object-centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representation of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on the ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves leading unsupervised representation performance on both single-object-centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
中文: 本文提出的多对象拼接(MOS)方法通过将单对象图像合成为多对象图像,无需人工标注即可提供额外对象对应关系,从而在多对象图像的无监督表征学习上取得领先性能。
English: The proposed Multiple Object Stitching (MOS) method enhances unsupervised representation learning for multi-object images by synthesizing them from single-object images, providing additional object correspondences without human annotation and achieving superior performance on various datasets.
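The stitching construction itself is simple: paste single-object crops into a grid so that the location of every object in the synthetic multi-object image is known by construction. The sketch below shows a 2x2 version with dummy arrays; crop sizes and layout are assumptions, and MOS's contrastive objective is not shown.

```python
# Build a synthetic multi-object image by stitching single-object crops; because
# the layout is chosen by us, object correspondences come for free.
import numpy as np

def stitch_2x2(imgs):
    """imgs: list of four (H, W, 3) uint8 arrays of equal size."""
    top = np.concatenate([imgs[0], imgs[1]], axis=1)
    bottom = np.concatenate([imgs[2], imgs[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)

crops = [np.full((64, 64, 3), fill, dtype=np.uint8) for fill in (50, 100, 150, 200)]
multi = stitch_2x2(crops)
print(multi.shape)   # (128, 128, 3); each quadrant's source object is known
```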
Authors:Weijie Guan, Haohui Wang, Jian Kang, Lihui Liu, Dawei Zhou
Abstract:
Graph learning has been crucial to many real-world tasks, but they are often studied with a closed-world assumption, with all possible labels of data known a priori. To enable effective graph learning in an open and noisy environment, it is critical to inform the model users when the model makes a wrong prediction to in-distribution data of a known class, i.e., misclassification detection or when the model encounters out-of-distribution from novel classes, i.e., out-of-distribution detection. This paper introduces Evidential Reasoning Network (EVINET), a framework that addresses these two challenges by integrating Beta embedding within a subjective logic framework. EVINET includes two key modules: Dissonance Reasoning for misclassification detection and Vacuity Reasoning for out-of-distribution detection. Extensive experiments demonstrate that EVINET outperforms state-of-the-art methods across multiple metrics in the tasks of in-distribution classification, misclassification detection, and out-of-distribution detection. EVINET demonstrates the necessity of uncertainty estimation and logical reasoning for misclassification detection and out-of-distribution detection and paves the way for open-world graph learning. Our code and data are available at https://github.com/SSSKJ/EviNET.
中文: EVINET框架通过将Beta嵌入与主观逻辑相结合,有效解决了图学习中的误分类和分布外检测问题,在多项指标上表现优异,推动了开放世界图学习的发展。
English: EVINET is a novel framework that tackles misclassification and out-of-distribution detection in graph learning by integrating Beta embeddings with subjective logic, demonstrating superior performance across multiple metrics and advancing open-world graph learning.
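As background for the vacuity and dissonance signals, the sketch below computes the standard subjective-logic quantities from a per-class evidence vector, following the formulas commonly used in evidential deep learning (vacuity = K/S for Dirichlet strength S; dissonance from pairwise belief balance). EVINET's Beta-embedding modules build on such signals but are not reproduced here.

```python
# Vacuity (lack of evidence) and dissonance (conflicting evidence) from evidence.
import numpy as np

def vacuity_and_dissonance(evidence):
    """evidence: non-negative per-class evidence vector of length K."""
    K = len(evidence)
    alpha = evidence + 1.0
    S = alpha.sum()                     # Dirichlet strength
    belief = evidence / S
    vacuity = K / S
    diss = 0.0
    for k in range(K):
        others = np.delete(belief, k)
        if others.sum() == 0:
            continue
        bal = np.where(others + belief[k] > 0,
                       1.0 - np.abs(others - belief[k]) / (others + belief[k]), 0.0)
        diss += belief[k] * (others * bal).sum() / others.sum()
    return vacuity, diss

print(vacuity_and_dissonance(np.array([0.1, 0.1, 0.1])))    # high vacuity: little evidence
print(vacuity_and_dissonance(np.array([10.0, 10.0, 0.0])))  # high dissonance: conflict
```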
Authors:Tianci Bu, Chuanrui Wang, Hao Ma, Haoren Zheng, Xin Lu, Tailin Wu
Abstract:
Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce \textbf{GGBall}, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, our model reduces degree MMD by over 75\% on Community-Small and over 40\% on Ego-Small compared to state-of-the-art baselines, demonstrating an improved ability to preserve topological hierarchies. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains. Our code is available at \href{https://github.com/AI4Science-WestlakeU/GGBall}{here}.
中文摘要:GGBall提出了一种基于双曲几何的图生成框架,通过结合双曲向量量化自编码器和黎曼流匹配技术,能更好地捕捉层次结构,在保持拓扑特性方面显著优于现有方法。
English Summary: GGBall introduces a hyperbolic geometry-based framework for graph generation, combining a Hyperbolic Vector-Quantized Autoencoder with Riemannian flow matching to better capture hierarchical structures and significantly outperform existing methods in preserving topological properties.
Authors:Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, Tong Xu
Abstract:
Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO's capability to achieve dimension-aware preference alignment, highlighting its superiority. Our codes and datasets are available at https://github.com/Javkonline/AMoPO.
中文: AMoPO提出了一种无需辅助奖励模型即可动态平衡大语言模型中多维度偏好的新框架,相比现有方法性能提升28.5%,并在不同规模模型上展现出优秀的扩展能力。
English: AMoPO introduces a novel framework that dynamically balances multiple preference dimensions in large language models without requiring auxiliary reward models, achieving a 28.5% performance improvement over existing methods while demonstrating strong scalability across different model sizes.
Authors:Rong-Xi Tan, Ming Chen, Ke Xue, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian
Abstract:
The pursuit of universal black-box optimization (BBO) algorithms is a longstanding goal. However, unlike domains such as language or vision, where scaling structured data has driven generalization, progress in offline BBO remains hindered by the lack of unified representations for heterogeneous numerical spaces. Thus, existing offline BBO approaches are constrained to single-task and fixed-dimensional settings, failing to achieve cross-domain universal optimization. Recent advances in language models (LMs) offer a promising path forward: their embeddings capture latent relationships in a unifying way, making universal optimization across different data types possible. In this paper, we discuss multiple potential approaches, including an end-to-end learning framework in the form of next-token prediction, as well as prioritizing the learning of latent spaces with strong representational capabilities. To validate the effectiveness of these methods, we collect offline BBO tasks and data from open-source academic works for training. Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided at https://github.com/lamda-bbo/universal-offline-bbo.
Chinese: 语言模型的最新进展通过利用其统一嵌入和学习框架,为克服异构数值空间中的传统障碍,实现通用黑盒优化提供了有希望的途径。
English: Recent advances in language models provide a promising path toward universal black-box optimization by leveraging their unified embeddings and learning frameworks to overcome traditional barriers in heterogeneous numerical spaces.
Authors:Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai
Abstract:
Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments reveal that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in imputation accuracy. The source code is publicly available at https://github.com/joyHJL/CoFILL.
Chinese Summary: 针对时空数据修复中复杂依赖关系和误差传播的难题,CoFILL创新性地采用条件扩散模型与双流架构,通过并行处理时域和频域特征实现高质量数据补全,在修复精度上显著超越现有最优方法。
English Summary: Spatiotemporal data imputation is challenging due to complex interdependencies and error propagation in existing methods, but CoFILL, a novel conditional diffusion model with dual-stream architecture, overcomes these limitations by generating high-quality imputations that outperform state-of-the-art approaches.
Authors:Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
Abstract:
As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to internal activations of LLMs during inference, which will further induce the refusal behaviors of LLMs. However, indiscriminately applying activation steering fundamentally suffers from the trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it considers activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, with the null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our codes are available at https://github.com/AlphaLab-USTC/AlphaSteer.
中文: AlphaSteer是一种基于理论的激活引导方法,通过为恶意提示学习拒绝向量来增强大语言模型安全性,同时利用近零向量保持良性输入的实用性,有效平衡了安全性与性能。
English: AlphaSteer is a theoretically grounded activation steering method that enhances LLM safety by learning refusal vectors for malicious prompts while preserving utility through nearly zero vectors for benign inputs, effectively balancing safety and performance.
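The basic mechanism being improved is easy to demonstrate: add a fixed refusal-direction vector to one layer's activations via a forward hook at inference time. The sketch below shows only that vanilla steering on a stand-in linear layer (the direction and strength are arbitrary); AlphaSteer's contribution, learning the steering so that it is near-zero on benign inputs via null-space constraints, is not shown.

```python
# Bare-bones activation steering: a forward hook shifts one layer's output
# along a fixed "refusal direction" at inference time.
import torch
import torch.nn as nn

layer = nn.Linear(16, 16)                  # stand-in for one transformer block output
refusal_direction = torch.randn(16)
refusal_direction = refusal_direction / refusal_direction.norm()
strength = 4.0

def steering_hook(module, inputs, output):
    return output + strength * refusal_direction   # shift activations toward refusal

handle = layer.register_forward_hook(steering_hook)
print(layer(torch.randn(2, 16)).shape)     # steered activations, shape (2, 16)
handle.remove()
```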
Authors:Arun Sharma, Mingzhou Yang, Majid Farhadloo, Subhankar Ghosh, Bharat Jayaprakash, Shashi Shekhar
Abstract:
Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectories). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-based deep fake generation (e.g., additive noise, fake trajectories) and the lack of an adequate amount of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, these methods do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and a lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at https://github.com/arunshar/Physics-Informed-Diffusion-Probabilistic-Model.
中文摘要:本研究提出了一种融合运动学约束的物理信息扩散模型,用于检测GPS欺骗等异常轨迹,在海上和城市数据集中实现了更高的检测精度和更低的误差率。
English Summary: This study introduces a physics-informed diffusion model that incorporates kinematic constraints to detect anomalous trajectories, such as GPS spoofing, achieving higher accuracy and lower error rates in maritime and urban datasets.
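A concrete example of the kind of kinematic constraint such a model can exploit: trajectory segments whose implied speed or acceleration exceeds physically plausible limits are suspect. The sketch below flags them for a toy trajectory; the thresholds are illustrative, and this is a hand-written check, not the proposed diffusion model.

```python
# Flag trajectory segments that violate simple kinematic limits.
import numpy as np

def kinematic_flags(xy, t, v_max=15.0, a_max=1.0):
    """xy: (N, 2) positions in meters; t: (N,) timestamps in seconds."""
    dt = np.diff(t)
    v = np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt          # segment speeds
    a = np.abs(np.diff(v)) / dt[1:]                               # speed changes
    return (v > v_max), (a > a_max)

t = np.arange(6.0) * 10.0
xy = np.array([[0, 0], [50, 0], [100, 0], [1500, 0], [1550, 0], [1600, 0]], float)
speed_flags, accel_flags = kinematic_flags(xy, t)
print(speed_flags)   # the 100 -> 1500 m jump in 10 s is flagged as implausible
```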
Authors:Mingyi Li, Michael R. Metel, Akiko Takeda
Abstract:
The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on its ability to achieve a globally optimal solution, there still lacks a rigorous analysis of its local optimality guarantees. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete sense, with the same computational complexity as the original K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice, while our proposed methods provide improved locally optimal solutions with reduced clustering loss. Our code is available at https://github.com/lmingyi/LO-K-means.
Chinese: 本文分析了K-means算法的局部最优性,并提出了在保持相同计算复杂度的前提下,能够确保算法在广义Bregman散度下收敛到局部最优解的改进方法。
English: This paper analyzes the local optimality of the K-means algorithm and proposes modifications that ensure convergence to locally optimal solutions under general Bregman divergence, maintaining the same computational complexity as the original algorithm.
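For context, a bare-bones Bregman k-means iteration is sketched below, with squared Euclidean distance as the example divergence; the paper's contribution is the modifications that certify local optimality, which are not reproduced here. The sketch relies on the standard fact that, for any Bregman divergence, the cluster mean minimizes the within-cluster loss.

import numpy as np

def sq_euclidean(X, C):
    # Pairwise squared Euclidean distances, shape (n, k).
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)

def bregman_kmeans(X, k, divergence=sq_euclidean, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]       # random initial centroids
    for _ in range(iters):
        labels = divergence(X, C).argmin(1)           # assignment step
        # For any Bregman divergence, the loss-minimizing centroid is the mean.
        C = np.stack([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    return labels, C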
Authors:Anastasia Koloskova, Youssef Allouah, Animesh Jha, Rachid Guerraoui, Sanmi Koyejo
Abstract:
We address the problem of machine unlearning, where the goal is to remove the influence of specific training data from a model upon request, motivated by privacy concerns and regulatory requirements such as the "right to be forgotten." Unfortunately, existing methods rely on restrictive assumptions or lack formal guarantees. To this end, we propose a novel method for certified machine unlearning, leveraging the connection between unlearning and privacy amplification by stochastic post-processing. Our method uses noisy fine-tuning on the retain data, i.e., data that does not need to be removed, to ensure provable unlearning guarantees. This approach requires no assumptions about the underlying loss function, making it broadly applicable across diverse settings. We analyze the theoretical trade-offs in efficiency and accuracy and demonstrate empirically that our method not only achieves formal unlearning guarantees but also performs effectively in practice, outperforming existing baselines. Our code is available at https://github.com/stair-lab/certified-unlearning-neural-networks-icml-2025
中文: 本文提出了一种经过认证的机器遗忘方法,通过对保留数据进行噪声微调来提供无需严格假设的形式化遗忘保证,在理论和实践中均表现出优越性能。
English: This paper introduces a certified machine unlearning method that uses noisy fine-tuning on retained data to provide formal unlearning guarantees without restrictive assumptions, demonstrating both theoretical and practical effectiveness.
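A hedged PyTorch sketch of noisy fine-tuning on retain data: ordinary SGD steps on the retain set with Gaussian noise added to the parameters after each step (the stochastic post-processing). The noise scale sigma, learning rate, and schedule here are placeholders and are not calibrated to yield the paper's formal guarantees.

import torch

def noisy_finetune(model, retain_loader, loss_fn, steps=1000, lr=1e-3, sigma=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    it = iter(retain_loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(retain_loader)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        with torch.no_grad():
            for p in model.parameters():
                p.add_(sigma * torch.randn_like(p))   # Gaussian post-processing noise
    return model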
Authors:Mellon M. Zhang, Glen Chou, Saibal Mukhopadhyay
Abstract:
Accurate and efficient object detection is essential for autonomous vehicles, where real-time perception requires low latency and high throughput. LiDAR sensors provide robust depth information, but conventional methods process full 360° scans in a single pass, introducing significant delay. Streaming approaches address this by sequentially processing partial scans in the native polar coordinate system, yet they rely on translation-invariant convolutions that are misaligned with polar geometry -- resulting in degraded performance or requiring complex distortion mitigation. Recent Mamba-based state space models (SSMs) have shown promise for LiDAR perception, but only in the full-scan setting, relying on geometric serialization and positional embeddings that are memory-intensive and ill-suited to streaming. We propose Polar Hierarchical Mamba (PHiM), a novel SSM architecture designed for polar-coordinate streaming LiDAR. PHiM uses local bidirectional Mamba blocks for intra-sector spatial encoding and a global forward Mamba for inter-sector temporal modeling, replacing convolutions and positional encodings with distortion-aware, dimensionally-decomposed operations. PHiM sets a new state-of-the-art among streaming detectors on the Waymo Open Dataset, outperforming the previous best by 10\% and matching full-scan baselines at twice the throughput. Code will be available at https://github.com/meilongzhang/Polar-Hierarchical-Mamba .
中文: 提出的极坐标分层Mamba(PHiM)架构通过用感知畸变的Mamba模块替代传统卷积,解决了流式LiDAR检测的低效问题,在Waymo开放数据集中以超越先前最佳方法10%的性能刷新了最高水平,并实现了与全扫描相当的吞吐量。
English: The proposed Polar Hierarchical Mamba (PHiM) architecture addresses inefficiencies in streaming LiDAR detection by replacing conventional convolutions with distortion-aware Mamba blocks, achieving state-of-the-art performance on the Waymo Open Dataset with a 10% improvement over prior methods and matching full-scan throughput.
Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Abstract:
In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance. The source code is available at https://github.com/Div290/FREE.
Chinese: 本文提出FREE方法,通过基于GAN框架的对抗性训练在视觉语言模型中实现早期退出策略,在保持性能的同时将推理速度提升超过1.51倍。
English: The paper introduces FREE, an adversarial training method using a GAN-based framework to implement Early Exit strategies in Vision-Language Models, significantly accelerating inference by over 1.51x while maintaining performance.
Authors:Armin Behnamnia, Gholamali Aminian, Alireza Aghaei, Chengchun Shi, Vincent Y. F. Tan, Hamid R. Rabiee
Abstract:
Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret -- the performance gap between our LSE estimator and the optimal policy -- assuming a bounded $(1+\epsilon)$-th moment of the weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $\epsilon \in [0,1]$ and $n$ is the size of the logged bandit feedback dataset. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning.
中文摘要:作者提出了一种基于对数求和指数(LSE)的新型估计器,通过减少方差并在重尾奖励分布下保持鲁棒性,有效解决了离线策略学习和评估中的关键挑战,并提供了理论保证和实证验证。
English Summary: The authors propose a novel log-sum-exponential (LSE) estimator that addresses challenges in off-policy learning and evaluation by reducing variance and demonstrating robustness under heavy-tailed reward distributions, with theoretical guarantees and empirical validation.
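One plausible instantiation of a log-sum-exponential estimator is sketched below next to standard IPS for contrast; the exact parameterization and the choice of lam in the paper may differ. A negative lam damps large importance-weighted terms, which is the intuition behind the claimed robustness to heavy-tailed rewards, and the estimate approaches the plain IPS mean as lam goes to zero.

import numpy as np

def ips_estimate(rewards, target_probs, logged_props):
    # Standard inverse-propensity-score estimate of the target policy's value.
    return np.mean(rewards * target_probs / logged_props)

def lse_estimate(rewards, target_probs, logged_props, lam=-0.1):
    w = target_probs / logged_props          # importance weights
    z = lam * w * rewards
    return np.log(np.mean(np.exp(z))) / lam  # log-sum-exp aggregation of weighted rewards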
Authors:Ilya Kaufman Sirot, Omri Azencot
Abstract:
Deep learning models with a large number of parameters, often referred to as over-parameterized models, have achieved exceptional performance across various tasks. Despite concerns about overfitting, these models frequently generalize well to unseen data, thanks to effective regularization techniques, with data augmentation being among the most widely used. While data augmentation has shown great success in classification tasks using label-preserving transformations, its application in regression problems has received less attention. Recently, a novel \emph{manifold learning} approach for generating synthetic data was proposed, utilizing a first-order approximation of the data manifold. Building on this foundation, we present a theoretical framework and practical tools for approximating and sampling general data manifolds. Furthermore, we introduce the Curvature-Enhanced Manifold Sampling (CEMS) method for regression tasks. CEMS leverages a second-order representation of the data manifold to enable efficient sampling and reconstruction of new data points. Extensive evaluations across multiple datasets and comparisons with state-of-the-art methods demonstrate that CEMS delivers superior performance in both in-distribution and out-of-distribution scenarios, while introducing only minimal computational overhead. Code is available at https://github.com/azencot-group/CEMS.
Chinese: 过参数化深度学习模型通过数据增强等正则化技术实现良好泛化,而提出的曲率增强流形采样(CEMS)方法利用二阶流形近似,以最小计算开销显著提升回归任务的性能。
English: Over-parameterized deep learning models achieve strong generalization through regularization like data augmentation, and the proposed Curvature-Enhanced Manifold Sampling (CEMS) method leverages second-order manifold approximations to significantly improve regression performance with minimal computational cost.
Authors:Fudong Lin, Wanrou Du, Jinchan Liu, Tarikul Milon, Shelby Meche, Wu Xu, Xiaoqi Qin, Xu Yuan
Abstract:
Deep neural networks, particularly Transformers, have been widely adopted for predicting the functional properties of proteins. In this work, we focus on exploring whether Protein Transformers can capture biological intelligence among protein sequences. To achieve our goal, we first introduce a protein function dataset, namely Protein-FN, providing over 9000 protein samples with meaningful labels. Second, we devise a new Transformer architecture, namely Sequence Protein Transformers (SPT), for computationally efficient protein function predictions. Third, we develop a novel Explainable Artificial Intelligence (XAI) technique called Sequence Score, which can efficiently interpret the decision-making processes of protein models, thereby overcoming the difficulty of deciphering biological intelligence hidden in Protein Transformers. Remarkably, even our smallest SPT-Tiny model, which contains only 5.4M parameters, demonstrates impressive predictive accuracy, achieving 94.3% on the Antibiotic Resistance (AR) dataset and 99.6% on the Protein-FN dataset, all accomplished by training from scratch. In addition, our Sequence Score technique reveals that our SPT models discover several meaningful patterns underlying the sequence structures of protein data, with these patterns aligning closely with domain knowledge in the biology community. We have officially released our Protein-FN dataset on Hugging Face Datasets https://huggingface.co/datasets/Protein-FN/Protein-FN. Our code is available at https://github.com/fudong03/BioIntelligence.
中文: 本研究提出了Protein-FN蛋白质功能数据集、新型序列蛋白质Transformer(SPT)预测模型,以及可解释AI技术Sequence Score,揭示了SPT模型如何捕捉蛋白质序列中的生物学模式,其最小模型仅用540万参数即实现了优异预测精度。
English: This study introduces a Protein-FN dataset, a novel Sequence Protein Transformer (SPT) model for efficient protein function prediction, and an explainable AI technique called Sequence Score that reveals how SPT models capture biologically meaningful patterns in protein sequences, achieving high accuracy with minimal parameters.
Authors:Yuan Yuan, Yukun Liu, Chonghua Han, Jie Feng, Yong Li
Abstract:
Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models. To facilitate reproducibility and future research, we have released the code and models at https://github.com/tsinghua-fib-lab/MoveGCL.
中文:MoveGCL是一个通过生成式持续学习实现去中心化移动基础模型训练的隐私保护框架,在保护敏感数据的同时,在异构城市数据集上取得了与联合训练相当的性能表现。
English: MoveGCL is a privacy-preserving framework that enables decentralized training of mobility foundation models through generative continual learning, achieving performance comparable to joint training while protecting sensitive data across heterogeneous urban datasets.
Authors:Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, Brian Bullins
Abstract:
While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep network training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and large language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods. Code can be found at https://github.com/xinyuluo8561/Stacey.
中文: 本文提出了一种名为Stacey的加速$\ell_p$最速下降算法,通过插值原始-对偶迭代序列有效处理深度学习中的非欧几里得优化问题,在图像分类和语言模型预训练任务中展现出比现有方法更快的收敛速度和更高的最终精度。
English: This paper introduces Stacey, an accelerated $\ell_p$ steepest descent algorithm that effectively addresses non-Euclidean optimization in deep learning, demonstrating superior convergence and accuracy over existing methods in both theoretical and empirical evaluations.
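For intuition, the following shows a plain (non-accelerated) $\ell_p$ steepest-descent step; Stacey's actual update adds interpolated primal-dual iterate sequences on top of this geometry and differs in its details. The exponent p, step size, and normalization convention below are illustrative choices.

import numpy as np

def lp_steepest_descent_step(x, grad, lr=1e-2, p=3.0, eps=1e-12):
    q = p / (p - 1.0)                              # dual exponent to p
    d = np.sign(grad) * np.abs(grad) ** (q - 1.0)  # steepest direction w.r.t. the l_p norm
    d /= (np.linalg.norm(grad, ord=q) + eps) ** (q - 2.0)
    # p = 2 recovers plain gradient descent; large p approaches a (scaled) sign update.
    return x - lr * d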
Authors:Joseph T Colonel, Carolyn Hagler, Guiselle Wismer, Laura Curtis, Jacqueline Becker, Juan Wisnivesky, Alex Federman, Gaurav Pandey
Abstract:
Several machine learning algorithms have been developed for the prediction of Alzheimer's disease and related dementia (ADRD) from spontaneous speech. However, none of these algorithms have been translated for the prediction of broader cognitive impairment (CI), which in some cases is a precursor and risk factor of ADRD. In this paper, we evaluated several speech-based open-source methods originally proposed for the prediction of ADRD, as well as methods from multimodal sentiment analysis for the task of predicting CI from patient audio recordings. Results demonstrated that multimodal methods outperformed unimodal ones for CI prediction, and that acoustics-based approaches performed better than linguistics-based ones. Specifically, interpretable acoustic features relating to affect and prosody were found to significantly outperform BERT-based linguistic features and interpretable linguistic features, respectively. All the code developed for this study is available at https://github.com/JTColonel/catch.
中文: 本研究将原本用于预测阿尔茨海默病的语音机器学习方法应用于更广泛的认知障碍检测,发现多模态和声学方法,特别是捕捉情感和韵律特征的方法,优于单模态和语言学方法。
English: This study adapts existing speech-based machine learning methods, originally designed for Alzheimer's disease prediction, to detect broader cognitive impairment, finding that multimodal and acoustic approaches, particularly those capturing affect and prosody, outperform unimodal and linguistic methods.
Authors:Zhiyuan Zhao, Juntong Ni, Shangqing Xu, Haoxin Liu, Wei Jin, B. Aditya Prakash
Abstract:
Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TimeRecipe that recommends suitable model architectures based on these empirical insights. The benchmark is available at: https://github.com/AdityaLab/TimeRecipe.
Chinese: TimeRecipe框架通过大量实验在模块级别系统评估时间序列预测方法,揭示了超越现有最优方法的设计选择,并提供了基于实证的架构推荐工具包。
English: The TimeRecipe framework systematically evaluates time-series forecasting models at the module level through extensive experiments, revealing optimal design choices that outperform state-of-the-art methods and providing a toolkit for architecture recommendations.
Authors:Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li
Abstract:
Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.
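For background, a classical empirical-dynamic-modeling forecast is sketched below: a time-delay embedding (Takens) followed by softmax-weighted kernel regression over past delay vectors. DeepEDM learns this machinery in a latent space with neural networks; this sketch only shows the non-learned skeleton, and the embedding dimension, lag, and temperature are placeholders.

import numpy as np

def delay_embed(x, dim=3, lag=1):
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag : i * lag + n] for i in range(dim)], axis=1)

def edm_forecast(x, dim=3, lag=1, temperature=1.0):
    x = np.asarray(x, dtype=float)
    E = delay_embed(x, dim, lag)                 # (m, dim) delay vectors
    query, library = E[-1], E[:-1]
    targets = x[(dim - 1) * lag + 1 :]           # value following each library vector
    d2 = ((library - query) ** 2).sum(1)         # squared distances to the query
    w = np.exp(-d2 / temperature)                # softmax-attention style weights
    w /= w.sum()
    return float(w @ targets)                    # one-step-ahead prediction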
Authors:Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
Abstract:
Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration--efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key--value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron , and our project homepage is at https://q-rz.github.io/p/saffron .
中文: 现有大语言模型安全方法易受越狱攻击,因此本研究提出SAFFRON创新推理扩展框架,通过多叉奖励模型显著提升安全防御能力并降低计算开销。
English: Current safety methods for large language models are vulnerable to jailbreak attacks, so this research introduces SAFFRON, a novel inference scaling approach using a multifurcation reward model to enhance safety while reducing computational costs.
Authors:Luis Pinto
Abstract:
Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we challenge this convention by conducting a comprehensive layer-wise analysis of five diverse molecular encoders across 22 ADMET property prediction tasks. Our results demonstrate that embeddings from intermediate layers consistently outperform final-layer representations. Specifically, using fixed embeddings from the optimal intermediate layers improved downstream performance by an average of 5.4%, reaching gains up to 28.6%. Furthermore, finetuning up to these intermediate layers yielded even greater average improvements of 8.5%, with performance increases as high as 40.8%, achieving new state-of-the-art results on several benchmarks. Additionally, a strong positive correlation between fixed embedding performance and finetuning outcomes supports an efficient evaluate-then-finetune approach, enabling identification of optimal layers with reduced computational cost. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code is made publicly available at https://github.com/luispintoc/Unlocking-Chemical-Insights.
中文: 本研究表明,在ADMET性质预测中,分子编码器的中间层嵌入始终优于最终层表示,固定嵌入可将性能提升高达28.6%,微调更可实现40.8%的增益,同时强相关性为高效选择最优层提供了依据。
English: This study reveals that intermediate-layer embeddings from molecular encoders consistently outperform final-layer representations in ADMET property prediction, with fixed embeddings improving performance by up to 28.6% and fine-tuning achieving gains of up to 40.8%, while demonstrating a strong correlation that enables efficient layer selection.
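A hedged sketch of the evaluate-then-finetune recipe: probe every hidden layer of a pretrained encoder with a cheap linear model and keep the best-scoring layer. The encoder checkpoint, mean pooling, and ridge probe below are illustrative choices, not the paper's exact setup.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def best_layer(smiles_list, y, model_name="seyonec/ChemBERTa-zinc-base-v1"):
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    with torch.no_grad():
        batch = tok(smiles_list, padding=True, truncation=True, return_tensors="pt")
        hidden = enc(**batch).hidden_states        # tuple of (n, T, d) per layer
    scores = []
    for h in hidden:
        X = h.mean(dim=1).numpy()                  # mean-pool token embeddings
        scores.append(cross_val_score(Ridge(), X, y, cv=5).mean())
    return int(np.argmax(scores)), scores          # layer index to finetune up to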
Authors:Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Chen-Fu Chen, Haim Permuter, Sajani Vithana, Flavio P. Calmon
Abstract:
Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks - such as coding - where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off detection accuracy against text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
中文: 本文提出了一种可优化的水印设计框架,开发了HeavyWater和SimplexWater两种水印技术,能在保持文本质量的同时实现高检测精度,适用于各类语言模型和辅助信息生成方式。
English: This paper introduces an optimization framework for designing tunable LLM watermarks, HeavyWater and SimplexWater, which effectively balance detection accuracy and text quality while being applicable to any language model and side information generation method.
Authors:Ali Murad, Bo Hui, Wei-Shinn Ku
Abstract:
Federated Learning (FL) is a distributed framework for collaborative model training over large-scale distributed data, enabling higher performance while maintaining client data privacy. However, the nature of model aggregation at the centralized server can result in a performance drop in the presence of non-IID data across different clients. We remark that training a client locally on more data than necessary does not benefit the overall performance of all clients. In this paper, we devise a novel framework that leverages a Deep Reinforcement Learning (DRL) agent to select an optimized amount of data necessary to train a client model without oversharing information with the server. Starting without awareness of the client's performance, the DRL agent utilizes the change in training loss as a reward signal and learns to optimize the amount of training data necessary for improving the client's performance. Specifically, after each aggregation round, the DRL algorithm considers the local performance as the current state and outputs the optimized weights for each class, in the training data, to be used during the next round of local training. In doing so, the agent learns a policy that creates an optimized partition of the local training dataset during the FL rounds. After FL, the client utilizes the entire local training dataset to further enhance its performance on its own data distribution, mitigating the non-IID effects of aggregation. Through extensive experiments, we demonstrate that training FL clients through our algorithm results in superior performance on multiple benchmark datasets and FL frameworks. Our code is available at https://github.com/amuraddd/optimized_client_training.git.
中文: 联邦学习在非独立同分布数据下性能会下降,本文提出一种深度强化学习框架,通过优化本地训练数据量来提升客户端性能,同时保护数据隐私。
English: Federated Learning can suffer from performance degradation with non-IID data, but this paper introduces a Deep Reinforcement Learning framework that optimizes local data usage during training to enhance client performance while maintaining privacy.
Authors:Masoud Rahimi, Reza Karbasi, Abdol-Hossein Vahabie
Abstract:
We introduce an open-source Python framework for generating synthetic ECG image datasets to advance critical deep learning-based tasks in ECG analysis, including ECG digitization, lead region and lead name detection, and pixel-level waveform segmentation. Using the PTB-XL signal dataset, our proposed framework produces four open-access datasets: (1) ECG images in various lead configurations paired with time-series signals for ECG digitization, (2) ECG images annotated with YOLO-format bounding boxes for detection of lead region and lead name, (3)-(4) cropped single-lead images with segmentation masks compatible with U-Net-based models in normal and overlapping versions. In the overlapping case, waveforms from neighboring leads are superimposed onto the target lead image, while the segmentation masks remain clean. The open-source Python framework and datasets are publicly available at https://github.com/rezakarbasi/ecg-image-and-signal-dataset and https://doi.org/10.5281/zenodo.15484519, respectively.
中文: 该开源Python框架基于PTB-XL信号生成合成心电图图像数据集,支持心电图数字化、导联区域识别和波形分割等深度学习任务,四类开放数据集可通过GitHub和Zenodo公开获取。
English: This open-source Python framework generates synthetic ECG image datasets from PTB-XL signals to support deep learning tasks like ECG digitization, lead detection, and waveform segmentation, with four publicly available datasets accessible via GitHub and Zenodo.
Authors:Xiaoyu Sun, Yang Yang, Xunde Dong
Abstract:
In automatic Electrocardiogram (ECG) diagnosis, the relatively limited amount of labeled data makes building a robust ECG pretrained model from unlabeled data a key focus for researchers. Recent advances in contrastive learning-based ECG pretrained models highlight the potential of exploiting the additional patient-level self-supervisory signals inherent in ECG; these approaches are referred to as patient contrastive learning. Their rationale is that multiple physical recordings from the same patient may share commonalities, termed patient consistency, so redefining positive and negative pairs in contrastive learning as intra-patient and inter-patient samples provides more shared context for learning an effective representation. However, these methods still fail to exploit patient consistency efficiently due to the insufficient number of intra- and inter-patient samples available in a batch. Hence, we propose a contrastive learning-based ECG pretrained model enhanced by a Patient Memory Queue (PMQ), which incorporates a large patient memory queue to mitigate the model degeneration that can arise from insufficient intra- and inter-patient samples. To further enhance the performance of the pretrained model, we introduce two extra data augmentation methods that provide more perspectives on positive and negative pairs during pretraining. Extensive experiments were conducted on three public datasets with three different data ratios. The results show that our method outperforms previous contrastive learning methods overall and exhibits greater robustness in scenarios with limited labeled data. The code is available at https://github.com/3hiuwoo/PMQ.
中文: 本研究提出了一种基于患者记忆队列增强的心电图对比学习预训练模型,通过解决批次内患者样本不足的问题提升了模型鲁棒性,在标注数据有限的情况下显著优于现有方法。
English: This study introduces a Patient Memory Queue-enhanced contrastive learning model for ECG diagnosis, which improves robustness by addressing insufficient intra-inter patient samples and outperforms previous methods in limited labeled data scenarios.
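A hedged PyTorch sketch of the patient-memory-queue bookkeeping: a large FIFO queue of past embeddings tagged with patient ids supplies intra-patient positives and inter-patient negatives beyond the current batch. The loss below is a generic supervised-contrastive form over the queue, not necessarily the paper's exact objective, and the queue size and temperature are placeholders.

import torch
import torch.nn.functional as F

class PatientQueue:
    def __init__(self, dim, size=8192):
        self.feats = F.normalize(torch.randn(size, dim), dim=1)
        self.pids = -torch.ones(size, dtype=torch.long)   # -1 marks unfilled slots
        self.ptr = 0

    def enqueue(self, feats, pids):
        n = feats.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.feats.size(0)
        self.feats[idx] = F.normalize(feats.detach(), dim=1)
        self.pids[idx] = pids
        self.ptr = int(idx[-1] + 1) % self.feats.size(0)

def patient_contrastive_loss(z, pids, queue, tau=0.1):
    z = F.normalize(z, dim=1)
    logits = z @ queue.feats.T / tau                      # (B, Q) similarities
    pos_mask = (pids[:, None] == queue.pids[None, :]).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood over intra-patient positives found in the queue.
    return -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)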
Authors:Junyi Liu, Stanley Kok
Abstract:
Agencies such as Standard & Poor's and Moody's provide bank credit ratings that influence economic stability and decision-making by stakeholders. Accurate and timely predictions support informed decision-making, regulatory actions, and investor protection. However, a complete interbank connection graph is often unavailable due to privacy concerns, complicating the direct application of Graph Neural Networks (GNNs) for rating prediction. Our research utilizes persistent homology to construct a network that captures relationships among banks and combines this with a traditional lending network to create a heterogeneous network that integrates information from both sources, leading to improved predictions. Experiments on a global, real-world dataset validate the effectiveness of the resulting model, HTGNN. This research has implications for investors and regulatory bodies in enhancing proactive risk mitigation and the implementation of effective market interventions. The code can be found at https://github.com/Liu-Jun-Yi/HTGNN.
Chinese: 本研究通过持久同调构建银行关系网络,并将其与借贷网络结合成异质结构,提升了银行信用评级预测的准确性,经真实数据验证有助于投资者和监管机构进行风险管理。
English: This study enhances bank credit rating predictions by using persistent homology to build a bank relationship network and combining it with a lending network into a heterogeneous structure, which is validated on real-world data to aid investors and regulators in risk management.
Authors:Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia
Abstract:
While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the \emph{Spectral Pairwise Embedding Comparison (SPEC)} framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The project page is available at https://mjalali.github.io/SPEC/.
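A hedged sketch of the core computation: form kernel matrices from two embeddings of the same reference samples, eigendecompose their difference, and read off the samples with the largest weight in the top eigenvectors as clusters the two embeddings capture differently. This naive version is quadratic in the sample size, unlike the scalable implementation described in the abstract, and the RBF kernel and bandwidth are illustrative choices.

import numpy as np

def rbf_kernel(X, gamma=None):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    gamma = gamma if gamma is not None else 1.0 / X.shape[1]
    return np.exp(-gamma * sq)

def spec_mismatch(X1, X2, top=3, per_cluster=10):
    D = rbf_kernel(X1) - rbf_kernel(X2)            # difference kernel matrix
    vals, vecs = np.linalg.eigh(D)                 # symmetric eigendecomposition
    order = np.argsort(-np.abs(vals))[:top]        # most-mismatched spectral directions
    # For each direction, return the indices of the samples that dominate it.
    return [np.argsort(-np.abs(vecs[:, j]))[:per_cluster] for j in order]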
Authors:Dimitrios Proios, Alban Bornet, Anthony Yazdani, Jose F Rodrigues, Douglas Teodoro
Abstract:
Patient stratification, identifying clinically meaningful subgroups, is essential for advancing personalized medicine through improved diagnostics and treatment strategies. Electronic health records (EHRs), particularly those from intensive care units (ICUs), contain rich temporal clinical data that can be leveraged for this purpose. In this work, we introduce ICU-TSB (Temporal Stratification Benchmark), the first comprehensive benchmark for evaluating patient stratification based on temporal patient representation learning using three publicly available ICU EHR datasets. A key contribution of our benchmark is a novel hierarchical evaluation framework utilizing disease taxonomies to measure the alignment of discovered clusters with clinically validated disease groupings. In our experiments with ICU-TSB, we compared statistical methods and several recurrent neural networks, including LSTM and GRU, for their ability to generate effective patient representations for subsequent clustering of patient trajectories. Our results demonstrate that temporal representation learning can rediscover clinically meaningful patient cohorts; nevertheless, it remains a challenging task, with V-measure scores reaching up to 0.46 at the top level of the taxonomy and up to 0.40 at the lowest level. To further enhance the practical utility of our findings, we also evaluate multiple strategies for assigning interpretable labels to the identified clusters. The experiments and benchmark are fully reproducible and available at https://github.com/ds4dh/CBMS2025stratification.
中文: ICU-TSB是首个基于ICU电子健康记录时序表征学习的患者分层综合基准,其分层评估框架表明时序模型能识别临床相关患者群组,但在聚类准确性方面仍存在挑战。
English: ICU-TSB is the first comprehensive benchmark for evaluating patient stratification using temporal representation learning from ICU EHR data, featuring a hierarchical evaluation framework that demonstrates temporal models can identify clinically meaningful patient cohorts while revealing persistent challenges in clustering accuracy.
Authors:Wenyuan Li, Shunlin Liang, Yuxiang Zhang, Liqin Liu, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Zhenwei Shi
Abstract:
Fine-grained crop type classification serves as the fundamental basis for large-scale crop mapping and plays a vital role in ensuring food security. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities currently remains scarce due to challenges in hyperspectral data acquisition and the cost of crop type annotation. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification head with hierarchical fusion then simultaneously delivers multi-level crop type classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-score improvement (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Code and dataset are available at https://github.com/flyakon/H2Crop.
中文: 本研究提出了H2Crop分层高光谱作物数据集及双流Transformer模型,通过协同处理高光谱与多时相卫星数据,显著提升了细粒度作物分类精度,相关代码与数据已开源。
English: This study introduces H2Crop, a hierarchical hyperspectral crop dataset, and a dual-stream Transformer model that synergistically combines hyperspectral and multi-temporal satellite data to significantly improve fine-grained crop classification accuracy, with codes and data publicly available.
Authors:Xudong Zhang, Renato Cordeiro de Amorim
Abstract:
Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted $k$-means++, a novel initialisation strategy for the Minkowski Weighted $k$-means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical guarantee under mild assumptions and extensive experiments showing that our methods consistently outperform existing alternatives. Our software can be found at https://github.com/xzhang4-ops1/FSMWK.
中文: 本文提出了Minkowski加权$k$-means++初始化方法及两种基于特征相关性的特征选择算法,通过理论保证和实验验证了其在提升高维数据聚类性能上的显著优势。
English: This paper introduces Minkowski weighted $k$-means++ initialization and two feature selection algorithms that leverage feature relevance to enhance clustering in high-dimensional data, supported by theoretical guarantees and superior experimental results.
Authors:Cheng-Long Wang, Qi Li, Zihang Xiang, Yinzhi Cao, Di Wang
Abstract:
Growing concerns over data privacy and security highlight the importance of machine unlearning--removing specific data influences from trained models without full retraining. Techniques like Membership Inference Attacks (MIAs) are widely used to externally assess successful unlearning. However, existing methods face two key limitations: (1) maximizing MIA effectiveness (e.g., via online attacks) requires prohibitive computational resources, often exceeding retraining costs; (2) MIAs, designed for binary inclusion tests, struggle to capture granular changes in approximate unlearning. To address these challenges, we propose the Interpolated Approximate Measurement (IAM), a framework natively designed for unlearning inference. IAM quantifies sample-level unlearning completeness by interpolating the model's generalization-fitting behavior gap on queried samples. IAM achieves strong performance in binary inclusion tests for exact unlearning and high correlation for approximate unlearning--scalable to LLMs using just one pre-trained shadow model. We theoretically analyze how IAM's scoring mechanism maintains performance efficiently. We then apply IAM to recent approximate unlearning algorithms, revealing general risks of both over-unlearning and under-unlearning, underscoring the need for stronger safeguards in approximate unlearning systems. The code is available at https://github.com/Happy2Git/Unlearning_Inference_IAM.
中文摘要:提出的插值近似测量(IAM)框架通过泛化拟合差距量化样本级遗忘完整性,有效评估机器遗忘效果,克服了现有方法的计算和粒度限制,并揭示了近似遗忘系统中的普遍风险。
English Summary: The proposed Interpolated Approximate Measurement (IAM) framework efficiently evaluates machine unlearning by quantifying sample-level completeness through generalization-fitting gaps, overcoming computational and granular limitations of existing methods while revealing risks in approximate unlearning systems.
Authors:Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, Robert Tjarko Lange
Abstract:
While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyperparameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting large language models (LLMs) on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements.
Our code is available at https://github.com/SakanaAI/text-to-lora
中文: 基础模型通常需要针对特定任务进行微调,这一过程成本高昂且对设置敏感,而Text-to-LoRA (T2L) 通过自然语言描述实现即时适配,能以极低计算量生成高效的LoRA适配器。
English: Foundation models often need task-specific fine-tuning, which is costly and sensitive to settings, but Text-to-LoRA (T2L) enables on-the-fly adaptation using natural language descriptions, generating effective LoRA adapters with minimal computational effort.
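A hedged PyTorch sketch of the hypernetwork idea: a small network maps a task-description embedding to the low-rank A/B factors of a LoRA adapter in a single forward pass. The dimensions, the single-adapter output, and the (omitted) text encoder are placeholders rather than T2L's architecture.

import torch
import torch.nn as nn

class TextToLoRA(nn.Module):
    def __init__(self, text_dim=768, hidden=512, d_model=4096, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        self.trunk = nn.Sequential(nn.Linear(text_dim, hidden), nn.GELU())
        self.to_A = nn.Linear(hidden, rank * d_model)
        self.to_B = nn.Linear(hidden, d_model * rank)

    def forward(self, task_emb):                              # (B, text_dim)
        h = self.trunk(task_emb)
        A = self.to_A(h).view(-1, self.rank, self.d_model)    # down-projection factor
        B = self.to_B(h).view(-1, self.d_model, self.rank)    # up-projection factor
        return A, B                                           # weight update: B @ A

# Usage sketch: A, B = TextToLoRA()(torch.randn(1, 768)); delta_W = B[0] @ A[0]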
Authors:Maor Ashkenazi, Ofir Brenner, Tal Furman Shohet, Eran Treister
Abstract:
Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at https://github.com/maorash/ATC, including the dataset gathering implementation, to foster further research in this area.
中文: 本研究提出了一种新颖的零样本检测方法,通过评估近似任务条件下的标记熵来区分LLM生成的代码与人工编写的代码,该方法无需访问生成模型或原始提示,就在多种编程语言中实现了最优性能。
English: This study introduces a novel zero-shot detection method that leverages task-level conditioning to distinguish LLM-generated code from human-written code by evaluating token entropy under approximated task conditions, achieving state-of-the-art performance across multiple programming languages without requiring access to the generator model or original prompts.
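A hedged sketch of task-conditioned entropy scoring: given an (approximated) task prompt and a code snippet, compute the mean token-level entropy of the code under that conditioning; lower entropy then points toward LLM-generated code. The gpt2 checkpoint is a stand-in for a scoring LLM, the task-approximation step and the decision threshold are omitted, and this is not the paper's exact procedure.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def conditioned_token_entropy(task_prompt, code, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name).eval()
    prompt_ids = tok(task_prompt, return_tensors="pt").input_ids
    code_ids = tok(code, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, code_ids], dim=1)
    with torch.no_grad():
        # Positions that predict the code tokens, conditioned on the task prompt.
        logits = lm(ids).logits[0, prompt_ids.size(1) - 1 : -1]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
    return entropy.mean().item()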
Authors:Taoran Yue, Xiaojin Lu, Jiaxi Cai, Yuanping Chen, Shibing Chu
Abstract:
Current CNN-based infrared small target detection (IRSTD) methods generally overlook the heterogeneity between shallow and deep features, leading to inefficient collaboration between shallow fine-grained structural information and deep high-level semantic representations. Additionally, the dependency relationships and fusion mechanisms across different feature hierarchies lack systematic modeling, which fails to fully exploit the complementarity of multilevel features. These limitations hinder IRSTD performance while incurring substantial computational costs. To address these challenges, this paper proposes a shallow-deep synergistic detection network (SDS-Net) that efficiently models multilevel feature representations to increase both detection accuracy and computational efficiency in IRSTD tasks. SDS-Net introduces a dual-branch architecture that separately models the structural characteristics and semantic properties of features, effectively preserving shallow spatial details while capturing deep semantic representations, thereby achieving high-precision detection with significantly improved inference speed. Furthermore, the network incorporates an adaptive feature fusion module to dynamically model cross-layer feature correlations, enhancing overall feature collaboration and representation capability. Comprehensive experiments on three public datasets (NUAA-SIRST, NUDT-SIRST, and IRSTD-1K) demonstrate that SDS-Net outperforms state-of-the-art IRSTD methods while maintaining low computational complexity and high inference efficiency, showing superior detection performance and broad application prospects. Our code will be made public at https://github.com/PhysiLearn/SDS-Net.
中文: 本文提出的SDS-Net通过双分支结构和自适应特征融合模块,有效协同浅层与深层特征,在提升红外小目标检测精度的同时显著提高了计算效率。
English: This paper introduces SDS-Net, a dual-branch network that effectively models shallow and deep features through adaptive fusion, enhancing infrared small target detection accuracy and efficiency while reducing computational costs.
Authors:Joscha Diehl, Rasheed Ibraheem, Leonard Schmitz, Yue Wu
Abstract:
Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of ``corner trees'' from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1\%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3\% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at https://github.com/diehlj/fast-iterated-sums.
中文摘要:本文提出了一种基于排列计数中“角树”数学工具的新型张量到张量线性计算层,该层既是状态空间模型的高阶推广,也是迭代积分的多参数扩展,在保持与大型网络相近精度的同时显著减少了参数数量和计算量。
English Summary: This paper introduces a novel tensor-to-tensor layer with linear computational cost using "corner trees" from permutation counting, which serves as both a higher-order generalization of state-space models and a multiparameter extension of iterated integrals, achieving comparable accuracy to larger networks while reducing parameters and computational operations.
Authors:Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu
Abstract:
Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected to the input sample is key to the success of DBP methods, which is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that an optimal $t^*$ for each sample indeed could be different. Intuitively, the cleaner a sample is, the less the noise it should be injected, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: https://github.com/tmlr-group/SSNI.
中文: 本文提出SSNI框架,通过预训练评分网络评估样本偏离干净数据分布的程度,自适应调整每个样本的噪声注入水平,在CIFAR-10和ImageNet-1K数据集上显著提升了基于扩散的净化方法的准确性和鲁棒性。
English: The paper introduces SSNI, a framework that adaptively adjusts noise injection levels in diffusion-based purification methods based on each sample's deviation from the clean data distribution, improving accuracy and robustness on benchmark datasets.
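A hedged sketch of sample-specific noise injection: use the norm of a pretrained score network's output as a proxy for how far each sample sits from the clean data distribution, and map that norm to a per-sample diffusion timestep around a base value. The score_net call signature and the simple linear reweighting are assumptions, not the paper's reweighting function.

import torch

def sample_specific_t(x, score_net, t_base, t_min, t_max, sigma_base=0.1):
    with torch.no_grad():
        # Assumed interface: score_net(x, noise_level) returns the estimated score.
        s = score_net(x, torch.full((x.size(0),), sigma_base, device=x.device))
    norms = s.flatten(1).norm(dim=1)
    z = (norms - norms.mean()) / (norms.std() + 1e-8)   # relative deviation within batch
    t = t_base + z * (t_max - t_min) / 4                # noisier samples get a larger t
    return t.clamp(t_min, t_max).round().long()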
Authors:Shilong Tao, Zhe Feng, Haonan Sun, Zhanxing Zhu, Yunhuai Liu
Abstract:
Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and an adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is better suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks. Code is available at https://github.com/therontau0054/Unisoma.
中文: 本文提出Unisoma,一种基于Transformer的显式建模方法,通过结构化模块精确捕捉多固体系统中的物理相互作用,在多个数据集和复杂任务中均实现了最先进的性能。
English: This paper introduces Unisoma, an explicit Transformer-based model that accurately captures physical interactions in multi-solid systems through structured modules, achieving state-of-the-art performance across diverse datasets and tasks.
Authors:Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Abstract:
For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality. The code for this study is available at https://github.com/motokiomura/annealed-q-learning.
Chinese: 本研究提出了一种退火方法,在连续动作空间的行动者-评论家框架中从贝尔曼最优算子过渡到贝尔曼算子,既加速了学习过程又缓解了高估偏差,在各种任务中显著提升了性能和鲁棒性。
English: This study introduces an annealing approach that transitions from the Bellman optimality operator to the Bellman operator in actor-critic frameworks for continuous action spaces, accelerating learning while mitigating overestimation bias and significantly improving performance and robustness in various tasks.
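A hedged sketch of an annealed TD target: blend an approximate Bellman optimality backup (here a max over a few perturbed policy actions, since an exact max over continuous actions is intractable) with the standard Bellman backup under the current policy, and decay the mixing weight over training. critic_t and actor_t denote generic target networks; the perturbation scale, sample count, and linear schedule are placeholders, not the paper's exact scheme.

import torch

def annealed_td_target(r, s_next, done, critic_t, actor_t, step, total_steps,
                       gamma=0.99, n_samples=10):
    lam = max(0.0, 1.0 - step / total_steps)     # 1: optimality operator, 0: Bellman operator
    a_pi = actor_t(s_next)                       # target-policy action
    q_pi = critic_t(s_next, a_pi)
    # Crude approximation of max_a Q(s', a): best of several perturbed policy actions.
    cand = a_pi.unsqueeze(1) + 0.3 * torch.randn(
        a_pi.size(0), n_samples, a_pi.size(-1), device=a_pi.device)
    q_cand = torch.stack([critic_t(s_next, cand[:, i]) for i in range(n_samples)], dim=1)
    q_max = q_cand.max(dim=1).values
    q_target = lam * q_max + (1.0 - lam) * q_pi
    return r + gamma * (1.0 - done) * q_target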
Authors:Tianjun Yao, Haoxuan Li, Yongqiang Chen, Tongliang Liu, Le Song, Eric Xing, Zhiqiang Shen
Abstract:
Graph Neural Networks (GNNs) often encounter significant performance degradation under distribution shifts between training and test data, hindering their applicability in real-world scenarios. Recent studies have proposed various methods to address the out-of-distribution generalization challenge, with many methods in the graph domain focusing on directly identifying an invariant subgraph that is predictive of the target label. However, we argue that identifying the edges of the invariant subgraph directly is challenging and error-prone, especially when some spurious edges exhibit strong correlations with the targets. In this paper, we propose PrunE, the first pruning-based graph OOD method that eliminates spurious edges to improve OOD generalizability. By pruning spurious edges, PrunE retains the invariant subgraph more comprehensively, which is critical for OOD generalization. Specifically, PrunE employs two regularization terms to prune spurious edges: 1) a graph size constraint to exclude uninformative spurious edges, and 2) $\epsilon$-probability alignment to further suppress the occurrence of spurious edges. Through theoretical analysis and extensive experiments, we show that PrunE achieves superior OOD performance and outperforms previous state-of-the-art methods significantly. Code is available at: https://github.com/tianyao-aka/PrunE-GraphOOD.
Chinese: 本文提出PrunE方法,通过图大小约束和ε概率对齐剪除虚假边,提升图神经网络在分布外场景下的泛化能力,实验证明其显著优于现有最优方法。
English: The paper introduces PrunE, a novel pruning-based method that enhances out-of-distribution generalization in Graph Neural Networks by eliminating spurious edges through graph size constraints and ε-probability alignment, achieving superior performance over existing approaches.
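Below is a loose Python sketch of the two kinds of penalties named above, a graph-size constraint and an ε-probability alignment term, applied to learnable edge-retention scores. The sigmoid parameterization, the mean-probability size term, and aligning the lowest-scored edges toward ε are assumptions made for illustration; the paper's exact formulations differ in detail.

```python
import torch

def prune_regularizers(edge_logits, eps=0.1, prune_frac=0.3):
    """Illustrative PrunE-style penalties on learned edge scores.
    edge_logits: tensor of shape [num_edges] (hypothetical parameterization)."""
    p = torch.sigmoid(edge_logits)                   # soft edge-retention probabilities
    size_penalty = p.mean()                          # graph-size constraint: discourage keeping every edge
    k = max(1, int(prune_frac * p.numel()))
    low_p = torch.topk(p, k, largest=False).values   # edges most likely to be spurious
    align_penalty = ((low_p - eps) ** 2).mean()      # push them toward a small probability eps
    return size_penalty, align_penalty

edge_logits = torch.randn(100, requires_grad=True)
size_pen, align_pen = prune_regularizers(edge_logits)
(size_pen + align_pen).backward()                    # added to the task loss during training
```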
Authors:Adrien Petralia, Paul Boniol, Philippe Charpentier, Themis Palpanas
Abstract:
Improving smart grid system management is crucial in the fight against climate change, and enabling consumers to play an active role in this effort is a significant challenge for electricity suppliers. In this regard, millions of smart meters have been deployed worldwide in the last decade, recording the main electricity power consumed in individual households. This data produces valuable information that can help them reduce their electricity footprint; nevertheless, the collected signal aggregates the consumption of the different appliances running simultaneously in the house, making it difficult to interpret. Non-Intrusive Load Monitoring (NILM) refers to the challenge of estimating the power consumption, pattern, or on/off state activation of individual appliances using the main smart meter signal. Recent methods proposed to tackle this task are based on a fully supervised deep-learning approach that requires both the aggregate signal and the ground truth of individual appliance power. However, such labels are expensive to collect and extremely scarce in practice, as they require conducting intrusive surveys in households to monitor each appliance. In this paper, we introduce CamAL, a weakly supervised approach for appliance pattern localization that only requires information on the presence of an appliance in a household to be trained. CamAL combines an ensemble of deep-learning classifiers with an explainable classification method to localize appliance patterns. Our experimental evaluation, conducted on 4 real-world datasets, demonstrates that CamAL significantly outperforms existing weakly supervised baselines and that current SotA fully supervised NILM approaches require significantly more labels to reach CamAL's performance. The source code of our experiments is available at: https://github.com/adrienpetralia/CamAL. This paper appeared in ICDE 2025.
中文摘要:提升智能电网管理对应对气候变化至关重要,本文提出CamAL弱监督方法,仅需家电存在信息即可定位用电模式,在减少标注需求的同时显著优于现有技术。
English Summary: Improving smart grid management is vital for climate action, and this paper introduces CamAL, a weakly supervised method that localizes appliance usage patterns using only household presence data, outperforming existing approaches with fewer labels.
Authors:Zhixiong Zhuang, Hui-Po Wang, Maria-Irina Nicolae, Mario Fritz
Abstract:
Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model's data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.
Authors:Junpeng Lin, Tian Lan, Bo Zhang, Ke Lin, Dandan Miao, Huiru He, Jiantao Ye, Chen Zhang, Yan-fu Li
Abstract:
Forecasting non-stationary time series is a challenging task because their statistical properties often change over time, making it hard for deep models to generalize well. Instance-level normalization techniques can help address shifts in temporal distribution. However, most existing methods overlook the multi-component nature of time series, where different components exhibit distinct non-stationary behaviors. In this paper, we propose Wavelet-based Disentangled Adaptive Normalization (WDAN), a model-agnostic framework designed to address non-stationarity in time series forecasting. WDAN uses discrete wavelet transforms to break down the input into low-frequency trends and high-frequency fluctuations. It then applies tailored normalization strategies to each part. For trend components that exhibit strong non-stationarity, we apply first-order differencing to extract stable features used for predicting normalization parameters. Extensive experiments on multiple benchmarks demonstrate that WDAN consistently improves forecasting accuracy across various backbone models. Code is available at this repository: https://github.com/MonBG/WDAN.
中文: 本文提出WDAN框架,通过小波变换将时间序列分解为趋势和波动分量,并针对各分量应用定制化归一化策略,有效处理非平稳性以提升预测精度。
English: This paper introduces WDAN, a model-agnostic framework that uses wavelet transforms to decompose time series into trends and fluctuations, applying customized normalization to each component to enhance forecasting accuracy by addressing non-stationarity.
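As a rough illustration of the decompose-then-normalize idea, the sketch below splits one series with a single-level discrete wavelet transform (via PyWavelets) and normalizes the trend using statistics of its first-order differences. The choice of wavelet, the single decomposition level, and how normalization parameters are derived are assumptions; WDAN itself predicts these parameters with a learned module.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_disentangled_normalize(x, wavelet="db4"):
    """Hypothetical WDAN-style preprocessing for one 1-D series x."""
    trend, detail = pywt.dwt(x, wavelet)           # low-frequency trend, high-frequency detail
    # trend is strongly non-stationary: derive its scale from first-order
    # differences, which are more stable (simplifying assumption)
    trend_diff = np.diff(trend, prepend=trend[0])
    trend_norm = (trend - trend.mean()) / (trend_diff.std() + 1e-8)
    detail_norm = (detail - detail.mean()) / (detail.std() + 1e-8)
    return trend_norm, detail_norm

x = np.sin(np.linspace(0, 20, 256)) + 0.01 * np.arange(256)  # seasonality plus drift
trend_n, detail_n = wavelet_disentangled_normalize(x)
```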
Authors:Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenqiao Zhang, Haoyuan Li, Hao Jiang, Fengda Zhang, Qishan Chen, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Abstract:
We present Heartcare Suite, a multimodal comprehensive framework for fine-grained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation; (ii) Heartcare-Bench, a systematic and multi-dimensional benchmark designed to evaluate diagnostic intelligence and guide the optimization of Medical Multimodal Large Language Models (Med-MLLMs) in ECG scenarios; and (iii) HeartcareGPT with a tailored tokenizer, Bidirectional ECG Abstract Tokenization (Beat), which compresses raw multi-lead signals into semantically rich discrete tokens via dual-level vector quantization and a query-guided bidirectional diffusion mechanism. Built upon Heartcare-220K, HeartcareGPT achieves strong generalization and SoTA performance across multiple clinically meaningful tasks. Extensive experiments demonstrate that Heartcare Suite is highly effective in advancing ECG-specific multimodal understanding and evaluation. Our project is available at https://github.com/DCDmllm/Heartcare-Suite.
中文:Heartcare Suite 是一个用于精细心电图分析的多模态框架,包含全面数据集 Heartcare-220K、评估基准 Heartcare-Bench 以及采用定制分词器的 HeartcareGPT,在临床任务中实现了顶尖性能。
English: Heartcare Suite is a multimodal framework for detailed ECG analysis, featuring a comprehensive dataset (Heartcare-220K), an evaluation benchmark (Heartcare-Bench), and HeartcareGPT with specialized tokenization that achieves state-of-the-art performance in clinical tasks.
Authors:Yogesh Verma, Amauri H. Souza, Vikas Garg
Abstract:
The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (Persistence-informed Positional Encoding), which is provably more expressive than both PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning. Code is available at https://github.com/Aalto-QuML/PIPE.
中文摘要:研究表明位置编码和持续同调在图神经网络中表达能力相当,由此开发的PiPE新方法在多项任务中超越了二者的性能。
English Summary: The study demonstrates that positional encoding and persistent homology are equally expressive for graph neural networks, leading to the development of PiPE, a novel method that surpasses both in performance across various tasks.
Authors:Guang-Xing Li
Abstract:
The development of science has been transforming man's view towards nature for centuries. Observing structures and patterns is an effective approach to discovering regularities from data and a key step toward theory-building. With increasingly complex data being obtained, revealing regularities systematically has become a challenge. Correlation is one of the most commonly used and effective approaches to describing regularities in data, yet for complex patterns, spatial inhomogeneity and complexity can often undermine the correlations. We present an algorithm to derive maps representing the type and degree of correlations, by taking the two-fold symmetry of the correlation vector into full account using the Stokes parameter. The method allows for a spatially resolved view of the nature and strength of correlations between physical quantities. In the correlation view, a region can often be separated into different subregions with different types of correlations. Subregions correspond to physical regimes for physical systems, or climate zones for climate maps. The simplicity of the method makes it widely applicable to a variety of data, where the correlation-based approach makes the map particularly useful in revealing regularities in physical systems and the like. As a new and efficient approach to represent data, the method should facilitate the development of new computational approaches to regularity discovery.
中文: 该研究提出一种算法,利用斯托克斯参数生成反映复杂数据中相关类型与强度的图谱,能实现模式的空间解析和子区域划分,有助于在多个领域提升规律发现的效率。
English: The study introduces an algorithm that uses the Stokes parameter to generate maps depicting the type and strength of correlations in complex data, enabling spatial resolution of patterns and subregions for improved regularity discovery in various fields.
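One plausible reading of the construction, sketched below: within each local window, every paired deviation of the two quantities defines an orientation; the two-fold symmetry is handled by doubling the angle, and the window-averaged Stokes-like parameters give the strength and dominant type of the local correlation. The window size, the deviation-based angle, and the block layout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def correlation_stokes_map(A, B, win=8):
    """Hypothetical correlation map for two co-registered 2-D fields A and B."""
    Q = np.zeros((A.shape[0] // win, A.shape[1] // win))
    U = np.zeros_like(Q)
    for i in range(Q.shape[0]):
        for j in range(Q.shape[1]):
            a = A[i*win:(i+1)*win, j*win:(j+1)*win].ravel()
            b = B[i*win:(i+1)*win, j*win:(j+1)*win].ravel()
            theta = np.arctan2(b - b.mean(), a - a.mean())  # orientation of each (a, b) deviation
            Q[i, j] = np.mean(np.cos(2 * theta))            # two-fold symmetry via doubled angles
            U[i, j] = np.mean(np.sin(2 * theta))
    degree = np.hypot(Q, U)            # strength of the local correlation
    angle = 0.5 * np.arctan2(U, Q)     # dominant orientation (type of correlation)
    return degree, angle

A = np.random.default_rng(1).normal(size=(64, 64))
B = 0.8 * A + 0.2 * np.random.default_rng(2).normal(size=(64, 64))
degree, angle = correlation_stokes_map(A, B)
```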
Authors:Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Yifan Li, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Abstract:
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
中文摘要:本文提出了一种知识遗忘评估框架,通过知识图谱和大型语言模型作为评判者,更准确地评估大语言模型中已遗忘事实的隐性存留,发现现有方法高估了遗忘效果。
English Summary: This paper introduces a knowledge unlearning evaluation framework that uses knowledge graphs and LLM judges to more accurately assess the implicit persistence of forgotten facts in large language models, revealing that current methods overestimate unlearning effectiveness.
Authors:Fang Wu, Vijay Prakash Dwivedi, Jure Leskovec
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at https://github.com/smiles724/Rel-LLM.
Chinese: Rel-LLM提出了一种新颖架构,通过基于图神经网络的编码器在检索增强生成框架中生成结构化关系提示,既保留了数据库的内在关系结构,又使大语言模型能有效处理复杂实体关系,在关键关系深度学习任务上超越了现有方法。
English: Rel-LLM introduces a novel architecture that uses a GNN-based encoder to generate structured relational prompts within a RAG framework, preserving database structures and enabling LLMs to effectively reason over complex entity relationships, outperforming existing methods on key RDL tasks.
Authors:Dumindu Tissera, Omar Awadallah, Muhammad Umair Danish, Ayan Sadhu, Katarina Grolinger
Abstract:
Multi-label Classification (MLC) assigns an instance to one or more non-exclusive classes. A challenge arises when the dataset contains a large proportion of instances with no assigned class, referred to as negative data, which can overwhelm the learning process and hinder the accurate identification and classification of positive instances. Nevertheless, it is common in MLC applications such as industrial defect detection, agricultural disease identification, and healthcare diagnosis to encounter large amounts of negative data. Assigning a separate negative class to these instances further complicates the learning objective and introduces unnecessary redundancies. To address this challenge, we redesign standard MLC loss functions by deriving a likelihood of any class being present, formulated by a normalized weighted geometric mean of the predicted class probabilities. We introduce a regularization parameter that controls the relative contribution of the absent class probabilities to the any-class presence likelihood in positive instances. The any-class presence likelihood complements the multi-label learning by encouraging the network to become more aware of implicit positive instances and improve the label classification within those positive instances. Experiments on large-scale datasets with negative data: SewerML, modified COCO, and ChestX-ray14, across various networks and base loss functions show that our loss functions consistently improve MLC performance of their standard loss counterparts, achieving gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision, all without additional parameters or computational complexity. Code available at: https://github.com/ML-for-Sensor-Data-Western/gmean-mlc
中文摘要:本研究通过归一化加权几何均值重构多标签分类损失函数,有效处理负数据,在不增加参数或计算复杂度的情况下提升了分类性能。
English Summary: This study redesigns standard multi-label classification loss functions using a normalized weighted geometric mean to handle negative data, improving performance without extra parameters or computational cost.
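A small sketch of the any-class presence likelihood as described: a normalized weighted geometric mean of the predicted class probabilities, with a regularization parameter alpha controlling how much absent-class probabilities contribute. The specific weighting scheme (weight 1 for present classes, alpha for absent ones) is an assumption for illustration.

```python
import torch

def any_class_presence(probs, labels, alpha=0.2, eps=1e-7):
    """Normalized weighted geometric mean of per-class probabilities as a
    likelihood that *any* class is present (hypothetical weighting).
    probs, labels: tensors of shape [batch, num_classes]."""
    w = labels + alpha * (1 - labels)        # present classes weight 1, absent classes weight alpha
    log_p = torch.log(probs.clamp(min=eps))
    return torch.exp((w * log_p).sum(dim=1) / w.sum(dim=1))

probs = torch.sigmoid(torch.randn(4, 10))
labels = (torch.rand(4, 10) > 0.8).float()
p_any = any_class_presence(probs, labels)
# a complementary loss term could push p_any toward 1 for positive instances
# and toward 0 for negative (no-class) instances
```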
Authors:Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei
Abstract:
Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters' activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter's marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at https://github.com/zwebzone/coto.
中文: CoTo是一种渐进式训练策略,通过逐步提高适配器激活概率来增强LoRA,促进更均衡的优化和更广泛的损失空间探索,从而提升模型泛化能力和下游任务表现。
English: CoTo is a progressive training strategy that enhances LoRA by gradually increasing adapter activation probability, promoting balanced optimization and broader loss landscape exploration to improve generalization and downstream task performance.
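A compact sketch of the progressive activation idea, assuming a linear ramp of the activation probability and independent Bernoulli masking of each adapter per training step; the actual schedule and masking granularity used by CoTo may differ.

```python
import torch

def coto_adapter_mask(num_adapters, step, total_steps):
    """Hypothetical CoTo-style mask: each LoRA adapter stays active with a
    probability that grows from a small value to 1 over fine-tuning."""
    p = min(1.0, 0.25 + 0.75 * step / total_steps)   # assumed linear ramp
    return (torch.rand(num_adapters) < p).float()     # 1 = adapter active this step

# at each step, multiply every adapter's output by its mask entry
mask = coto_adapter_mask(num_adapters=24, step=1_000, total_steps=10_000)
```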
Authors:Keinichi Fujita, Shota Horiguchi, Yusuke Ijima
Abstract:
Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).
Authors:Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah
Abstract:
Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed \textbf{BAQ} (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56$\times$ lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All codes of this paper are available at https://github.com/CSU-ModelCompression/BAQ.
中文摘要:本文提出BAQ量化框架,通过基于Hessian的敏感度指标和凸优化实现最优位宽分配,在多种规模的大语言模型上相比GPTQ实现高达56倍的困惑度降低。
English Summary: This paper introduces BAQ, a novel quantization framework that optimally allocates bitwidths using Hessian-based sensitivity metrics and convex optimization, significantly outperforming GPTQ with up to 56× lower perplexity across various LLM sizes.
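To make the convex formulation and the equal-loss structure concrete, here is an illustrative allocation for the surrogate objective min Σ_i s_i 2^(-2 b_i) subject to a total bit budget, where s_i is a Hessian-proxy sensitivity. The surrogate loss and its unconstrained real-valued solution are simplifying assumptions, not BAQ's exact formulation.

```python
import numpy as np

def allocate_bits(sensitivity, total_bits):
    """Closed-form solution of min sum_i s_i * 2^(-2 b_i)  s.t.  sum_i b_i = B.
    At the optimum every term s_i * 2^(-2 b_i) is equal, echoing the
    'equal-loss structure' noted in the abstract (sketch only)."""
    s = np.asarray(sensitivity, dtype=float)
    geomean = np.exp(np.mean(np.log(s)))
    b = total_bits / s.size + 0.5 * np.log2(s / geomean)
    return b  # real-valued; a practical pipeline would round and clip these

s = np.array([1.0, 4.0, 0.25, 2.0])   # hypothetical Hessian-proxy sensitivities
bits = allocate_bits(s, total_bits=16)
print(bits, bits.sum())               # sums to the budget; more sensitive groups get more bits
```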
Authors:Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish
Abstract:
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area.
In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
中文:MMTU是一个包含超过3万个问题、涵盖25种专家级表格任务的大规模基准,旨在全面评估模型对真实表格的理解、推理和操作能力,结果显示即使顶级模型得分也仅约60%,表明仍有巨大改进空间。
English: MMTU is a large-scale benchmark with over 30,000 questions across 25 expert-level table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables, revealing significant room for improvement as even top models score only around 60%.
Authors:Vlastimil Martinek, Andrea Gariboldi, Dimosthenis Tzimotoudis, Aitor Alberdi Escudero, Edward Blake, David Cechak, Luke Cassar, Alessandro Balestrucci, Panagiotis Alexiou
Abstract:
The adoption of machine learning (ML) and deep learning methods has revolutionized molecular medicine by driving breakthroughs in genomics, transcriptomics, drug discovery, and biological systems modeling. The increasing quantity, multimodality, and heterogeneity of biological datasets demand automated methods that can produce generalizable predictive models. Recent developments in large language model-based agents have shown promise for automating end-to-end ML experimentation on structured benchmarks. However, when applied to heterogeneous computational biology datasets, these methods struggle with generalization and success rates. Here, we introduce Agentomics-ML, a fully autonomous agent-based system designed to produce a classification model and the necessary files for reproducible training and inference. Our method follows predefined steps of an ML experimentation process, repeatedly interacting with the file system through Bash to complete individual steps. Once an ML model is produced, training and validation metrics provide scalar feedback to a reflection step to identify issues such as overfitting. This step then creates verbal feedback for future iterations, suggesting adjustments to steps such as data representation, model architecture, and hyperparameter choices. We have evaluated Agentomics-ML on several established genomic and transcriptomic benchmark datasets and show that it outperforms existing state-of-the-art agent-based methods in both generalization and success rates. While state-of-the-art models built by domain experts still lead in absolute performance on the majority of the computational biology datasets used in this work, Agentomics-ML narrows the gap for fully autonomous systems and achieves state-of-the-art performance on one of the used benchmark datasets. The code is available at https://github.com/BioGeMT/Agentomics-ML.
中文: Agentomics-ML是一种基于智能体的全自动系统,通过自动化机器学习实验流程在计算生物学数据集上实现了更好的泛化能力和成功率,显著缩小了与专家构建模型之间的性能差距。
English: Agentomics-ML is an autonomous agent-based system that automates end-to-end machine learning experimentation for computational biology datasets, outperforming existing agent methods in generalization and success rates while narrowing the performance gap with expert-built models.
Authors:Boyuan Deng, Luca Rossini, Jin Wang, Weijie Wang, Nikolaos Tsagarakis
Abstract:
Adaptive recovery from fall incidents is an essential skill for the practical deployment of wheeled-legged robots, which uniquely combine the agility of legs with the speed of wheels for rapid recovery. However, traditional methods relying on preplanned recovery motions, simplified dynamics, or sparse rewards often fail to produce robust recovery policies. This paper presents a learning-based framework integrating Episode-based Dynamic Reward Shaping and curriculum learning, which dynamically balances exploration of diverse recovery maneuvers with precise posture refinement. An asymmetric actor-critic architecture accelerates training by leveraging privileged information in simulation, while noise-injected observations enhance robustness against uncertainties. We further demonstrate that synergistic wheel-leg coordination reduces joint torque consumption by 15.8% and 26.2% and improves stabilization through energy transfer mechanisms. Extensive evaluations on two distinct quadruped platforms achieve recovery success rates up to 99.1% and 97.8% without platform-specific tuning. The supplementary material is available at https://boyuandeng.github.io/L2R-WheelLegCoordination/
Authors:Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Milind Naphade, Sambit Sahu, Irina Rish, Ekaterina Lobacheva
Abstract:
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
中文摘要:本研究发现语言模型扩展通过降低损失减速发生的临界点并改善减速后的学习效率,来缓解早期训练中的损失减速现象,该现象源于零和学习机制中的梯度相互抵消问题。
English Summary: This study reveals that language model scaling mitigates early training loss deceleration by reducing its onset loss and improving post-deceleration learning rates, attributing this phenomenon to zero-sum learning dynamics where gradient interference limits progress.
Authors:Kimberley M. Bird, Xujiong Ye, Alan M. Race, James M. Brown
Abstract:
Registration of histological and mass spectrometry imaging (MSI) allows for more precise identification of structural changes and chemical interactions in tissue. With histology and MSI having entirely different image formation processes and dimensionalities, registration of the two modalities remains an ongoing challenge. This work proposes a solution that synthesises histological images from MSI, using a pix2pix model, to effectively enable unimodal registration. Preliminary results show promising synthetic histology images with limited artifacts, achieving increases in mutual information (MI) and structural similarity index measures (SSIM) of +0.924 and +0.419, respectively, compared to a baseline U-Net model. Our source code is available on GitHub: https://github.com/kimberley/MIUA2025.
中文: 本研究提出了一种基于pix2pix的方法,可从质谱成像数据合成组织学图像,实现有效的单模态配准,并在互信息和结构相似性指标上展现出优于基准模型的性能提升。
English: This study introduces a pix2pix-based method to synthesize histological images from mass spectrometry imaging, enabling effective unimodal registration and demonstrating improved mutual information and structural similarity over baseline models.
Authors:Zikang Liu, Tongtian Yue, Yepeng Tang, Longteng Guo, Junxian Cai, Qingbin Liu, Xi Chen, Jing Liu
Abstract:
Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO introduces substantial computational overhead when processing long shared prefixes, which must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once, while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Empirically, our experiments confirm that Prefix Grouper achieves consistent results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be seamlessly integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is now available at https://github.com/johncaged/PrefixGrouper
中文摘要:Prefix Grouper是一种高效训练算法,通过重构自注意力机制消除GRPO中的冗余前缀计算,在保持性能不变的同时显著降低计算成本。
English Summary: Prefix Grouper is an efficient training algorithm that eliminates redundant prefix computations in Group Relative Policy Optimization (GRPO) by restructuring self-attention, maintaining identical performance while significantly reducing computational costs.
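A conceptual sketch of a shared-prefix forward for a single attention head: the prefix keys and values are computed once and broadcast across the group, while suffix queries attend to the full prefix plus a causal window over their own tokens. The tensor shapes, the explicit boolean mask, and the absence of multi-head and batching logic are simplifications; this is not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def shared_prefix_attention(q_suf, k_pre, v_pre, k_suf, v_suf):
    """Assumed shared-prefix attention for one head.
    q_suf, k_suf, v_suf: [G, S, d] per group member; k_pre, v_pre: [1, P, d]."""
    G, S, _ = q_suf.shape
    P = k_pre.shape[1]
    k = torch.cat([k_pre.expand(G, -1, -1), k_suf], dim=1)   # prefix K/V reused, not recomputed
    v = torch.cat([v_pre.expand(G, -1, -1), v_suf], dim=1)
    # suffix query i may attend to the full prefix and to suffix positions <= i
    causal = torch.tril(torch.ones(S, S, dtype=torch.bool))
    mask = torch.cat([torch.ones(S, P, dtype=torch.bool), causal], dim=1)
    return F.scaled_dot_product_attention(q_suf, k, v, attn_mask=mask)

q = torch.randn(4, 16, 64)        # 4 group members, 16 suffix tokens each
k_pre = torch.randn(1, 512, 64)   # long shared prefix, encoded once
out = shared_prefix_attention(q, k_pre, torch.randn(1, 512, 64),
                              torch.randn(4, 16, 64), torch.randn(4, 16, 64))
```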
Authors:Zishan Shu, Yufan Deng, Hongyu Zhang, Zhiwei Nie, Jie Chen
Abstract:
Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi-Grained Target Perception network (MTPNet) to incorporate the prior knowledge of interactions between the molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro-level Target Semantic (MTS) guidance and Micro-level Pocket Semantic (MPS) guidance. In this way, MTPNet dynamically optimizes molecular representations through multi-grained protein semantic conditions. To our knowledge, this is the first work to employ receptor proteins as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Codes are available at: https://github.com/ZishanShu/MTPNet.
Chinese: MTPNet通过多粒度蛋白质语义引导动态优化分子表征,提出了统一的活性悬崖预测框架,在30个代表性数据集上显著超越现有方法,平均预测精度提升18.95%。
English: MTPNet introduces a unified framework for activity cliff prediction by dynamically optimizing molecular representations through multi-grained protein semantic guidance, significantly outperforming existing methods and improving prediction accuracy by 18.95% on average across 30 datasets.
Authors:Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Abstract:
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose \textbf{T2MIR} (\textbf{T}oken- and \textbf{T}ask-wise \textbf{M}oE for \textbf{I}n-context \textbf{R}L), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
中文:T2MIR框架通过引入标记级和任务级混合专家机制到基于Transformer的决策模型中,有效解决了情境强化学习中的多模态和任务异质性挑战,并在实验中展现出优于各类基准方法的性能。
English: The T2MIR framework introduces token- and task-wise mixture-of-experts to transformer-based decision models, effectively addressing multi-modality and task heterogeneity in in-context reinforcement learning while demonstrating superior performance over existing methods.
Authors:Kyungsoo Kim, Jeongsoo Ha, Yusung Kim
Abstract:
Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at https://github.com/unigary/SPD.
中文:自预测动态(SPD)方法通过双向增强预测状态转换,能有效提取含干扰图像中的任务相关特征,在视觉控制任务中表现优异,并显著提升对未见观察的泛化能力。
English: The Self-Predictive Dynamics (SPD) method effectively extracts task-relevant features from images with distractions by using dual augmentations to predict transitions, demonstrating superior performance in visual control tasks and enhanced generalization for unseen observations.
Authors:Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman
Abstract:
3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.
Authors:Patrik Czakó, Gábor Kertész, Sándor Szénási
Abstract:
We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers, by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30\% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
中文: SmoothRot是一种新颖的训练后量化技术,通过将激活异常值转化为量化友好形式,显著提升大型语言模型的4位量化效率,在不增加延迟的情况下将性能差距缩小10-30%。
English: SmoothRot is a novel post-training quantization technique that enhances 4-bit quantization efficiency in Large Language Models by transforming activation outliers into quantization-friendly forms, significantly reducing performance gaps by 10-30% without added latency.
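A rough sketch of the two ingredients named above, channel-wise scaling and a Hadamard rotation, folded into a linear layer's weight so that, if activations are scaled and rotated correspondingly at runtime, the layer output is preserved while outliers are flattened. The scaling rule is SmoothQuant-style and the placement of the transforms is an assumption; SmoothRot's exact pipeline is not reproduced here.

```python
import numpy as np
from scipy.linalg import hadamard

def smooth_and_rotate(W, act_scale, alpha=0.5):
    """Fold channel-wise scaling and an orthonormal Hadamard rotation into a
    weight matrix. W: [out, in]; act_scale: per-input-channel activation
    magnitudes (d_in must be a power of two for the Hadamard matrix)."""
    d_in = W.shape[1]
    s = act_scale ** alpha / (np.abs(W).max(axis=0) ** (1 - alpha) + 1e-8)  # SmoothQuant-style scales
    H = hadamard(d_in) / np.sqrt(d_in)    # orthonormal Hadamard matrix
    # activations are divided by s and multiplied by H on the fly, so with
    # W_eq the layer output is mathematically unchanged before quantization
    W_eq = (W * s) @ H
    return W_eq, s, H

W = np.random.randn(256, 128)
act_scale = np.abs(np.random.randn(128)) * np.array(
    [10.0 if i % 32 == 0 else 1.0 for i in range(128)])   # a few outlier channels
W_eq, s, H = smooth_and_rotate(W, act_scale)
```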
Authors:Hondamunige Prasanna Silva, Federico Becattini, Lorenzo Seidenari
Abstract:
Foundation models represent the most prominent and recent paradigm shift in artificial intelligence. Foundation models are large models, trained on broad data, that deliver high accuracy in many downstream tasks, often without fine-tuning. For this reason, models such as CLIP, DINO, or Vision Transformers (ViTs) are becoming the bedrock of many industrial AI-powered applications. However, the reliance on pre-trained foundation models also introduces significant security concerns, as these models are vulnerable to adversarial attacks. Such attacks involve deliberately crafted inputs designed to deceive AI systems, jeopardizing their reliability. This paper studies the vulnerabilities of vision foundation models, focusing specifically on CLIP and ViTs, and explores the transferability of adversarial attacks to downstream tasks. We introduce a novel attack, targeting the structure of transformer-based architectures in a task-agnostic fashion. We demonstrate the effectiveness of our attack on several downstream tasks: classification, captioning, image/text retrieval, segmentation and depth estimation. Code available at: https://github.com/HondamunigePrasannaSilva/attack-attention
中文: 基础模型通过为多种任务提供通用且高精度的解决方案正在革新人工智能领域,然而它们对对抗性攻击的脆弱性带来了严重的安全隐患,本研究聚焦CLIP和视觉变换器,通过一种新型的针对变换器结构的攻击验证了这一点。
English: Foundation models are revolutionizing AI by providing versatile, high-accuracy solutions across various tasks, yet their vulnerability to adversarial attacks poses serious security risks, as demonstrated in this study focusing on CLIP and ViTs with a novel transformer-targeted attack.
Authors:Shenyang Huang, Ali Parviz, Emma Kondrup, Zachary Yang, Zifeng Ding, Michael Bronstein, Reihaneh Rabbany, Guillaume Rabusseau
Abstract:
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
中文: 本文提出的TGTalker创新框架使大语言模型能在真实时序图上实现具有竞争力的链接预测,并通过生成文本解释显著提升了模型的可解释性。
English: This paper introduces TGTalker, a novel framework that enables Large Language Models to perform competitive link prediction on real-world temporal graphs while generating textual explanations for enhanced interpretability.
Authors:Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Abstract:
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.
中文: 本研究推出了Search Arena这一大规模人类偏好数据集,用于评估搜索增强语言模型,发现用户偏好受引用数量和来源类型影响,交叉分析表明网络搜索能提升非搜索场景性能,但仅依赖参数知识会显著降低搜索密集型任务的质量。
English: This study introduces Search Arena, a large-scale human-preference dataset for evaluating search-augmented language models, revealing that user preferences are influenced by citation quantity and source type, while cross-analysis shows web search enhances performance in non-search settings but reliance on parametric knowledge alone degrades search-intensive results.
Authors:Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Abstract:
We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 points gains in low-cost regimes and over 5 points gains in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential and increasingly important with more computing invested, for realizing the full potential of test-time scaling where, unlike training, accuracy has yet to saturate as a function of computation, and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
中文摘要:本研究提出动力学缩放定律,揭示因忽略内存瓶颈而高估了小模型效能,并通过稀疏注意力机制降低单令牌成本、支持生成长文本,从而优化资源分配。
English Summary: This study introduces the Kinetics Scaling Law, which demonstrates that smaller models' effectiveness is overestimated due to overlooked memory bottlenecks, and proposes sparse attention to optimize resource allocation by reducing per-token costs and enabling longer generations.
Authors:Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, Gokul Swamy
Abstract:
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish -- giving them dense supervision across a narrow set of states -- rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR .
中文: 行为克隆(BC)仅能模仿专家在已演示状态下的行为,缺乏错误恢复能力,而我们提出的L2S方法通过习得世界模型和奖励模型,使智能体具备独立规划与纠错能力,在多项基准测试中显著优于BC方法。
English: Behavioral cloning (BC) is limited to imitating expert actions only within demonstrated states, lacking recovery strategies for errors, whereas our proposed L2S approach enables agents to plan and recover independently using learned world and reward models, significantly outperforming BC methods across multiple benchmarks.
Authors:Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Abstract:
Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn and reason about the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/moXmiller/counterfactual-reasoning.git.
中文: 研究表明大规模语言模型能够通过噪声溯因在线性回归任务中进行上下文反事实推理,其性能受自注意力机制、模型深度和预训练数据多样性影响,并显示出在反事实故事生成等领域的应用潜力。
English: This research demonstrates that large-scale language models can perform in-context counterfactual reasoning through noise abduction in linear regression tasks, with performance driven by self-attention mechanisms, model depth, and pre-training data diversity, showing potential for applications like counterfactual story generation.
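The synthetic setup lends itself to a tiny worked example, sketched below under assumed details: abduct the noise from a factual observation of a linear model, then reuse it to compute the outcome under a hypothetical change to the input.

```python
import numpy as np

# Factual world: y = w . x + eps (weights and noise scale are assumptions)
rng = np.random.default_rng(0)
w = rng.normal(size=5)
x_factual = rng.normal(size=5)
eps = 0.3 * rng.normal()
y_factual = w @ x_factual + eps

# Noise abduction: recover eps from the factual pair (w assumed known or
# inferred in-context, as in the paper's regression setup)
eps_hat = y_factual - w @ x_factual

# Counterfactual query: what would y have been had one feature been larger?
x_counterfactual = x_factual.copy()
x_counterfactual[0] += 1.0
y_counterfactual = w @ x_counterfactual + eps_hat
```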
Authors:Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang
Abstract:
Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce \textbf{TreeRPO}, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows TreeRPO to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our TreeRPO algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0\% to 35.5\%. Furthermore, TreeRPO significantly outperforms GRPO by 2.9\% in performance while simultaneously reducing the average response length by 18.1\%, showcasing its effectiveness and efficiency. Our code will be available at \href{https://github.com/yangzhch6/TreeRPO}{https://github.com/yangzhch6/TreeRPO}.
中文: 提出的TreeRPO方法通过树采样估计步骤级奖励来增强大语言模型的推理能力,在准确性和效率上相比现有方法均有显著提升。
English: The proposed TreeRPO method enhances LLM reasoning by estimating step-level rewards through tree sampling, significantly improving accuracy and efficiency over existing approaches.
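A minimal sketch of one ingredient, assuming that each tree-sampling step yields a group of sibling continuations whose values are the mean final rewards of the rollouts beneath them, and that the step-level advantage is the GRPO-style standardized value within that sibling group. This is one reading of the abstract, not the paper's exact estimator.

```python
import numpy as np

def step_advantages(sibling_values):
    """Group-relative advantage over the sibling group produced at one
    tree-sampling step (hypothetical step-level reward estimate)."""
    v = np.asarray(sibling_values, dtype=float)
    return (v - v.mean()) / (v.std() + 1e-8)

# hypothetical tree: at some reasoning step, 4 sibling continuations were
# sampled; each value is the mean verifiable reward of rollouts under it
print(step_advantages([0.0, 0.25, 0.75, 1.0]))
```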
Authors:Viet Nguyen, Changjian Shui, Vijay Giri, Siddarth Arya, Amol Verma, Fahad Razak, Rahul G. Krishnan
Abstract:
The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge, since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, achieving low false positive rates under non-deteriorating shifts and providing sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both a standard benchmark and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.
Chinese: 本文提出D3M监控算法,通过预测模型间的分歧来检测动态环境中机器学习模型的性能退化,在非退化偏移下保持低误报率,在退化偏移下实现高检测率,经实验验证可作为高风险机器学习流程的有效预警机制。
English: This paper introduces D3M, a monitoring algorithm that detects when machine learning models need retraining due to performance degradation in dynamic environments, effectively reducing false alarms while ensuring high detection rates for actual deterioration.
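A bare-bones sketch of disagreement-based monitoring, with an assumed thresholding rule on top of a deployment-time baseline disagreement rate; the actual D3M test statistic and its calibration are more involved.

```python
import numpy as np

def disagreement_alert(preds_a, preds_b, baseline_rate, tol=0.02):
    """Compare two predictive models on an unlabeled production batch and
    raise an alert when their disagreement exceeds the baseline rate observed
    at deployment time (hypothetical thresholding rule)."""
    rate = float(np.mean(np.asarray(preds_a) != np.asarray(preds_b)))
    return rate, rate > baseline_rate + tol

# preds_a / preds_b would come from two models trained for the same task
rate, alert = disagreement_alert([0, 1, 1, 0, 1], [0, 1, 0, 0, 0], baseline_rate=0.1)
```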
Authors:Nicolas Lell, Ansgar Scherp
Abstract:
Shallow node embeddings like node2vec (N2V) can be used for nodes without features or to supplement existing features with structure-based information. Embedding methods like N2V are limited in their application on new nodes, which restricts them to the transductive setting where the entire graph, including the test nodes, is available during training. We propose inductive node2vec (iN2V), which combines a post-hoc procedure to compute embeddings for nodes unseen during training and modifications to the original N2V training procedure to prepare the embeddings for this post-hoc procedure. We conduct experiments on several benchmark datasets and demonstrate that iN2V is an effective approach to bringing transductive embeddings to an inductive setting. Using iN2V embeddings improves node classification by 1 point on average, with up to 6 points of improvement depending on the dataset and the number of unseen nodes. Our iN2V is a plug-in approach to create new or enrich existing embeddings. It can also be combined with other embedding methods, making it a versatile approach for inductive node representation learning. Code to reproduce the results is available at https://github.com/Foisunt/iN2V .
Chinese: 提出的归纳式node2vec(iN2V)方法通过结合改进的训练和后处理程序,将直推式嵌入扩展到归纳式场景,使节点分类提升最高达6个百分点,并可作为现有嵌入技术的通用插件使用。
English: The proposed inductive node2vec (iN2V) method extends transductive embeddings to an inductive setting by combining modified training with a post-hoc procedure, improving node classification by up to 6 points and serving as a versatile plug-in for existing embedding techniques.
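As a simplified stand-in for the post-hoc procedure, the sketch below embeds a node unseen during training as the mean of its trained neighbors' N2V embeddings; the real iN2V procedure, including its training-time modifications, is not reproduced here.

```python
import numpy as np

def embed_unseen_node(neighbor_ids, trained_emb):
    """Post-hoc embedding for an unseen node, assumed here to be the mean of
    its trained neighbors' embeddings (illustrative simplification)."""
    return trained_emb[neighbor_ids].mean(axis=0)

trained_emb = np.random.randn(1000, 128)          # N2V embeddings of training nodes
new_node_emb = embed_unseen_node([3, 17, 256], trained_emb)
```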
Authors:Hyeongwon Jang, Changhun Kim, Eunho Yang
Abstract:
Recent explainable artificial intelligence (XAI) methods for time series primarily estimate point-wise attribution magnitudes, while overlooking the directional impact on predictions, leading to suboptimal identification of significant points. Our analysis shows that conventional Integrated Gradients (IG) effectively capture critical points with both positive and negative impacts on predictions. However, current evaluation metrics fail to assess this capability, as they inadvertently cancel out opposing feature contributions. To address this limitation, we propose novel evaluation metrics, Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation (CPP), to systematically assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, conventional IG outperforms recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as generated paths ignore temporal relationships and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines. Our code is available at https://github.com/drumpt/TIMING.
中文:近期时间序列可解释人工智能方法常忽略对预测的方向性影响,但传统积分梯度能有效识别关键点;然而现有评估指标存在缺陷,因此我们提出新指标证明其优越性,并引入具有时间感知能力的TIMING方法,在实验中超越现有基准。
English: Recent time series explainable AI methods often miss capturing directional impacts on predictions, but conventional Integrated Gradients (IG) effectively identify critical points; however, their evaluation is flawed, so we propose new metrics showing IG's superiority and introduce TIMING, a temporally-aware IG enhancement that outperforms existing methods.
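For reference, a plain Integrated Gradients implementation with a zero baseline and a straight-line path is sketched below; TIMING's temporally-aware path construction is not reproduced, and the toy model and shapes are illustrative.

```python
# Plain Integrated Gradients for a time-series classifier (zero baseline,
# Riemann-sum approximation over a straight-line path).
import torch

def integrated_gradients(model, x, baseline=None, target=0, steps=64):
    """x: (T, C) input; returns an attribution map of the same shape."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1)
    path = baseline + alphas * (x - baseline)          # (steps, T, C)
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]          # gradients along the path
    return (x - baseline) * grads.mean(dim=0)          # average gradient x input delta

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(24 * 3, 2))
x = torch.randn(24, 3)                                 # toy series: 24 steps, 3 channels
print(integrated_gradients(model, x, target=1).shape)
```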
Authors:Zeming Wei, Yiwen Guo, Yisen Wang
Abstract:
Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, and we offer theoretical evidence for this through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.
中文: 本文通过分析跨类别特征提出了对抗训练的新视角,揭示了这类特征在实现鲁棒性中的关键作用,并通过类别特征归因解释了鲁棒过拟合等现象。
English: This paper introduces a novel perspective on adversarial training by analyzing cross-class features, revealing their crucial role in achieving robustness and explaining phenomena like robust overfitting through class-wise feature attribution.
Authors:Kuang He, Wei Tang, Tong Wei, Min-Ling Zhang
Abstract:
Partial label learning (PLL) seeks to train generalizable classifiers from datasets with inexact supervision, a common challenge in real-world applications. Existing studies have developed numerous approaches to progressively refine and recover ground-truth labels by training convolutional neural networks. However, limited attention has been given to foundation models that offer transferable representations. In this work, we empirically conduct comprehensive evaluations of 11 foundation models across 13 PLL approaches on 8 benchmark datasets under 3 PLL scenarios. We further propose PartialCLIP, an efficient fine-tuning framework for foundation models in PLL. Our findings reveal that current PLL approaches tend to 1) achieve significant performance gains when using foundation models, 2) exhibit remarkably similar performance to each other, 3) maintain stable performance across varying ambiguity levels, while 4) are susceptible to foundation model selection and adaptation strategies. Additionally, we demonstrate the efficacy of text-embedding classifier initialization and effective candidate label filtering using zero-shot CLIP. Our experimental results and analysis underscore the limitations of current PLL approaches and provide valuable insights for developing more generalizable PLL models. The source code can be found at https://github.com/SEU-hk/PartialCLIP.
中文: 本研究评估了基础模型在偏标签学习中的应用,揭示了其显著的性能优势和局限性,并提出了PartialCLIP这一高效微调框架,以提升模型适应性和标签筛选能力。
English: This study evaluates foundation models in partial label learning, revealing their significant performance benefits and limitations, and introduces PartialCLIP, an efficient fine-tuning framework that enhances model adaptability and label filtering.
Authors:Jônata Tyska Carvalho, Stefano Nolfi
Abstract:
We propose a method that enables large language models (LLMs) to control embodied agents by directly mapping continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.
中文: 该方法通过迭代学习生成并优化控制策略,使大型语言模型能够操控具身智能体,并在Gymnasium和MuJoCo任务中通过GPT-oss:120b等模型验证了其有效性。
English: This method enables large language models to control embodied agents by generating and refining control strategies through iterative learning, validated on Gymnasium and MuJoCo tasks using models like GPT-oss:120b and Qwen2.5:72b.
Authors:Svetlana Pavlitska, Jamie Robb, Nikolai Polley, Melih Yazgan, J. Marius Zöllner
Abstract:
Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at https://github.com/KASTEL-MobilityLab/attacks-on-traffic-light-detection.
中文摘要:本研究展示了使用印刷补丁对交通信号灯检测系统实施现实对抗攻击的方法,在模拟和真实环境中均成功实现了针对性标签翻转和分类攻击。
English Summary: This study demonstrates realistic adversarial attacks on traffic light detection systems using printed patches, successfully achieving targeted label-flipping and classification attacks in both simulated and real-world environments.
Authors:Yusuke Matsui
Abstract:
Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During the filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrate that LotusFilter runs fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at https://github.com/matsui528/lotf.
Chinese: LotusFilter是一个后处理模块,通过预计算邻近向量表并贪婪地剔除候选结果中的冗余向量,有效提升近似最近邻搜索结果的多样性,在真实RAG应用中展现出高速处理性能。
English: LotusFilter is a post-processing module that efficiently diversifies approximate nearest neighbor search results by greedily removing redundant vectors using a precomputed cutoff table, operating at high speed in real-world RAG applications.
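The greedy table-lookup filtering can be sketched as below, assuming a naive dict-of-sets cutoff table built from pairwise distances; the released LotusFilter uses more compact data structures and tuned radii, so every name and constant here is illustrative.

```python
# Greedy diversification of ANN candidates with a precomputed cutoff table.
import numpy as np

def build_cutoff_table(vectors, radius):
    d = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
    return {i: set(np.flatnonzero((d[i] < radius) & (np.arange(len(vectors)) != i)))
            for i in range(len(vectors))}

def greedy_filter(candidate_ids, cutoff_table, k):
    selected, banned = [], set()
    for cid in candidate_ids:           # candidates assumed sorted by similarity to the query
        if cid in banned:
            continue
        selected.append(cid)
        banned |= cutoff_table[cid]     # drop vectors too close to an already selected one
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 8))
table = build_cutoff_table(vecs, radius=3.0)
print(greedy_filter(list(range(50)), table, k=5))
```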
Authors:Jiachen Tang, Zhonghao Wang, Sirui Chen, Sheng Zhou, Jiawei Chen, Jiajun Bu
Abstract:
Graph Transformers (GTs) have recently demonstrated remarkable performance across diverse domains. By leveraging attention mechanisms, GTs are capable of modeling long-range dependencies and complex structural relationships beyond local neighborhoods. However, their applicable scenarios are still underexplored, which highlights the need to identify when and why they excel. Furthermore, unlike GNNs, which predominantly rely on message-passing mechanisms, GTs exhibit a diverse design space in areas such as positional encoding, attention mechanisms, and graph-specific adaptations. Yet, it remains unclear which of these design choices are truly effective and under what conditions. As a result, the community currently lacks a comprehensive benchmark and library to promote a deeper understanding and further development of GTs. To address this gap, this paper introduces OpenGT, a comprehensive benchmark for Graph Transformers. OpenGT enables fair comparisons and multidimensional analysis by establishing standardized experimental settings and incorporating a broad selection of state-of-the-art GNNs and GTs. Our benchmark evaluates GTs from multiple perspectives, encompassing diverse tasks and datasets with varying properties. Through extensive experiments, our benchmark has uncovered several critical insights, including the difficulty of transferring models across task levels, the limitations of local attention, the efficiency trade-offs in several models, the application scenarios of specific positional encodings, and the preprocessing overhead of some positional encodings. We aspire for this work to establish a foundation for future graph transformer research emphasizing fairness, reproducibility, and generalizability. We have developed an easy-to-use library OpenGT for training and evaluating existing GTs. The benchmark code is available at https://github.com/eaglelab-zju/OpenGT.
中文: 图变换器在建模复杂依赖关系方面表现出色,但缺乏全面的基准测试,因此推出OpenGT以进行公平比较和多维分析,推动该领域研究发展。
English: Graph Transformers have shown strong performance in modeling complex dependencies but lack comprehensive benchmarks, prompting the introduction of OpenGT for fair comparisons and multidimensional analysis to advance research.
Authors:Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh
Abstract:
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from \emph{low GPU utilization}, \emph{significant latency overhead}, and a fundamental \emph{inability to leverage task locality}, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashDMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a \emph{single persistent GPU kernel}. FlashDMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashDMoE forgoes bulk-synchronous collectives in favor of one-sided, device-initiated, inter-GPU (R)DMA transfers, thus unlocking \emph{payload efficiency}, where we eliminate bloated or redundant network payloads in sparsely activated layers. When evaluated on a single 8-H100 GPU node with MoE models having up to 128 experts and 16K token sequences, FlashDMoE achieves up to \textbf{9}$\times$ higher GPU utilization, \textbf{6}$\times$ lower latency, \textbf{5.7}$\times$ higher throughput, and \textbf{4}$\times$ better overlap efficiency compared to state-of-the-art baselines, despite using FP32 while baselines use FP16. FlashDMoE demonstrates that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML workloads.
中文:FlashDMoE通过实现完全驻留GPU的算子,将计算和通信融合为单一内核,克服了现有混合专家模型的局限性,在GPU利用率、延迟和吞吐量方面实现了显著提升。
English: FlashDMoE overcomes the limitations of existing Mixture-of-Experts models by implementing a fully GPU-resident operator that fuses computation and communication into a single kernel, achieving significant improvements in GPU utilization, latency, and throughput.
Authors:Zhuoyun Zhong, Seyedali Golestaneh, Constantinos Chamzas
Abstract:
Planning with learned dynamics models offers a promising approach toward versatile real-world manipulation, particularly in nonprehensile settings such as pushing or rolling, where accurate analytical models are difficult to obtain. However, collecting training data for learning-based methods can be costly and inefficient, as it often relies on randomly sampled interactions that are not necessarily the most informative. Furthermore, learned models tend to exhibit high uncertainty in underexplored regions of the skill space, undermining the reliability of long-horizon planning. To address these challenges, we propose ActivePusher, a novel framework that combines residual-physics modeling with uncertainty-based active learning, to focus data acquisition on the most informative skill parameters. Additionally, ActivePusher seamlessly integrates with model-based kinodynamic planners, leveraging uncertainty estimates to bias control sampling toward more reliable actions. We evaluate our approach in both simulation and real-world environments, and demonstrate that it consistently improves data efficiency and achieves higher planning success rates in comparison to baseline methods. The source code is available at https://github.com/elpis-lab/ActivePusher.
Chinese: ActivePusher通过结合残差物理建模和基于不确定性的主动学习,提升了非抓取操作的数据效率和规划可靠性,在仿真和真实环境中均优于基线方法。
English: ActivePusher enhances nonprehensile manipulation by integrating residual-physics modeling with uncertainty-driven active learning, improving data efficiency and planning reliability in both simulated and real-world settings.
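The uncertainty-driven data acquisition step can be illustrated with a bootstrap ensemble whose predictive variance ranks candidate skill parameters; the regressor below stands in for ActivePusher's residual-physics model, so the names, scales, and toy data are hypothetical.

```python
# Pick the candidate skill parameters with the largest ensemble variance.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_seen = rng.uniform(-1, 1, size=(30, 2))                      # toy skill parameters
y_seen = np.sin(3 * X_seen[:, 0]) + 0.1 * rng.normal(size=30)  # toy push outcome

def bootstrap_ensemble(X, y, n_models=5):
    models = []
    for s in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(GradientBoostingRegressor(random_state=s).fit(X[idx], y[idx]))
    return models

def most_informative(models, candidates):
    preds = np.stack([m.predict(candidates) for m in models])   # (n_models, n_candidates)
    return candidates[np.argmax(preds.var(axis=0))]             # highest predictive spread

candidates = rng.uniform(-1, 1, size=(200, 2))
print(most_informative(bootstrap_ensemble(X_seen, y_seen), candidates))
```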
Authors:Li Liu, Heng Yong
Abstract:
Recently, machine learning methods have gained significant traction in scientific computing, particularly for solving Partial Differential Equations (PDEs). However, methods based on deep neural networks (DNNs) often lack convergence guarantees and computational efficiency compared to traditional numerical schemes. This work introduces DeePoly, a novel framework that transforms the solution paradigm from pure non-convex parameter optimization to a two-stage approach: first employing a DNN to capture complex global features, followed by linear space optimization with combined DNN-extracted features (Spotter) and polynomial basis functions (Sniper). This strategic combination leverages the complementary strengths of both methods -- DNNs excel at approximating complex global features (i.e., high-gradient features) and stabilize the polynomial approximation while polynomial bases provide high-precision local corrections with convergence guarantees. Theoretical analysis and numerical experiments demonstrate that this approach significantly enhances both high-order accuracy and efficiency across diverse problem types while maintaining mesh-free and scheme-free properties. This paper also serves as a theoretical exposition for the open-source project DeePoly.
中文摘要:DeePoly框架通过将深度神经网络用于全局特征提取与多项式基函数进行局部修正相结合,显著提升了求解偏微分方程的精度和效率,同时保持了理论收敛性和无网格特性。
English Summary: The DeePoly framework enhances PDE solving by combining deep neural networks for global feature approximation with polynomial basis functions for precise local corrections, achieving improved accuracy and efficiency while maintaining theoretical guarantees.
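A toy 1-D regression version of the two-stage idea: a small network captures the sharp global feature, then a linear least-squares solve over concatenated network features ("Spotter") and polynomial basis functions ("Sniper") supplies the correction. This is a sketch of the paradigm on a simple fitting problem, not the PDE solver itself.

```python
# Two-stage fit: (1) nonlinear DNN fit, (2) linear solve over [features | polynomials].
import numpy as np
import torch

x = np.linspace(-1, 1, 400)[:, None].astype(np.float32)
y = np.tanh(8 * x) + 0.05 * x**3                     # sharp feature + smooth part

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
xt, yt = torch.from_numpy(x), torch.from_numpy(y.astype(np.float32))
for _ in range(500):                                 # stage 1: capture the global shape
    opt.zero_grad()
    loss = torch.mean((net(xt) - yt) ** 2)
    loss.backward()
    opt.step()

with torch.no_grad():                                # stage 2: linear space optimization
    feats = net[1](net[0](xt)).numpy()               # DNN-extracted features ("Spotter")
poly = np.hstack([x**k for k in range(6)])           # polynomial basis ("Sniper")
A = np.hstack([feats, poly])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("max abs error:", float(np.abs(A @ coef - y).max()))
```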
Authors:Marianna Nezhurina, Tomer Porian, Giovanni Pucceti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
Abstract:
In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is preferable for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the way for systematic comparison and improvement of open foundation models and datasets for their creation. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
中文: 本研究展示了如何利用扩展定律比较模型与数据集,发现MaMMUT在多任务和数据集上比CLIP具有更优的扩展性和样本效率,同时开源了模型与代码以支持复现。
English: This study demonstrates how scaling laws can be used to compare models and datasets, revealing MaMMUT's superior scalability and sample efficiency over CLIP across multiple tasks and datasets, while providing open-source models and code for reproducibility.
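Model comparison via scaling laws reduces, at its core, to fitting a saturating power law to (scale, loss) measurements per procedure and comparing the fits and their extrapolations. The sketch below uses synthetic points and a generic form L(n) = E + a * n**(-b); the paper fits dense real measurements and different functional details may apply.

```python
# Fit L(n) = E + a * n**(-b) per procedure and compare extrapolated losses.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, E, a, b):
    # n is samples seen in units of 1e8 (normalized for better conditioning).
    return E + a * n ** (-b)

n = np.array([1.0, 3.0, 10.0, 30.0, 100.0, 300.0])                # 1e8 .. 3e10 samples
rng = np.random.default_rng(0)
loss_a = power_law(n, 1.8, 1.9, 0.28) + 0.01 * rng.normal(size=n.size)  # "procedure A"
loss_b = power_law(n, 1.7, 2.2, 0.31) + 0.01 * rng.normal(size=n.size)  # "procedure B"

for name, L in [("procedure A", loss_a), ("procedure B", loss_b)]:
    (E, a, b), _ = curve_fit(power_law, n, L, p0=(1.0, 1.0, 0.3), maxfev=20000)
    print(f"{name}: E={E:.2f}, b={b:.2f}, "
          f"predicted loss at 1e11 samples={power_law(1000.0, E, a, b):.3f}")
```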
Authors:Nikita Oskolkov, Huzhenyu Zhang, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov
Abstract:
The 3D scene graph models spatial relationships between objects, enabling the agent to efficiently navigate in a partially observable environment and predict the location of the target object. This paper proposes an original framework named SGN-CIRL (3D Scene Graph-Based Reinforcement Learning Navigation) for mapless reinforcement learning-based robot navigation with learnable representation of open-vocabulary 3D scene graph. To accelerate and stabilize the training of reinforcement learning-based algorithms, the framework also employs imitation learning and curriculum learning. The first one enables the agent to learn from demonstrations, while the second one structures the training process by gradually increasing task complexity from simple to more advanced scenarios. Numerical experiments conducted in the Isaac Sim environment showed that using a 3D scene graph for reinforcement learning significantly increased the success rate in difficult navigation cases. The code is open-sourced and available at: https://github.com/Xisonik/Aloha_graph.
中文: 本文提出SGN-CIRL框架,通过三维场景图实现无地图机器人导航,结合模仿学习与课程学习提升训练效果,在复杂环境中显著提高了导航成功率。
English: This paper introduces SGN-CIRL, a framework using 3D scene graphs for mapless robot navigation that enhances training with imitation and curriculum learning, significantly improving success rates in complex environments.
Authors:C. Evans Hedges
Abstract:
We provide evidence that orthogonalizing gradients during training improves model calibration without sacrificing accuracy. On CIFAR-10 with 10\% labeled data, $\perp$Grad matches SGD in accuracy but yields consistently improved calibration metrics such as lower test loss, reduced softmax overconfidence, and higher predictive entropy. These benefits persist under input corruption (CIFAR-10C) and extended training, where $\perp$Grad models degrade more gracefully than SGD-trained counterparts. $\perp$Grad is optimizer-agnostic, incurs minimal overhead, and works well with post-hoc calibration techniques like temperature scaling.
Theoretically, we prove convergence of a simplified version of $\perp$Grad under mild assumptions and characterize its stationary points in positive homogeneous networks: $\perp$Grad converges to solutions where further loss reduction requires confidence scaling rather than decision boundary improvement. Code for this paper can be found at: https://github.com/evanshedges2/orthograd_improves_calibration.
中文: 训练过程中正交化梯度可在保持准确性的同时提升模型校准度,具体表现为各项指标优化及在多种条件下的稳健性增强。
English: Orthogonalizing gradients during training enhances model calibration while maintaining accuracy, as demonstrated by improved metrics and robustness under various conditions.
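A minimal sketch of one plausible reading of the orthogonalization step, assuming it removes, per parameter tensor, the gradient component parallel to the weights themselves (consistent with the stationary-point characterization for positive homogeneous networks quoted above); the paper's exact definition of $\perp$Grad may differ.

```python
# Remove, per parameter tensor, the gradient component parallel to the weights.
# Assumed reading of the orthogonalization step; call between backward() and step().
import torch

def orthogonalize_gradients(model, eps=1e-12):
    for p in model.parameters():
        if p.grad is None:
            continue
        w = p.detach()
        coef = torch.sum(p.grad * w) / (torch.sum(w * w) + eps)
        p.grad.add_(w, alpha=-coef.item())   # in-place projection of the gradient

model = torch.nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
orthogonalize_gradients(model)
for p in model.parameters():
    print(float(torch.sum(p.grad * p.detach()).abs()))  # ~0 after projection
```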
Authors:Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Abstract:
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives.
To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
中文摘要:大型语言模型的水印技术会削弱其真实性、安全性和实用性,但提出的对齐重采样方法能在保持水印可检测性的同时有效恢复这些对齐特性。
English Summary: Watermarking techniques in large language models can degrade their truthfulness, safety, and helpfulness, but the proposed Alignment Resampling method effectively restores these alignment properties while maintaining watermark detectability.
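The inference-time loop of Alignment Resampling amounts to best-of-n selection under an external reward model; the generator and reward function below are placeholders, since the real method samples from a watermarked LLM and scores with a trained reward model.

```python
# Best-of-n selection over watermarked generations (placeholder generator/reward).
import random

def watermarked_generate(prompt, rng):          # stand-in for watermarked LLM decoding
    return prompt + " -> response#" + str(rng.randint(0, 9999))

def reward_model(prompt, response):             # stand-in for an external reward model
    return (hash((prompt, response)) % 1000) / 1000.0

def alignment_resampling(prompt, n=4, seed=0):
    rng = random.Random(seed)
    candidates = [watermarked_generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda r: reward_model(prompt, r))  # keep the best-scoring one

print(alignment_resampling("Explain watermark detectability.", n=4))
```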
Authors:Hasin Us Sami, Swapneel Sen, Amit K. Roy-Chowdhury, Srikanth V. Krishnamurthy, Basak Guler
Abstract:
Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/info-ucr/PEFTLeak.
Chinese: 联邦学习中的参数高效微调(PEFT)虽能在保护数据隐私的同时实现协同训练,但本研究发现恶意设计的预训练模型和适配器模块可通过梯度反演攻击,利用共享梯度重构用户的本地数据样本。
English: Federated learning with parameter-efficient fine-tuning (PEFT) enables collaborative model training while preserving data privacy, but this study reveals that maliciously designed pre-trained models and adapters can exploit shared gradients to reconstruct users' local data through gradient inversion attacks.
Authors:Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov
Abstract:
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
中文: 传统优化器微调大语言模型成本高昂,因此我们提出如JAGUAR SignSGD和JAGUAR Muon等内存高效的零阶方法,在降低资源需求的同时达到一阶方法性能。
English: Fine-tuning LLMs with traditional optimizers is costly, so we propose memory-efficient zero-order methods like JAGUAR SignSGD and JAGUAR Muon, which match first-order performance while reducing resource demands.
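A generic zero-order sign-descent sketch with momentum, using a single random direction and two function evaluations per step; this illustrates the family of estimators the abstract describes, not the exact JAGUAR update or its convergence conditions.

```python
# Zero-order sign descent with momentum on a toy objective.
import numpy as np

def zo_sign_sgd(f, x0, lr=0.05, mu=1e-3, beta=0.9, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    x, m = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(steps):
        u = rng.standard_normal(x.shape)
        g_hat = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u   # ZO gradient estimate
        m = beta * m + (1 - beta) * g_hat                        # momentum buffer
        x -= lr * np.sign(m)                                     # sign update
    return x

quadratic = lambda z: float(np.sum((z - 3.0) ** 2))              # minimum at 3.0
print(np.round(zo_sign_sgd(quadratic, np.zeros(5)), 2))
```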
Authors:Philippe Chlenski, Itsik Pe'er
Abstract:
Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the hyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein model of hyperbolic spaces. This preserves the thresholding operation used in Euclidean decision trees, enabling us to further rewrite hyperDT as simple pre- and post-processing steps that form a wrapper around existing tree-based models designed for Euclidean spaces. The wrapper approach unlocks many optimizations already available in Euclidean space models, improving flexibility, speed, and accuracy while offering a simpler, more maintainable, and extensible codebase. Our implementation is available at https://github.com/pchlenski/hyperdt.
中文: hyperDT算法在Beltrami-Klein双曲模型中被优雅重构,使其能够作为现有欧几里得决策树模型的封装器运行,从而提升效率、灵活性及代码可维护性。
English: The hyperDT algorithm is elegantly reformulated in the Beltrami-Klein hyperbolic model, enabling it to operate as a wrapper around existing Euclidean decision tree models with improved efficiency, flexibility, and code maintainability.
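The wrapper idea can be shown in a few lines: map Lorentz-model points to Beltrami-Klein coordinates via k = x_spatial / x_0 and hand them to an off-the-shelf Euclidean decision tree. The data and labels below are synthetic; only the pre-processing step mirrors the paper, with no change to tree internals.

```python
# Lorentz -> Beltrami-Klein conversion, then a standard Euclidean decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def lorentz_to_klein(X_lorentz):
    """X_lorentz: (n, d+1) points with x0 = sqrt(1 + |x_spatial|^2)."""
    return X_lorentz[:, 1:] / X_lorentz[:, :1]

rng = np.random.default_rng(0)
spatial = rng.normal(scale=0.8, size=(300, 2))
x0 = np.sqrt(1.0 + np.sum(spatial**2, axis=1, keepdims=True))
X_lorentz = np.hstack([x0, spatial])
y = (spatial[:, 0] + spatial[:, 1] > 0).astype(int)   # toy labels

clf = DecisionTreeClassifier(max_depth=3).fit(lorentz_to_klein(X_lorentz), y)
print("train accuracy:", clf.score(lorentz_to_klein(X_lorentz), y))
```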
Authors:Xiang Zheng, Xingjun Ma, Wei-Bin Lee, Cong Wang
Abstract:
Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, a lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the General Advantage Estimator. It supports a variety of token and sentence diversity metrics, featuring modularized intrinsic reward computation that facilitates plug-and-play experimentation. To clarify their influence on RFT performance, we conducted an extensive ablation study on key components, including Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and Lagrange Multiplier. We hope this work contributes to 1) gaining a comprehensive understanding of the implementation nuances of RFT-based red teaming algorithms, and 2) enabling rapid prototyping of innovative features for RFT-based red teaming. Code for the benchmark can be accessed at https://github.com/x-zheng16/RedRFT.git.
中文:RedRFT基准旨在标准化基于强化微调的红队测试方法,通过模块化设计和关键组件消融研究解决了大语言模型安全评估中的实现差异问题。
English: The RedRFT benchmark is introduced to standardize Reinforcement Fine-Tuning (RFT)-based red teaming for Large Language Models, addressing implementation inconsistencies and enabling modular experimentation with comprehensive component analysis.
Authors:Zhao-Heng Yin, Sherry Yang, Pieter Abbeel
Abstract:
Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixelflow, and pointcloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a "denoising" 3D motion field estimator to extract fine object 3D motions from human videos with noisy depth robustly. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background. We evaluate the system in real-world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieves a 55% average success rate in diverse tasks where prior approaches fail ($\lesssim 10\%$), and can even acquire fine-grained manipulation skills like insertion.
Authors:Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Abstract:
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
中文: 本文提出伪仿真这一自动驾驶汽车评估新范式,通过3D高斯溅射技术融合真实数据与合成观测,无需完整仿真即可评估错误恢复能力,且与闭环基准的相关性(R²=0.8)优于现有方法。
English: This paper introduces pseudo-simulation, a novel evaluation paradigm for autonomous vehicles that combines real-world data efficiency with synthetic observations generated via 3D Gaussian Splatting, enabling error recovery assessment without full simulation while achieving higher correlation (R²=0.8) with closed-loop benchmarks than existing methods.
Authors:Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
Abstract:
Long context large language models (LLMs) are deployed in many real-world applications such as RAG, agent, and broad LLM-integrated applications. Given an instruction and a long context (e.g., documents, PDF files, webpages), a long context LLM can generate an output grounded in the provided context, aiming to provide more accurate, up-to-date, and verifiable outputs while reducing hallucinations and unsupported claims. This raises a research question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs) in the context that contribute most to or are responsible for the generated output by an LLM? This process, which we call context traceback, has various real-world applications, such as 1) debugging LLM-based systems, 2) conducting post-attack forensic analysis for attacks (e.g., prompt injection attack, knowledge corruption attacks) to an LLM, and 3) highlighting knowledge sources to enhance the trust of users towards outputs generated by LLMs. When applied to context traceback for long context LLMs, existing feature attribution methods such as Shapley have sub-optimal performance and/or incur a large computational cost. In this work, we develop TracLLM, the first generic context traceback framework tailored to long context LLMs. Our framework can improve the effectiveness and efficiency of existing feature attribution methods. To improve the efficiency, we develop an informed search based algorithm in TracLLM. We also develop contribution score ensemble/denoising techniques to improve the accuracy of TracLLM. Our evaluation results show TracLLM can effectively identify texts in a long context that lead to the output of an LLM. Our code and data are at: https://github.com/Wang-Yanting/TracLLM.
中文: 长上下文大语言模型广泛应用于RAG和智能体等场景,能基于文档生成准确输出,但现有方法在追溯输出来源的上下文回溯中效率低下。本文提出TracLLM框架,通过智能搜索算法和贡献值优化技术,显著提升了回溯的准确性和效率,实验验证了其有效性。
English: Long context LLMs are widely used in applications like RAG and agents to generate accurate outputs from extensive documents, but identifying the specific text segments responsible for these outputs—a process called context traceback—remains challenging due to inefficiencies in existing methods. This paper introduces TracLLM, a novel framework that enhances the effectiveness and efficiency of context traceback through informed search algorithms and contribution score techniques, as validated by evaluation results.
Authors:Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, Sergey Levine
Abstract:
In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction
中文: 本研究发现在强化学习中仅增加数据无法提升离线学习性能,任务周期过长是主因,但通过引入如SHARSA等周期缩减技术可有效突破扩展性瓶颈。
English: This study finds that scaling data alone fails to improve offline reinforcement learning performance due to long task horizons, but introducing horizon reduction techniques like the proposed SHARSA method effectively unlocks scalability.
Authors:Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Abstract:
Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at https://github.com/Disha1001/CLAIM.
中文摘要:本研究提出了LegalCon数据集用于检测法庭对话中的操纵行为,并开发了CLAIM多智能体框架,通过先进自然语言处理技术提升司法过程的公平性与透明度。
English Summary: This research introduces LegalCon, a dataset for detecting manipulation in courtroom conversations, and proposes CLAIM, a multi-agent framework to enhance legal fairness and transparency through advanced NLP analysis.
Authors:Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
Abstract:
We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate the optimal tilted policy $\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the primary model $\pi_B$. We derive a theoretical bound on the KL divergence between our induced distribution and the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math), our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$. The code is available at https://github.com/j-geuter/GSI.
Chinese: 引导推测推理(GSI)是一种新颖算法,通过结合软性最优-n缩放、奖励模型和推测样本,有效提升大型语言模型的奖励引导解码效率,在推理基准测试中比现有方法获得了更高准确率。
English: Guided Speculative Inference (GSI) is a novel algorithm that enhances reward-guided decoding in large language models by combining soft best-of-n scaling with a reward model and speculative samples, achieving higher accuracy on reasoning benchmarks than existing methods.
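The tilted sampling rule at the heart of soft best-of-n can be sketched as follows: weight candidates by exp(beta * r) and sample one proportionally. The candidate strings and rewards below are stand-ins; GSI builds on this rule by drawing the candidates speculatively from the small auxiliary model $\pi_S$.

```python
# Soft best-of-n: sample one candidate with probability proportional to exp(beta * reward).
import math
import random

def soft_best_of_n(candidates, rewards, beta=2.0, seed=0):
    weights = [math.exp(beta * r) for r in rewards]
    total = sum(weights)
    return random.Random(seed).choices(candidates, weights=[w / total for w in weights], k=1)[0]

candidates = ["draft A", "draft B", "draft C", "draft D"]   # speculative samples (stand-ins)
rewards = [0.1, 0.7, 0.4, 0.65]                             # r(x, y) scores (stand-ins)
print(soft_best_of_n(candidates, rewards, beta=4.0))
```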
Authors:Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Abstract:
We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning. Our code is available at https://github.com/Lww007/Text-Atari-Agents.
中文: TextAtari将经典Atari游戏转化为文本环境,构建了评估语言智能体在长跨度决策任务中表现的基准,揭示了AI模型与人类在序列推理和战略规划方面存在的显著差距。
English: TextAtari is a benchmark that converts Atari games into text-based environments to evaluate language agents on long-horizon decision-making tasks, revealing significant performance gaps between AI models and humans in sequential reasoning and strategic planning.
Authors:Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Abstract:
As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.
中文摘要:AmbiK数据集被提出作为一个通用基准,旨在解决在具身智能体中比较大型语言模型模糊指令检测方法的难题,该数据集包含1000对经过人工验证的厨房环境模糊与非模糊任务。
English Summary: The AmbiK dataset is introduced as a universal benchmark to address the challenge of comparing ambiguity detection methods for LLMs in embodied agents, featuring 1000 pairs of ambiguous and unambiguous kitchen tasks with human-validated annotations.
Authors:Paul Fuchs, Weilong Chen, Stephan Thaler, Julija Zavadlav
Abstract:
Machine learning potentials (MLPs) have advanced rapidly and show great promise to transform molecular dynamics (MD) simulations. However, most existing software tools are tied to specific MLP architectures, lack integration with standard MD packages, or are not parallelizable across GPUs. To address these challenges, we present chemtrain-deploy, a framework that enables model-agnostic deployment of MLPs in LAMMPS. chemtrain-deploy supports any JAX-defined semi-local potential, allowing users to exploit the functionality of LAMMPS and perform large-scale MLP-based MD simulations on multiple GPUs. It achieves state-of-the-art efficiency and scales to systems containing millions of atoms. We validate its performance and scalability using graph neural network architectures, including MACE, Allegro, and PaiNN, applied to a variety of systems, such as liquid-vapor interfaces, crystalline materials, and solvated peptides. Our results highlight the practical utility of chemtrain-deploy for real-world, high-performance simulations and provide guidance for MLP architecture selection and future design.
Chinese: chemtrain-deploy框架实现了机器学习势在LAMMPS中的模型无关部署,支持任何JAX定义的半局域势,可在多GPU上实现高效的大规模分子动力学模拟,具备最先进的可扩展性。
English: The chemtrain-deploy framework enables model-agnostic deployment of machine learning potentials in LAMMPS, supporting any JAX-defined semi-local potential for efficient large-scale molecular dynamics simulations across multiple GPUs with state-of-the-art scalability.
Authors:Hicham Eddoubi, Jonas Ricker, Federico Cocchi, Lorenzo Baraldi, Angelo Sotgiu, Maura Pintor, Marcella Cornia, Lorenzo Baraldi, Asja Fischer, Rita Cucchiara, Battista Biggio
Abstract:
AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector's adversarial robustness. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at https://huggingface.co/datasets/aimagelab/RAID and evaluation code at https://github.com/pralab/RAID.
中文摘要:当前最先进的人工智能生成图像检测器极易受到对抗性样本的攻击,RAID数据集揭示了现有检测方法的严重脆弱性,凸显了开发更强健检测技术的迫切需求。
English Summary: AI-generated image detectors are highly vulnerable to adversarial attacks, as demonstrated by the RAID dataset, which exposes significant weaknesses in current state-of-the-art detection methods and underscores the urgent need for more robust solutions.
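The dataset construction relies on attacking an ensemble of detectors; the toy sketch below shows the standard PGD-against-an-ensemble recipe with random stand-in models and illustrative epsilon and step sizes, with no claim to match RAID's exact attack configuration.

```python
# PGD against an ensemble of (toy) detectors to craft transferable perturbations.
import torch

detectors = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
             for _ in range(3)]                              # stand-ins for real detectors

def ensemble_pgd(x, label, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = sum(torch.nn.functional.cross_entropy(d(x_adv), label) for d in detectors)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()              # maximize ensemble loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)         # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

x = torch.rand(4, 3, 32, 32)                                 # toy "AI-generated" images
y = torch.ones(4, dtype=torch.long)                          # label 1 = fake
print((ensemble_pgd(x, y) - x).abs().max())
```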
Authors:HyunGi Kim, Jisoo Mok, Dongjun Lee, Jaihyun Lew, Sungjae Kim, Sungroh Yoon
Abstract:
Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities. The code is available at https://github.com/kimanki/CAROTS.
中文摘要:本文提出CAROTS方法,通过引入因果关系的对比学习框架,利用保持和干扰因果关系的样本增强技术,有效提升了多元时间序列异常检测的性能。
English Summary: This paper introduces CAROTS, a novel multivariate time-series anomaly detection method that integrates causality into contrastive learning through causality-preserving and -disturbing data augmentations to better distinguish normal and abnormal patterns.
Authors:Aojun Lu, Tao Feng, Hangjie Yuan, Chunhui Ding, Yanan Sun
Abstract:
Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model's plasticity, particularly when incoming data distribution diverges largely from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from irrelevant classes. This mechanism theoretically and empirically demonstrates desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.
中文: 提出的ACL框架在持续学习任务前引入即插即用的适应阶段,通过将嵌入向量与原始类别原型对齐来优化预训练模型,有效平衡稳定性与可塑性以提升性能。
English: The proposed ACL framework introduces a plug-and-play adaptation phase before continual learning tasks to refine pre-trained models by aligning embeddings with original class prototypes, effectively balancing stability and plasticity to enhance performance.
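The adaptation phase can be approximated by a prototype-alignment objective: pull each embedding toward its class prototype and away from the other prototypes via a cross-entropy over cosine similarities. The loss below is an illustrative reading of the abstract, not the paper's exact formulation, and all tensors are synthetic.

```python
# Prototype-alignment loss: cross-entropy over cosine similarities to class prototypes.
import torch
import torch.nn.functional as F

def prototype_alignment_loss(embeddings, labels, prototypes, temperature=0.1):
    """embeddings: (B, D), prototypes: (C, D), labels: (B,) in [0, C)."""
    z = F.normalize(embeddings, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / temperature          # similarity of each sample to every prototype
    return F.cross_entropy(logits, labels)

emb = torch.randn(16, 128, requires_grad=True)
protos = torch.randn(10, 128)
labels = torch.randint(0, 10, (16,))
loss = prototype_alignment_loss(emb, labels, protos)
loss.backward()
print(float(loss))
```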
Authors:Jianqing Zhang, Xinghao Wu, Yanbing Zhou, Xiaoting Sun, Qiqi Cai, Yang Liu, Yang Hua, Zhenzhe Zheng, Jian Cao, Qiang Yang
Abstract:
As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at https://github.com/TsingZ0/HtFLlib.
Chinese: 为解决传统联邦学习无法支持异构模型及缺乏标准化基准的问题,异构联邦学习库(HtFLlib)被提出作为一个可扩展框架,整合了多样化数据集、模型架构和方法,以推动该领域的研究与应用。
English: To address the limitations of traditional Federated Learning in supporting heterogeneous models and the lack of a standardized benchmark, the Heterogeneous Federated Learning Library (HtFLlib) is introduced as an extensible framework integrating diverse datasets, model architectures, and methods to facilitate research and applications in this field.
Authors:Aojun Lu, Hangjie Yuan, Tao Feng, Yanan Sun
Abstract:
The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. While numerous CL methods aim to achieve this trade-off, they often overlook the impact of network architecture on stability and plasticity, restricting the trade-off to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Arch, which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments demonstrate that Dual-Arch enhances the performance of existing CL methods while being up to 87% more compact in terms of parameters. Code: https://github.com/byyx666/Dual-Arch.
Chinese: 本文提出Dual-Arch框架,通过分别针对可塑性和稳定性设计的两个独立网络,在架构层面解决持续学习中的稳定性-可塑性平衡问题,在提升性能的同时将参数量减少高达87%。
English: The paper introduces Dual-Arch, a plug-in framework for continual learning that utilizes two specialized networks—one for plasticity and one for stability—to address the stability-plasticity dilemma at the architectural level, improving performance while reducing parameters by up to 87%.
Authors:Cédric Léonard, Dirk Stober, Martin Schulz
Abstract:
New UAV technologies and the NewSpace era are transforming Earth Observation missions and data acquisition. Numerous small platforms generate large data volume, straining bandwidth and requiring onboard decision-making to transmit high-quality information in time. While Machine Learning allows real-time autonomous processing, FPGAs balance performance with adaptability to mission-specific requirements, enabling onboard deployment. This review systematically analyzes 66 experiments deploying ML models on FPGAs for Remote Sensing applications. We introduce two distinct taxonomies to capture both efficient model architectures and FPGA implementation strategies. For transparency and reproducibility, we follow PRISMA 2020 guidelines and share all data and code at https://github.com/CedricLeon/Survey_RS-ML-FPGA.
中文: 新型无人机和太空技术通过部署在FPGA上的机器学习模型实现了地球观测的实时数据处理,本文系统分析了66个相关实验并公开所有数据以确保透明度。
English: New UAV and NewSpace technologies are revolutionizing Earth Observation by enabling real-time data processing through machine learning models deployed on FPGAs, as systematically reviewed in 66 experiments with shared data for transparency.
Authors:Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
Abstract:
Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
中文摘要:本文提出一种候选标注新范式,通过让大语言模型对不确定样本输出所有可能的标签,并设计名为CanDist的师生框架,利用小语言模型蒸馏这些候选标注,从而为下游任务提供更高质量的数据保障。
English Summary: This paper introduces a candidate annotation paradigm that leverages LLMs to generate multiple possible labels for uncertain samples, coupled with a teacher-student framework called CanDist that distills these annotations using a smaller language model to ensure data quality for downstream tasks.
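A minimal sketch of how a candidate label set from the teacher LLM could be turned into a training signal for the student SLM in CanDist. The partial-label style objective below (maximizing the student's total probability on the candidate set) is an illustrative stand-in, not the paper's exact distillation loss; all names and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def candidate_set_loss(logits, candidate_mask):
    """Partial-label style objective: maximize the probability mass the
    student (SLM) assigns to the LLM's candidate label set.

    logits:         (batch, num_classes) student scores
    candidate_mask: (batch, num_classes) 1 for labels the LLM listed as
                    candidates, 0 otherwise (a single gold label is the
                    special case of a one-hot mask)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # log of the total probability on the candidate set, per sample
    cand_logprob = torch.logsumexp(
        log_probs.masked_fill(candidate_mask == 0, float("-inf")), dim=-1
    )
    return -cand_logprob.mean()

# toy usage: 3 classes, one confident sample, one ambiguous sample
logits = torch.randn(2, 3, requires_grad=True)
mask = torch.tensor([[0., 1., 0.],   # LLM gave a single label
                     [1., 1., 0.]])  # LLM listed two candidates
loss = candidate_set_loss(logits, mask)
loss.backward()
```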
Authors:Xiao-Qi Han, Ze-Feng Gao, Xin-De Wang, Zhenfeng Ouyang, Peng-Jie Guo, Zhong-Yi Lu
Abstract:
The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence~(AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present the HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at https://github.com/xqh19970407/HTSC-2025 and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.
中文: HTSC-2025基准数据集通过汇编2023至2025年基于BCS理论预测的超导材料,解决了该领域缺乏标准化数据的问题,为AI算法提供公平比较基准,将加速超导材料的发现进程。
English: The HTSC-2025 benchmark dataset addresses the lack of standardized data in AI-driven high-temperature superconductor research by compiling theoretical predictions from 2023-2025, enabling fair algorithm comparisons and accelerating material discovery.
Authors:Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Abstract:
Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs-from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often producing biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything in traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods and we show by a case study on how to automatically and efficiently align LLMs with only natural language principles.
Chinese Summary: 传统奖励模型依赖固定数据集,难以适应多样任务需求,而新型RewardAnything模型通过动态遵循自然语言原则,无需重新训练即可实现卓越的泛化能力。
English Summary: Reward Models traditionally rely on fixed datasets, limiting their adaptability to varying real-world tasks, but the new RewardAnything model dynamically follows natural language principles, achieving superior generalization without retraining.
Authors:Hiroki Shiraishi, Yohei Hayamizu, Tomonori Hashiyama, Keiki Takadama, Hisao Ishibuchi, Masaya Nakata
Abstract:
Rule representations significantly influence the search capabilities and decision boundaries within the search space of Learning Classifier Systems (LCSs), a family of rule-based machine learning systems that evolve interpretable models through evolutionary processes. However, it is very difficult to choose an appropriate rule representation for each problem. Additionally, some problems benefit from using different representations for different subspaces within the input space. Thus, an adaptive mechanism is needed to choose an appropriate rule representation for each rule in LCSs. This article introduces a flexible rule representation using a four-parameter beta distribution and integrates it into a fuzzy-style LCS. The four-parameter beta distribution can form various function shapes, and this flexibility enables our LCS to automatically select appropriate representations for different subspaces. Our rule representation can represent crisp/fuzzy decision boundaries in various boundary shapes, such as rectangles and bells, by controlling four parameters, compared to the standard representations such as trapezoidal ones. Leveraging this flexibility, our LCS is designed to adapt the appropriate rule representation for each subspace. Moreover, our LCS incorporates a generalization bias favoring crisp rules where feasible, enhancing model interpretability without compromising accuracy. Experimental results on real-world classification tasks show that our LCS achieves significantly superior test accuracy and produces more compact rule sets. Our implementation is available at https://github.com/YNU-NakataLab/Beta4-UCS. An extended abstract related to this work is available at https://doi.org/10.36227/techrxiv.174900805.59801248/v1.
中文: 本研究在分类器学习系统中引入了一种基于四参数贝塔分布的灵活规则表示方法,能够自动适应不同子空间的规则表示需求,从而以更紧凑、可解释的规则集实现更高的分类准确率。
English: This study introduces a flexible rule representation using a four-parameter beta distribution in Learning Classifier Systems, enabling automatic adaptation of rule representations for different subspaces and achieving higher accuracy with more compact, interpretable rule sets.
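To make the "rectangles and bells from four parameters" idea concrete, here is a hedged sketch of a membership function shaped like a four-parameter beta distribution (support [a, b], shape parameters p and q), rescaled to peak at 1 as is usual for fuzzy memberships. The rescaling and parameter names are illustrative assumptions, not the exact formulation in Beta4-UCS.

```python
import numpy as np

def beta4_membership(x, a, b, p, q):
    """Membership degree of x for a rule condition shaped like a
    four-parameter beta distribution with support [a, b] and shapes p, q.

    p = q = 1  -> flat (crisp, rectangle-like) membership on [a, b]
    p, q > 1   -> bell-like fuzzy membership peaking inside [a, b]
    """
    x = np.asarray(x, dtype=float)
    t = np.clip((x - a) / (b - a), 0.0, 1.0)          # map to [0, 1]
    shape = t ** (p - 1.0) * (1.0 - t) ** (q - 1.0)   # unnormalized beta shape
    if p > 1.0 and q > 1.0:                           # interior mode exists
        t_mode = (p - 1.0) / (p + q - 2.0)
        peak = t_mode ** (p - 1.0) * (1.0 - t_mode) ** (q - 1.0)
    else:                                             # flat or boundary-peaked
        peak = 1.0
    out = np.where((x < a) | (x > b), 0.0, shape / peak)
    return np.clip(out, 0.0, 1.0)

xs = np.linspace(0, 1, 5)
print(beta4_membership(xs, a=0.2, b=0.8, p=1.0, q=1.0))  # crisp rectangle
print(beta4_membership(xs, a=0.2, b=0.8, p=3.0, q=3.0))  # bell shape
```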
Authors:Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter
Abstract:
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
Chinese: SplArt是一种自监督框架,通过位姿RGB图像重建铰接物体并推断运动学,无需3D监督即可实现实时逼真渲染。
English: SplArt is a self-supervised framework that reconstructs articulated objects and infers kinematics from posed RGB images, enabling real-time photorealistic rendering without 3D supervision.
Authors:Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Abstract:
The decision-making process significantly influences the predictions of machine learning models. This is especially important in rule-based systems such as Learning Fuzzy-Classifier Systems (LFCSs) where the selection and application of rules directly determine prediction accuracy and reliability. LFCSs combine evolutionary algorithms with supervised learning to optimize fuzzy classification rules, offering enhanced interpretability and robustness. Despite these advantages, research on improving decision-making mechanisms (i.e., class inference schemes) in LFCSs remains limited. Most LFCSs use voting-based or single-winner-based inference schemes. These schemes rely on classification performance on training data and may not perform well on unseen data, risking overfitting. To address these limitations, this article introduces a novel class inference scheme for LFCSs based on the Dempster-Shafer Theory of Evidence (DS theory). The proposed scheme handles uncertainty well. By using the DS theory, the scheme calculates belief masses (i.e., measures of belief) for each specific class and the ``I don't know'' state from each fuzzy rule and infers a class from these belief masses. Unlike the conventional schemes, the proposed scheme also considers the ``I don't know'' state that reflects uncertainty, thereby improving the transparency and reliability of LFCSs. Applied to a variant of LFCS (i.e., Fuzzy-UCS), the proposed scheme demonstrates statistically significant improvements in terms of test macro F1 scores across 30 real-world datasets compared to conventional voting-based and single-winner-based fuzzy inference schemes. It forms smoother decision boundaries, provides reliable confidence measures, and enhances the robustness and generalizability of LFCSs in real-world applications. Our implementation is available at https://github.com/YNU-NakataLab/jUCS.
中文: 本文基于Dempster-Shafer理论提出了一种新的学习模糊分类系统类别推理方案,通过引入"不确定"状态改进了对不确定性的处理,提高了系统的透明度和可靠性,并在多个数据集上验证了其优越性能。
English: This article introduces a novel class inference scheme for Learning Fuzzy-Classifier Systems based on Dempster-Shafer Theory, which improves uncertainty handling, transparency, and reliability by incorporating an "I don't know" state and demonstrates enhanced performance across multiple datasets.
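The following is a minimal sketch of the Dempster-Shafer combination step described above, assuming each matched fuzzy rule contributes mass to the singleton classes plus the whole frame ("I don't know"). How the per-rule masses are derived from rule fitness, and the abstention threshold, are assumptions for illustration only.

```python
import numpy as np

def combine_ds(m1, m2):
    """Dempster's rule for mass functions whose focal elements are the
    singleton classes plus the whole frame Theta ("I don't know").
    Each mass is a pair (masses_over_classes, mass_on_theta)."""
    c1, t1 = m1
    c2, t2 = m2
    singles = c1 * c2 + c1 * t2 + t1 * c2              # evidence agreeing per class
    theta = t1 * t2                                     # both rules say "I don't know"
    conflict = c1.sum() * c2.sum() - (c1 * c2).sum()    # mass on disagreeing classes
    norm = 1.0 - conflict
    return singles / norm, theta / norm

def infer(rule_masses, abstain_threshold=0.5):
    """Fuse the masses of all matched fuzzy rules and pick a class,
    abstaining when too much belief remains on "I don't know"."""
    fused = rule_masses[0]
    for m in rule_masses[1:]:
        fused = combine_ds(fused, m)
    classes, theta = fused
    if theta >= abstain_threshold:
        return None, fused                              # "I don't know"
    return int(np.argmax(classes)), fused

# toy example with 3 classes: two rules lean towards class 0, one is unsure
rules = [
    (np.array([0.6, 0.1, 0.0]), 0.3),
    (np.array([0.5, 0.0, 0.1]), 0.4),
    (np.array([0.1, 0.1, 0.1]), 0.7),
]
print(infer(rules))
```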
Authors:Rui Yann, Tianshuo Zhang, Xianglei Xing
Abstract:
We present SemiOccam, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, requiring hundreds of GPU hours for training, while their generalization ability with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on three commonly used datasets, with accuracy exceeding 95% on two of them using only 4 labeled samples per class, and its simple architecture keeps training time at the minute level. Notably, this paper reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning and removes duplicates to ensure reliable experimental results. We release the deduplicated CleanSTL-10 dataset to facilitate fair and reproducible research. Code available at https://github.com/Shu1L0n9/SemiOccam.
中文: SemiOccam提出了一种高效的半监督图像识别网络,仅用每类四个标注样本就在两个数据集上实现超过95%的准确率,同时揭露并修正了STL-10数据集的数据泄露问题以确保结果可靠性。
English: SemiOccam introduces an efficient semi-supervised image recognition network that achieves over 95% accuracy on two datasets with just four labeled samples per class, while exposing and rectifying data leakage in STL-10 to ensure reliable results.
Authors:Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
Abstract:
Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
中文摘要:提出的位置专家(PosS)方法通过为不同位置分配专门的草稿层,有效减少错误累积,显著提高了大语言模型推理中的令牌接受率和加速比。
English Summary: The proposed Position Specialists (PosS) method enhances speculative decoding by using specialized draft layers for different token positions, effectively reducing error accumulation and improving acceptance rates and inference speed in large language models.
Authors:Yongxiang Tang, Yanhua Cheng, Xiaocheng Liu, Chenchen Jiao, Yanxiang Zeng, Ning Luo, Pengjia Yuan, Xialong Liu, Peng Jiang
Abstract:
In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques, whereas this paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, termed the Generative Cost Model (GCM), which inherently addresses the strict monotonic problem, and propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at https://github.com/tyxaaron/GCM.
中文摘要:本文通过将严格单调概率问题重构为可观测收益变量与潜在成本变量间的偏序关系,提出了生成式成本模型(GCM与IGCM),在保持严格单调和隐式单调关系方面显著优于现有方法。
English Summary: This paper reframes strict monotonic probability as a partial order between observable revenue and latent cost variables, introducing Generative Cost Models (GCM and IGCM) that outperform existing methods in maintaining both strict and implicit monotonic relationships.
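A hedged sketch of why the latent-cost view yields strict monotonicity by construction: if the prediction approximates P(cost <= revenue), with the cost drawn from a generative network, it cannot decrease as revenue grows. The architecture, the soft indicator, and the sample count below are illustrative assumptions, not the paper's exact GCM.

```python
import torch
import torch.nn as nn

class GenerativeCostModel(nn.Module):
    """Predict p(y=1 | revenue, features) = P(cost <= revenue), where the
    cost is produced by a small generative network from noise + features.
    Because the prediction is an (approximate) CDF evaluated at the revenue,
    it is non-decreasing in revenue by construction."""
    def __init__(self, feat_dim, noise_dim=8, n_samples=64):
        super().__init__()
        self.noise_dim, self.n_samples = noise_dim, n_samples
        self.cost_net = nn.Sequential(
            nn.Linear(feat_dim + noise_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, features, revenue, temperature=0.1):
        b = features.size(0)
        eps = torch.randn(b, self.n_samples, self.noise_dim)
        feats = features.unsqueeze(1).expand(-1, self.n_samples, -1)
        cost = self.cost_net(torch.cat([feats, eps], dim=-1)).squeeze(-1)
        # soft indicator of cost <= revenue, averaged over noise samples
        prob = torch.sigmoid((revenue.unsqueeze(1) - cost) / temperature)
        return prob.mean(dim=1)

model = GenerativeCostModel(feat_dim=4)
x = torch.randn(3, 4)
torch.manual_seed(0); p_low = model(x, torch.zeros(3))
torch.manual_seed(0); p_high = model(x, torch.ones(3))
assert torch.all(p_high >= p_low)   # same noise draws, monotone in revenue
```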
Authors:Daniel Campa, Mehdi Saeedi, Ian Colbert, Srinjoy Das
Abstract:
Navigation path traces play a crucial role in video game design, serving as a vital resource for both enhancing player engagement and fine-tuning non-playable character behavior. Generating such paths with human-like realism can enrich the overall gaming experience, and evaluating path traces can provide game designers insights into player interactions. Despite the impressive recent advancements in deep learning-based generative modeling, the video game industry hesitates to adopt such models for path generation, often citing their complex training requirements and interpretability challenges. To address these problems, we propose a novel path generation and evaluation approach that is grounded in principled nonparametric statistics and provides precise control while offering interpretable insights. Our path generation method fuses two statistical techniques: (1) nonparametric model-free transformations that capture statistical characteristics of path traces through time; and (2) copula models that capture statistical dependencies in space. For path evaluation, we adapt a nonparametric three-sample hypothesis test designed to determine if the generated paths are overfit (mimicking the original data too closely) or underfit (diverging too far from it). We demonstrate the precision and reliability of our proposed methods with empirical analysis on two existing gaming benchmarks to showcase controlled generation of diverse navigation paths. Notably, our novel path generator can be fine-tuned with user controllable parameters to create navigation paths that exhibit varying levels of human-likeness in contrast to those produced by neural network-based agents. The code is available at https://github.com/daniel-campa/mf-copula.
中文: 作者提出了一种基于非参数统计和Copula模型的新型路径生成与评估方法,能够为视频游戏生成拟人化导航路径,在解决深度学习模型局限性的同时提供了可解释性及对路径真实感的精确控制。
English: The authors introduce a novel path generation and evaluation method using nonparametric statistics and copula models to produce human-like navigation paths for video games, offering interpretability and control over path realism while addressing the limitations of deep learning approaches.
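To illustrate the copula component, here is a generic Gaussian-copula sketch that keeps empirical marginals (via rank transforms) and models only the dependence between, say, per-step path increments. It is a simplified stand-in for the paper's model-free transformation plus copula pipeline; the function names and the increment-based usage are assumptions.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(samples):
    """Fit a Gaussian copula to multivariate samples (e.g., per-step (dx, dy)
    path increments): keep the empirical marginals, model only dependence."""
    n, d = samples.shape
    # probability-integral transform via ranks -> uniform marginals
    u = (stats.rankdata(samples, axis=0) - 0.5) / n
    z = stats.norm.ppf(u)                        # map to normal scores
    corr = np.corrcoef(z, rowvar=False)          # dependence in normal space
    marginals = np.sort(samples, axis=0)         # empirical quantile tables
    return corr, marginals

def sample_gaussian_copula(corr, marginals, n_new, rng=None):
    rng = np.random.default_rng(rng)
    n, d = marginals.shape
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_new)
    u = stats.norm.cdf(z)                        # back to uniforms
    idx = np.clip((u * n).astype(int), 0, n - 1) # empirical inverse CDFs
    return marginals[idx, np.arange(d)]

# toy usage: correlated step increments, then a cumulative sum gives a path
steps = np.random.default_rng(0).multivariate_normal(
    [0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=500)
corr, marg = fit_gaussian_copula(steps)
new_path = np.cumsum(sample_gaussian_copula(corr, marg, 200, rng=1), axis=0)
```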
Authors:Mahesh Godavarti
Abstract:
We present an empirical validation of the directional non-commutative monoidal embedding framework recently introduced in prior work~\cite{Godavarti2025monoidal}. This framework defines learnable compositional embeddings using distinct non-commutative operators per dimension (axis) that satisfy an interchange law, generalizing classical one-dimensional transforms. Our primary goal is to verify that this framework can effectively model real data by applying it to a controlled, well-understood task: image classification on the MNIST dataset~\cite{lecun1998gradient}. A central hypothesis for why the proposed monoidal embedding works well is that it generalizes the Discrete Fourier Transform (DFT)~\cite{oppenheim1999discrete} by learning task-specific frequency components instead of using fixed basis frequencies. We test this hypothesis by comparing learned monoidal embeddings against fixed DFT-based embeddings on MNIST. The results show that as the embedding dimensionality decreases (e.g., from 32 to 8 to 2), the performance gap between the learned monoidal embeddings and fixed DFT-based embeddings on MNIST grows increasingly large. This comparison is used as an analytic tool to explain why the framework performs well: the learnable embeddings can capture the most discriminative spectral components for the task. Overall, our experiments confirm that directional non-commutative monoidal embeddings are highly effective for representing image data, offering a compact learned representation that retains high task performance. The code used in this work is available at https://github.com/mahesh-godavarti/directional_composition_mnist.
中文: 本研究验证了方向性非交换幺半群嵌入框架,通过实验证明其能学习任务特定的频谱成分,在MNIST分类任务中(尤其在低维情况下)显著优于固定的基于离散傅里叶变换的嵌入方法。
English: This study validates a directional non-commutative monoidal embedding framework, demonstrating its superiority over fixed DFT-based embeddings in MNIST classification, especially in low dimensions, by learning task-specific spectral components.
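A toy illustration of the learned-versus-fixed-frequency comparison discussed above: the same sinusoidal feature map is built either with fixed DFT bin frequencies k/N or with the frequencies left as trainable parameters. This is not the directional non-commutative operator itself, only a sketch of why learnable spectral components can outperform a fixed basis at low dimensionality.

```python
import torch
import torch.nn as nn

class FrequencyEmbedding(nn.Module):
    """Spectral features along one axis. With learnable=False the
    frequencies are the usual DFT bins k/N; with learnable=True they become
    parameters, so training can keep the most discriminative components."""
    def __init__(self, length, n_freq, learnable=True):
        super().__init__()
        freqs = torch.arange(n_freq, dtype=torch.float32) / length
        self.freqs = nn.Parameter(freqs, requires_grad=learnable)
        self.register_buffer("pos", torch.arange(length, dtype=torch.float32))

    def forward(self, x):                                   # x: (batch, length)
        angles = 2 * torch.pi * self.pos[:, None] * self.freqs[None, :]
        basis = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        return x @ basis                                    # (batch, 2 * n_freq)

emb_fixed = FrequencyEmbedding(length=28, n_freq=4, learnable=False)
emb_learn = FrequencyEmbedding(length=28, n_freq=4, learnable=True)
feats = emb_learn(torch.randn(8, 28))                       # (8, 8) spectral features
```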
Authors:Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang
Abstract:
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
中文: 本研究提出了用于配体结合位点检测的端到端框架UniSite、UniSite-DS数据集及基于交并比的评估指标,实验证明其性能优于现有方法。
English: This study introduces UniSite, an end-to-end framework for ligand binding site detection, along with the UniSite-DS dataset and an IoU-based evaluation metric, demonstrating superior performance over existing methods.
Authors:Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu
Abstract:
Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
中文: NetPress是一个自动化框架,通过动态生成可扩展基准并集成网络模拟器反馈,全面评估LLM代理在网络应用中的正确性、安全性和延迟,弥补静态评估的不足。
English: NetPress is an automated framework that dynamically generates scalable benchmarks for evaluating LLM agents in network applications, integrating emulator feedback to assess correctness, safety, and latency beyond static datasets.
Authors:Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Abstract:
Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Codes are available at https://github.com/ziyangjoy/DiaBlo.
中文: DiaBlo是一种参数高效微调方法,仅更新权重矩阵的对角块,无需低秩近似或特殊初始化即可实现稳定收敛,并保持与LoRA相当的训练效率。
English: DiaBlo is a parameter-efficient fine-tuning method that updates only diagonal blocks of weight matrices, achieving stable convergence and comparable efficiency to LoRA without requiring low-rank approximations or special initialization.
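A minimal sketch of the diagonal-block idea: the pretrained weight stays frozen and only zero-initialized blocks on the diagonal of an additive update are trained, so no low-rank factors or special initialization are needed. The block size, the square-weight assumption, and the class name are illustrative, not DiaBlo's exact implementation.

```python
import torch
import torch.nn as nn

class DiagonalBlockLinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable block-diagonal update.
    Only `blocks` receives gradients during finetuning."""
    def __init__(self, linear: nn.Linear, block_size: int):
        super().__init__()
        out_f, in_f = linear.weight.shape
        assert out_f == in_f and out_f % block_size == 0, "sketch assumes square, divisible dims"
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        if linear.bias is not None:
            self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        else:
            self.bias = None
        self.n_blocks, self.block_size = out_f // block_size, block_size
        # zero init so finetuning starts exactly from the pretrained model
        self.blocks = nn.Parameter(torch.zeros(self.n_blocks, block_size, block_size))

    def forward(self, x):
        y = nn.functional.linear(x, self.weight, self.bias)
        # block-diagonal update: multiply each feature block by its own matrix
        xb = x.reshape(*x.shape[:-1], self.n_blocks, self.block_size)
        yb = torch.einsum("...nb,ncb->...nc", xb, self.blocks)
        return y + yb.reshape(y.shape)

layer = DiagonalBlockLinear(nn.Linear(64, 64), block_size=16)
out = layer(torch.randn(2, 10, 64))                      # (2, 10, 64)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]  # ['blocks']
```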
Authors:Dania Herzalla, Willian T. Lunardi, Martin Andreoni
Abstract:
Graph-based learning provides a powerful framework for modeling complex relational structures; however, its application within the domain of wireless security remains significantly underexplored. In this work, we introduce the first application of graph-based learning for jamming source localization, addressing the imminent threat of jamming attacks in wireless networks. Unlike geometric optimization techniques that struggle under environmental uncertainties and dense interference, we reformulate the localization as an inductive graph regression task. Our approach integrates structured node representations that encode local and global signal aggregation, ensuring spatial coherence and adaptive signal fusion. To enhance robustness, we incorporate an attention-based graph neural network (GNN) that adaptively refines neighborhood influence and introduces a confidence-guided estimation mechanism that dynamically balances learned predictions with domain-informed priors. We evaluate our approach under complex radio frequency (RF) environments with various sampling densities, network topologies, jammer characteristics, and signal propagation conditions, conducting comprehensive ablation studies on graph construction, feature selection, and pooling strategies. Results demonstrate that our novel graph-based learning framework significantly outperforms established localization baselines, particularly in challenging scenarios with sparse and obfuscated signal information. Our code is available at https://github.com/tiiuae/gnn-jamming-source-localization.
中文: 本研究首次提出基于图学习的干扰源定位框架,通过自适应信号聚合和置信度引导机制,在复杂无线环境中显著优于传统定位方法。
English: This study introduces the first graph-based learning framework for jamming source localization, which outperforms traditional methods by integrating adaptive signal aggregation and confidence-guided mechanisms to handle complex wireless environments.
Authors:Yunqi Hong, Sohyun An, Andrew Bai, Neil Y. C. Lin, Cho-Jui Hsieh
Abstract:
Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP on average improves 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
中文: AutoSEP是一种自监督提示学习框架,通过从未标注数据中迭代学习区分性提示来增强多模态大语言模型的细粒度图像分类能力,无需模型训练即可显著提升分类准确率。
English: AutoSEP is a self-supervised prompt learning framework that enhances MLLMs' fine-grained image classification by iteratively learning discriminative prompts from unlabeled data, achieving significant accuracy improvements without model training.
Authors:Ekram Alam, Abu Sufian, Paramartha Dutta, Marco Leo
Abstract:
Unintentional or accidental falls are a significant health issue for senior persons, whose population is increasing steadily, creating a need for automated fall detection monitoring systems. This paper introduces a vision-based fall detection system built on a pre-trained 3D CNN. Unlike a 2D CNN, a 3D CNN extracts not only spatial but also temporal features. The proposed model leverages the original learned weights of a 3D CNN pre-trained on the Sports1M dataset to extract spatio-temporal features; only the SVM classifier is trained, which saves the time required to train the 3D CNN. Five-split stratified shuffle cross-validation was used to divide the dataset into training and testing data, and the extracted features were fed to an SVM classifier to classify each activity as a fall or an ADL (activity of daily living). Two datasets, GMDCSA and CAUCAFall, were used in the experiments. The source code for this work can be accessed via the following link: https://github.com/ekramalam/HFD_3DCNN.
中文: 本文提出一种基于视觉的老年人跌倒检测系统,利用预训练的3D卷积神经网络提取时空特征,仅需训练SVM分类器即可有效区分跌倒与日常活动。
English: This paper presents a vision-based fall detection system for seniors using a pre-trained 3D CNN to extract spatiotemporal features, with only the SVM classifier trained on datasets to efficiently distinguish falls from daily activities.
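The pipeline above reduces to "frozen spatio-temporal features plus a trainable SVM"; the sketch below wires that up with scikit-learn, using stand-in feature vectors in place of the Sports1M-pretrained 3D CNN outputs.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder clip-level features; in the real pipeline these would come
# from a Sports1M-pretrained 3D CNN applied to video clips.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4096))
labels = rng.integers(0, 2, size=200)        # 1 = fall, 0 = ADL

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(features, labels):
    clf.fit(features[train_idx], labels[train_idx])
    scores.append(clf.score(features[test_idx], labels[test_idx]))
print("accuracy per split:", np.round(scores, 3))
```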
Authors:Shivani Chiranjeevi, Hossein Zaremehrjerdi, Zi K. Deng, Talukder Z. Jubery, Ari Grele, Arti Singh, Asheesh K Singh, Soumik Sarkar, Nirav Merchant, Harold F. Greeney, Baskar Ganapathysubramanian, Chinmay Hegde
Abstract:
The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models' proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90\% F1 at the Order level on known species, but drop below 2\% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order $\rightarrow$ Family $\rightarrow$ Genus $\rightarrow$ Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available \href{https://baskargroup.github.io/TerraIncognita/}{here}.
中文: TerraIncognita是一个动态基准,旨在评估多模态AI模型从图像中识别未知昆虫物种的能力,尽管在精细分类级别性能急剧下降,但它能有效进行层级分类并检测新物种。
English: TerraIncognita is a dynamic benchmark that evaluates multimodal AI models for identifying unknown insect species from images, highlighting their proficiency in hierarchical classification and detection of novel species despite a sharp performance decline at finer taxonomic levels.
Authors:Bin Wang, Yongqi Han, Minbo Ma, Tianrui Li, Junbo Zhang, Feng Hong, Yanwei Yu
Abstract:
Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish the rule of thumb for designing the golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to elaborating and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model's learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting models, regardless of their specific architectures. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1). Code is available at https://github.com/hanyuki23/SoP.
中文摘要:本研究提出了一种名为Socket+Plug(SoP)的通用校准策略,通过解决多目标学习冲突来提升现有深度学习模型在时间序列预测中的性能,无需开发新模型即可实现高达22%的性能提升。
English Summary: This study introduces a universal calibration strategy called Socket+Plug (SoP) that enhances existing deep learning models for time series forecasting by addressing multi-target learning conflicts, achieving up to 22% improvement without requiring new model development.
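A simplified reading of Socket+Plug: the fully trained backbone (Socket) is frozen, and each predicted target gets its own lightweight Plug with a dedicated optimizer and early-stopping monitor, avoiding the multi-target learning conflict. The Plug architecture, loss, and patience values below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Plug(nn.Module):
    """One lightweight calibrator per predicted target, each with its own
    optimizer and early-stopping state."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.optimizer = torch.optim.Adam(self.mlp.parameters(), lr=1e-3)
        self.best_val, self.bad_epochs, self.stopped = float("inf"), 0, False

    def step(self, hidden, target, val_loss=None, patience=5):
        if self.stopped:
            return
        loss = nn.functional.mse_loss(self.mlp(hidden).squeeze(-1), target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        if val_loss is not None:                     # per-target early stopping
            if val_loss < self.best_val:
                self.best_val, self.bad_epochs = val_loss, 0
            else:
                self.bad_epochs += 1
                self.stopped = self.bad_epochs >= patience

# frozen, fully trained forecasting backbone ("Socket"); a stand-in here
socket = nn.Linear(32, 16)
for p in socket.parameters():
    p.requires_grad = False

plugs = [Plug(hidden_dim=16) for _ in range(3)]      # one Plug per predicted target
x, y = torch.randn(8, 32), torch.randn(8, 3)
with torch.no_grad():
    hidden = socket(x)
for t, plug in enumerate(plugs):
    plug.step(hidden, y[:, t], val_loss=0.5)
```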
Authors:Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
Abstract:
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves the state-of-the-art performance on both ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
Chinese: 我们提出了原生分辨率图像合成,这是一种利用原生分辨率扩散变换器(NiT)生成任意分辨率和宽高比图像的新范式,实现了最先进的性能及超越训练数据的零样本泛化能力。
English: We introduce native-resolution image synthesis, a novel paradigm using the Native-resolution diffusion Transformer (NiT) to generate images at any resolution and aspect ratio, achieving state-of-the-art performance and zero-shot generalization beyond training data.
Authors:Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
Abstract:
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
中文摘要:FuseLIP采用早期融合架构,通过单一Transformer处理文本与图像标记的组合,在多模态任务中表现卓越,同时在单模态任务上保持基准竞争力。
English Summary: FuseLIP introduces an early fusion architecture using a single transformer to process combined text and image tokens, achieving superior multimodal task performance while maintaining competitive unimodal capabilities.
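A minimal sketch of the early-fusion idea: image tokens from a discrete tokenizer are shifted into an extended vocabulary shared with text tokens, and a single transformer encodes the mixed sequence into one embedding from the first layer on. The vocabulary sizes, depth, and mean pooling are assumptions, not FuseLIP's actual configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single transformer over an extended vocabulary of text + image tokens.
    Image inputs are assumed to be pre-computed discrete code ids."""
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=256, depth=4):
        super().__init__()
        self.image_offset = text_vocab                       # extended vocabulary
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_ids, image_code_ids):
        tokens = torch.cat([text_ids, image_code_ids + self.image_offset], dim=1)
        h = self.encoder(self.embed(tokens))                 # modalities interact at every layer
        return self.proj(h.mean(dim=1))                      # one joint embedding

model = EarlyFusionEncoder()
text = torch.randint(0, 32000, (2, 16))
image_codes = torch.randint(0, 8192, (2, 64))
embedding = model(text, image_codes)                         # (2, 256)
```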
Authors:Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai
Abstract:
Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to growing demands for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose EGSteal, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.
Chinese: 本文提出一种新颖的模型窃取框架,通过利用可解释图神经网络的解释机制来复制目标模型的预测行为和推理模式,揭示了在敏感应用领域存在的安全隐患。
English: This paper introduces a novel model stealing framework that exploits explanations from explainable Graph Neural Networks to replicate both the predictive behavior and reasoning patterns of target models, highlighting security vulnerabilities in sensitive applications.
Authors:Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
Abstract:
Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP achieves less computational FLOPs and faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times larger, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at https://github.com/Ledzy/StreamBP.
中文: StreamBP是一种内存高效且精确的反向传播方法,通过逐层分解链式法则显著降低激活值内存成本,相比梯度检查点技术可将训练序列长度提升2.8-5.5倍。
English: StreamBP is a memory-efficient and exact backpropagation method that decomposes the chain rule layer-wise to significantly reduce activation memory costs while enabling 2.8-5.5 times longer sequence training compared to gradient checkpointing.
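The core observation is that the sequence loss decomposes into a sum over positions, so it can be backpropagated chunk by chunk. The hedged sketch below applies this only to the LM head and cross-entropy, avoiding a full (seq_len, vocab) logits tensor; StreamBP extends the same layer-wise decomposition through the transformer itself, which this sketch does not attempt.

```python
import torch
import torch.nn as nn

def chunked_lm_loss_backward(hidden, lm_head, labels, chunk=512):
    """Backpropagate a causal-LM cross-entropy without materializing the full
    logits tensor: build, backpropagate, and free each chunk's logits in turn.
    Gradients accumulate in `hidden.grad` and in the LM head parameters."""
    total_tokens = labels.numel()
    total_loss = 0.0
    for start in range(0, hidden.size(1), chunk):
        h = hidden[:, start:start + chunk]
        y = labels[:, start:start + chunk]
        logits = lm_head(h)                                  # only chunk-sized logits
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1),
            reduction="sum") / total_tokens
        loss.backward()                                      # frees this chunk's graph
        total_loss += loss.item()
    return total_loss

vocab, dim = 1000, 64
lm_head = nn.Linear(dim, vocab)
hidden = torch.randn(1, 2048, dim, requires_grad=True)       # backbone outputs
labels = torch.randint(0, vocab, (1, 2048))
loss_value = chunked_lm_loss_backward(hidden, lm_head, labels, chunk=256)
```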
Authors:Li Zhang, Kevin D. Ashley
Abstract:
Large Language Models (LLMs) are increasingly explored for legal argument generation, yet they pose significant risks of manipulation through hallucination and ungrounded persuasion, and often fail to utilize provided factual bases effectively or abstain when arguments are untenable. This paper introduces a novel reflective multi-agent method designed to address these challenges in the context of legally compliant persuasion. Our approach employs specialized agents--a Factor Analyst and an Argument Polisher--in an iterative refinement process to generate 3-ply legal arguments (plaintiff, defendant, rebuttal). We evaluate Reflective Multi-Agent against single-agent, enhanced-prompt single-agent, and non-reflective multi-agent baselines using four diverse LLMs (GPT-4o, GPT-4o-mini, Llama-4-Maverick-17b-128e, Llama-4-Scout-17b-16e) across three legal scenarios: "arguable", "mismatched", and "non-arguable". Results demonstrate Reflective Multi-Agent's significant superiority in successful abstention (preventing generation when arguments cannot be grounded), marked improvements in hallucination accuracy (reducing fabricated and misattributed factors), particularly in "non-arguable" scenarios, and enhanced factor utilization recall (improving the use of provided case facts). These findings suggest that structured reflection within a multi-agent framework offers a robust computable method for fostering ethical persuasion and mitigating manipulation in LLM-based legal argumentation systems, a critical step towards trustworthy AI in law. Project page: https://lizhang-aiandlaw.github.io/A-Reflective-Multi-Agent-Approach-for-Legal-Argument-Generation/
Authors:Junyi Fang, Yuxun Chen, Yuxin Chen, Chen Zhang
Abstract:
The Multi-Armed Bandit (MAB) problem is challenging in non-stationary environments where reward distributions evolve dynamically. We introduce RAVEN-UCB, a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation. It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order $K \sigma_{\max}^2 \log T / \Delta$ and gap-independent regret of order $\sqrt{K T \log T}$. RAVEN-UCB incorporates three innovations: (1) variance-driven exploration using $\sqrt{\hat{\sigma}_k^2 / (N_k + 1)}$ in confidence bounds, (2) adaptive control via $\alpha_t = \alpha_0 / \log(t + \varepsilon)$, and (3) constant-time recursive updates for efficiency. Experiments across non-stationary patterns (distributional changes, periodic shifts, and temporary fluctuations) in synthetic and logistics scenarios demonstrate its superiority over state-of-the-art baselines, confirming theoretical and practical robustness.
中文: RAVEN-UCB是一种针对非平稳多臂老虎机问题的新算法,通过方差感知自适应实现了更严格的遗憾界,并在多种动态环境中展现出优越性能。
English: RAVEN-UCB is a novel algorithm for non-stationary multi-armed bandit problems that achieves tighter regret bounds through variance-aware adaptation and demonstrates superior performance in various dynamic environments.
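A hedged sketch that wires together the two ingredients stated in the abstract: the variance-driven bonus $\sqrt{\hat{\sigma}_k^2 / (N_k + 1)}$ with an adaptive coefficient $\alpha_t = \alpha_0 / \log(t + \varepsilon)$, plus constant-time (Welford-style) recursive updates. The paper's exact confidence bound may include further terms; this is only an illustration.

```python
import math
import numpy as np

class RavenUCBSketch:
    """Toy bandit index built from the abstract's stated ingredients."""
    def __init__(self, n_arms, alpha0=1.0, eps=1e-6):
        self.alpha0, self.eps = alpha0, eps
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)
        self.m2 = np.zeros(n_arms)          # running sum of squared deviations
        self.t = 0

    def select(self):
        self.t += 1
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))       # play each arm once first
        alpha_t = self.alpha0 / math.log(self.t + self.eps)
        var = self.m2 / np.maximum(self.counts, 1)
        bonus = alpha_t * np.sqrt(var / (self.counts + 1))
        return int(np.argmax(self.means + bonus))

    def update(self, arm, reward):
        # Welford's constant-time recursive mean/variance update
        self.counts[arm] += 1
        delta = reward - self.means[arm]
        self.means[arm] += delta / self.counts[arm]
        self.m2[arm] += delta * (reward - self.means[arm])

rng = np.random.default_rng(0)
bandit = RavenUCBSketch(n_arms=3)
true_means = [0.2, 0.5, 0.4]
for _ in range(2000):
    a = bandit.select()
    bandit.update(a, rng.normal(true_means[a], 0.3))
print("pull counts:", bandit.counts)
```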
Authors:Praneet Sai Madhu Surabhi, Dheeraj Reddy Mudireddy, Jian Tao
Abstract:
This paper presents ThinkTank, a comprehensive and scalable framework designed to transform specialized AI agent systems into versatile collaborative intelligence platforms capable of supporting complex problem-solving across diverse domains. ThinkTank systematically generalizes agent roles, meeting structures, and knowledge integration mechanisms by adapting proven scientific collaboration methodologies. Through role abstraction, generalization of meeting types for iterative collaboration, and the integration of Retrieval-Augmented Generation with advanced knowledge storage, the framework facilitates expertise creation and robust knowledge sharing. ThinkTank enables organizations to leverage collaborative AI for knowledge-intensive tasks while ensuring data privacy and security through local deployment, utilizing frameworks like Ollama with models such as Llama3.1. The ThinkTank framework is designed to deliver significant advantages in cost-effectiveness, data security, scalability, and competitive positioning compared to cloud-based alternatives, establishing it as a universal platform for AI-driven collaborative problem-solving. The ThinkTank code is available at https://github.com/taugroup/ThinkTank
中文: ThinkTank是一个可扩展框架,通过泛化角色、会议结构和知识整合机制,将专业AI代理系统转变为支持跨领域复杂问题解决的协作平台,并利用本地部署确保数据隐私和安全。
English: ThinkTank is a scalable framework that transforms specialized AI agents into collaborative platforms for complex problem-solving by generalizing roles, meeting structures, and knowledge integration while ensuring data privacy through local deployment.
Authors:Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu
Abstract:
Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.
中文: 本研究提出了CellPuzzles基准任务,要求通过批次级推理在不同条件下分配唯一细胞类型,并开发了Cell-o1模型——一个通过监督微调和批次级奖励强化学习训练的70亿参数大语言模型,实现了最先进的性能表现。
English: The study introduces CellPuzzles, a benchmark task requiring batch-level reasoning to assign unique cell types across diverse conditions, and proposes Cell-o1, a 7B LLM that achieves state-of-the-art performance by leveraging supervised fine-tuning and reinforcement learning with batch-level rewards.
Authors:Changyi Xiao, Mengdi Zhang, Yixin Cao
Abstract:
Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. BNPO aligns the normalization with the changing policy distribution, enabling more precise and lower-variance gradient estimation, which in turn promotes stable training dynamics. We provide theoretical analysis demonstrating BNPO's variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Furthermore, we introduce an advantage decomposition mechanism to extend BNPO's applicability to more complex reward systems. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks. The code is available at https://github.com/changyi7231/BNPO.
Chinese: 本文提出Beta归一化策略优化(BNPO),这是一种通过Beta分布自适应归一化二元奖励的新方法,旨在稳定训练并增强大语言模型的推理能力,实现了最先进的性能。
English: This paper introduces Beta Normalization Policy Optimization (BNPO), a novel method that adaptively normalizes binary rewards using a Beta distribution to stabilize training and improve reasoning in large language models, achieving state-of-the-art performance.
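One plausible, explicitly hypothetical reading of Beta-based reward normalization for binary rewards: maintain Beta parameters that track the policy's recent success rate and standardize rewards with the Beta mean and standard deviation, so the normalization adapts as the policy changes. This is not BNPO's exact update rule, only an illustration of the adaptive-normalization idea.

```python
import numpy as np

class BetaRewardNormalizer:
    """Keep Beta(a, b) parameters tracking the current success rate of
    binary rewards, and standardize rewards by the Beta mean and std."""
    def __init__(self, a=1.0, b=1.0, decay=0.99):
        self.a, self.b, self.decay = a, b, decay

    def update(self, rewards):                        # rewards in {0, 1}
        rewards = np.asarray(rewards, dtype=float)
        self.a = self.decay * self.a + rewards.sum()
        self.b = self.decay * self.b + (1.0 - rewards).sum()

    def normalize(self, rewards):
        mean = self.a / (self.a + self.b)
        var = (self.a * self.b) / ((self.a + self.b) ** 2 * (self.a + self.b + 1.0))
        return (np.asarray(rewards, dtype=float) - mean) / np.sqrt(var + 1e-8)

norm = BetaRewardNormalizer()
group_rewards = [1, 0, 0, 1, 1]              # e.g. verifier scores for one prompt
norm.update(group_rewards)
advantages = norm.normalize(group_rewards)   # used in a REINFORCE-style update
```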
Authors:Ekaterina Grishina, Mikhail Gorbunov, Maxim Rakhuba
Abstract:
Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at https://github.com/GrishKate/ProcrustesGPT
中文: 大语言模型可通过利用保持输出不变的正交变换,采用结构化矩阵进行压缩,从而实现无需微调的高效参数削减。
English: Large language models can be compressed using structured matrices by leveraging orthogonal transformations that maintain output invariance, enabling efficient parameter reduction without fine-tuning.
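A small numerical check of the invariance the approach relies on: rotating the hidden space between two consecutive linear maps with an orthogonal Q leaves the output unchanged, so Q can instead be chosen to make the rotated weights more compressible within a structured class. Real transformers require care around nonlinearities and normalization layers, which this demo ignores.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 32)), rng.normal(size=(32, 64))
x = rng.normal(size=32)

Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))        # random orthogonal matrix
y_original = W2 @ (W1 @ x)
y_rotated = (W2 @ Q.T) @ ((Q @ W1) @ x)               # transformed weight pair
assert np.allclose(y_original, y_rotated)              # output is unchanged
```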
Authors:Changyi Xiao, Yixin Cao
Abstract:
Knowledge graph completion (KGC) can be framed as a 3-order binary tensor completion task. Tensor decomposition-based (TDB) models have demonstrated strong performance in KGC. In this paper, we provide a summary of existing TDB models and derive a general form for them, serving as a foundation for further exploration of TDB models. Despite the expressiveness of TDB models, they are prone to overfitting. Existing regularization methods merely minimize the norms of embeddings to regularize the model, leading to suboptimal performance. Therefore, we propose a novel regularization method for TDB models that addresses this limitation. The regularization is applicable to most TDB models and ensures tractable computation. Our method minimizes the norms of intermediate variables involved in the different ways of computing the predicted tensor. To support our regularization method, we provide a theoretical analysis that proves its effect in promoting low trace norm of the predicted tensor to reduce overfitting. Finally, we conduct experiments to verify the effectiveness of our regularization technique as well as the reliability of our theoretical analysis. The code is available at https://github.com/changyi7231/IVR.
中文: 本文提出了一种新的基于张量分解的知识图谱补全模型的正则化方法,通过最小化中间变量的范数来有效减少过拟合,并辅以理论分析和实验验证。
English: This paper introduces a novel regularization method for tensor decomposition-based knowledge graph completion models that minimizes the norms of intermediate variables to effectively reduce overfitting, supported by theoretical analysis and experimental validation.
Authors:Jan Robine, Marc Höftmann, Stefan Harmeling
Abstract:
What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF's connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark.
Chinese: 本文提出SGF这一简洁高效的世界模型,通过自监督学习和数据增强技术,在Atari 100k基准测试中无需复杂架构即实现了优异性能。
English: This paper introduces SGF, a simple and efficient world model that uses self-supervised learning and data augmentation to achieve strong performance on the Atari 100k benchmark without complex architectures.
Authors:Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li
Abstract:
Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggle to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-$k$ Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the top-$k$ attention process. Unlike existing top-$k$ attention methods, which seek an absolute estimate of the query-key score, typically at great cost, HATA maps queries and keys into binary hash codes and obtains the relative query-key score order at very low cost, which is sufficient for realizing top-$k$ attention. Extensive experiments demonstrate that HATA achieves up to 7.2$\times$ speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-$k$ attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at https://github.com/gpzlx1/HATA.
中文摘要:HATA提出了一种哈希感知的Top-k注意力机制,通过将查询和键高效映射为二进制编码,在保持模型精度的同时实现了高达7.2倍的加速效果。
English Summary: HATA introduces a hash-aware top-k attention mechanism that accelerates LLM inference by efficiently mapping queries and keys to binary codes, achieving up to 7.2× speedup while maintaining accuracy.
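For context, the general hash-then-select recipe can be sketched as follows; this toy example uses random sign projections and Hamming similarity rather than HATA's learned hash, and all names are hypothetical.

```python
import torch

def hash_topk_attention(q, K, V, proj, k=32):
    """Illustrative hash-based top-k attention for a single query.

    q: (d,), K: (n, d), V: (n, d_v), proj: (d, b) random hyperplanes.
    Binary codes stand in for a learned hash; ranking uses Hamming
    similarity as a cheap proxy for the relative qk score order.
    """
    q_code = (q @ proj > 0)           # (b,) boolean code
    k_codes = (K @ proj > 0)          # (n, b)

    # Hamming similarity = number of agreeing bits.
    sims = (k_codes == q_code).sum(dim=-1)          # (n,)
    topk = sims.topk(min(k, K.shape[0])).indices    # candidate key indices

    # Exact attention restricted to the selected keys.
    logits = (K[topk] @ q) / K.shape[-1] ** 0.5
    weights = torch.softmax(logits, dim=-1)
    return weights @ V[topk]
```

The cheap binary comparison only has to preserve the ordering of scores, which is why a low-precision code can replace an exact qk estimate for selection.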
Authors:Timo Osterburg, Franz Albers, Christopher Diehl, Rajesh Pushparaj, Torsten Bertram
Abstract:
The fusion of sensor data is essential for a robust perception of the environment in autonomous driving. Learning-based fusion approaches mainly use feature-level fusion to achieve high performance, but their complexity and hardware requirements limit their applicability in near-production vehicles. High-level fusion methods offer robustness with lower computational requirements. Traditional methods, such as the Kalman filter, dominate this area. This paper modifies the Adapted Kalman Filter (AKF) and proposes a novel transformer-based high-level object fusion method called HiLO. Experimental results demonstrate improvements of $25.9$ percentage points in $\textrm{F}_1$ score and $6.1$ percentage points in mean IoU. Evaluation on a new large-scale real-world dataset demonstrates the effectiveness of the proposed approaches. Their generalizability is further validated by cross-domain evaluation between urban and highway scenarios. Code, data, and models are available at https://github.com/rst-tu-dortmund/HiLO .
中文: 本文提出了一种名为HiLO的基于Transformer的高层目标融合方法,通过将F1分数提升25.9个百分点、平均交并比提升6.1个百分点,显著增强了自动驾驶的环境感知能力,并在城市与高速公路场景中验证了其通用性。
English: This paper introduces HiLO, a transformer-based high-level object fusion method that enhances autonomous driving perception by improving the F1 score by 25.9 percentage points and mean IoU by 6.1 percentage points, validated across urban and highway scenarios.
Authors:Niklas Kormann, Masoud Ramuz, Zeeshan Nisar, Nadine S. Schaadt, Hendrik Annuth, Benjamin Doerr, Friedrich Feuerhake, Thomas Lampert, Johannes F. Lutzeyer
Abstract:
Graph Neural Networks (GNNs) have recently been found to excel in histopathology. However, an important histopathological task, where GNNs have not been extensively explored, is the classification of glomeruli health, a key indicator in nephropathology. This task presents unique difficulties, particularly for the graph construction, i.e., the identification of nodes, edges, and informative features. In this work, we propose a pipeline composed of different traditional and machine learning-based computer vision techniques to identify nodes, edges, and their corresponding features to form a heterogeneous graph. We then propose a novel heterogeneous GNN architecture for glomeruli classification, called HIEGNet, that integrates both glomeruli and their surrounding immune cells. Hence, HIEGNet is able to consider the immune environment of each glomerulus in its classification. Our HIEGNet was trained and tested on a dataset of Whole Slide Images from kidney transplant patients. Experimental results demonstrate that HIEGNet outperforms several baseline models and generalises best across patients among all evaluated models. Our implementation is publicly available at https://github.com/nklsKrmnn/HIEGNet.git.
中文: 图神经网络在组织病理学中表现优异,提出的HIEGNet模型通过整合免疫细胞数据有效分类肾小球健康状况,在肾移植患者数据分析中优于基线模型。
English: Graph Neural Networks (GNNs) show strong performance in histopathology, and the proposed HIEGNet model effectively classifies glomeruli health by integrating immune cell data, outperforming baseline models in kidney transplant patient analysis.
Authors:Seulgi Kim, Ghazal Kaviani, Mohit Prabhushankar, Ghassan AlRegib
Abstract:
Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning and the handling of inherent uncertainty. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce \textit{Multi-level and Multi-modal Action Anticipation (m\&m-Ant)}, a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08\% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: https://github.com/olivesgatech/mM-ant.
中文摘要:本文提出m&m-Ant多模态方法,通过融合视觉文本线索与分层语义建模,将动作预测准确率较现有方法平均提升3.08%。
English Summary: The paper introduces m&m-Ant, a novel multi-modal approach that integrates visual and textual cues with hierarchical modeling to improve action anticipation accuracy by 3.08% over existing methods.
Authors:Qin Xie, Qinghua Zhang, Shuyin Xia
Abstract:
Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, the sampling method based on granular-ball (GB) has shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents GB overlaps through constrained expansion and preserves a precise geometric representation of the redefined GBs. Second, based on the concept of heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class-noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs strongly on class-noise datasets without requiring an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at https://github.com/CherylTse/GBABS.
中文:本文提出一种基于粒球的近似边界采样方法,通过受限扩散避免粒球重叠,无需纯度阈值即可提升噪声数据集的分类性能。
English: This paper introduces a granular-ball-based approximate borderline sampling method that prevents overlaps through restricted diffusion and enhances classification performance on noisy datasets without requiring purity thresholds.
Authors:Andre He, Daniel Fried, Sean Welleck
Abstract:
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release
中文:强化学习算法如GRPO可能仅优化语言模型现有解题能力的分布,但通过引入罕见解奖励可纠正这种偏差,在定理证明中提升模型表现。
English: Reinforcement learning algorithms like GRPO may only refine a language model's existing problem-solving distribution rather than expanding its capabilities, but introducing an unlikeliness reward can counteract this bias and improve performance in theorem proving.
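A toy sketch of the up-weighting idea, assuming group sampling with a binary verifier; the rank-based bonus below is an illustrative stand-in, not the paper's exact unlikeliness reward.

```python
import torch

def unlikeliness_adjusted_rewards(logprobs, correct, beta=0.5):
    """Toy group-level reward shaping that up-weights rare correct samples.

    logprobs: (G,) sequence log-probabilities of G sampled solutions.
    correct:  (G,) 1.0 if the verifier accepts the solution, else 0.0.
    Correct solutions that the policy currently assigns low probability
    receive a bonus, so updates do not only sharpen already-likely modes.
    """
    ranks = logprobs.argsort(descending=True).argsort().float()  # 0 = most likely
    rarity = ranks / max(len(logprobs) - 1, 1)                   # in [0, 1]
    rewards = correct * (1.0 + beta * rarity)

    # GRPO-style group normalization into advantages.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return advantages
```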
Authors:Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen
Abstract:
Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies exhibit limitations by neglecting the intrinsic learning signals generated by the model itself, thus leading to suboptimal training regimes. In this paper, we identify a model-inherent signal termed angle concentration that effectively reflects an LLM's capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)'s efficient sampling yields data-efficient training, achieving better performance with half the original data compared to vanilla GRPO with full training data. Code is released at https://github.com/wangqinsi1/GAINRL/tree/main.
中文: 当前大型语言模型的强化微调因数据冗余而效率低下,但提出的GAIN-RL框架利用模型内在的角度集中信号动态选择训练数据,显著提升了训练效率,仅用一半数据即可获得更优性能。
English: Current Reinforcement Fine-tuning for LLMs is inefficient due to redundant data exposure, but the proposed GAIN-RL framework uses the model's intrinsic angle concentration signal to dynamically select data, significantly boosting training efficiency and performance with less data.
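As a rough illustration of an angle-based data-selection signal, the sketch below scores a sample by the mean pairwise cosine similarity of its token hidden states; the exact definition used by GAIN-RL may differ, and the names here are placeholders.

```python
import torch
import torch.nn.functional as F

def angle_concentration(hidden_states):
    """Proxy score for how concentrated token hidden-state directions are.

    hidden_states: (seq_len, d) hidden vectors of one training sample.
    Returns the mean pairwise cosine similarity; higher values indicate
    more concentrated angles. Used here only as an illustrative ranking
    signal for data selection, not the paper's exact definition.
    """
    h = F.normalize(hidden_states, dim=-1)       # unit vectors
    cos = h @ h.t()                              # (seq_len, seq_len)
    n = cos.shape[0]
    off_diag = cos.sum() - cos.diagonal().sum()  # drop self-similarities
    return off_diag / max(n * (n - 1), 1)

# Hypothetical usage: rank a pool of samples and train on the most
# concentrated ones first.
# scores = torch.stack([angle_concentration(h) for h in pooled_hidden_states])
# order = scores.argsort(descending=True)
```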
Authors:Asha Ramanujam, Adam Elyoumi, Hao Chen, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Shraman Pal, Dimitri J. Papageorgiou, Can Li
Abstract:
Most existing safe reinforcement learning (RL) benchmarks focus on robotics and control tasks, offering limited relevance to high-stakes domains that involve structured constraints, mixed-integer decisions, and industrial complexity. This gap hinders the advancement and deployment of safe RL in critical areas such as energy systems, manufacturing, and supply chains. To address this limitation, we present SafeOR-Gym, a benchmark suite of nine operations research (OR) environments tailored for safe RL under complex constraints. Each environment captures a realistic planning, scheduling, or control problem characterized by cost-based constraint violations, planning horizons, and hybrid discrete-continuous action spaces. The suite integrates seamlessly with the Constrained Markov Decision Process (CMDP) interface provided by OmniSafe. We evaluate several state-of-the-art safe RL algorithms across these environments, revealing a wide range of performance: while some tasks are tractable, others expose fundamental limitations in current approaches. SafeOR-Gym provides a challenging and practical testbed that aims to catalyze future research in safe RL for real-world decision-making problems. The SafeOR-Gym framework and all accompanying code are available at: https://github.com/li-group/SafeOR-Gym.
Chinese: SafeOR-Gym推出了包含九个运筹学环境的基准测试套件,旨在通过模拟具有结构化约束和混合决策空间的现实问题,推动安全强化学习在复杂工业场景中的应用,弥补现有机器人领域基准的不足。
English: SafeOR-Gym introduces a benchmark suite of nine operations research environments to advance safe reinforcement learning in complex, real-world domains with structured constraints and hybrid decision spaces, addressing limitations in existing robotics-focused benchmarks.
Authors:Johannes Schusterbauer, Ming Gui, Frank Fundel, Björn Ommer
Abstract:
Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at https://github.com/CompVis/diff2flow.
中文: Diff2Flow通过重新调整时间步长和对齐插值,将预训练扩散模型的知识高效迁移至流匹配,实现了无需额外计算的卓越微调性能。
English: Diff2Flow efficiently transfers knowledge from pre-trained diffusion models to flow matching by aligning their paradigms, enabling superior finetuning performance without extra computation.
Authors:Navid NaderiAlizadeh, Darian Salehi, Xinran Liu, Soheil Kolouri
Abstract:
Sliced Wasserstein (SW) distances offer an efficient method for comparing high-dimensional probability measures by projecting them onto multiple 1-dimensional probability distributions. However, identifying informative slicing directions has proven challenging, often necessitating a large number of slices to achieve desirable performance and thereby increasing computational complexity. We introduce a constrained learning approach to optimize the slicing directions for SW distances. Specifically, we constrain the 1D transport plans to approximate the optimal plan in the original space, ensuring meaningful slicing directions. By leveraging continuous relaxations of these transport plans, we enable a gradient-based primal-dual approach to train the slicer parameters, alongside the remaining model parameters. We demonstrate how this constrained slicing approach can be applied to pool high-dimensional embeddings into fixed-length permutation-invariant representations. Numerical results on foundation models trained on images, point clouds, and protein sequences showcase the efficacy of the proposed constrained learning approach in learning more informative slicing directions. Our implementation code can be found at https://github.com/Stranja572/constrainedswe.
Chinese: 该约束学习方法通过将一维传输计划与原始空间的最优计划对齐,优化了切片瓦瑟斯坦距离的切片方向,实现了基于梯度的训练,并在多种数据类型上获得了更具信息量的切片结果。
English: The proposed constrained learning approach optimizes slicing directions for Sliced Wasserstein distances by aligning 1D transport plans with the original space's optimal plan, enabling gradient-based training and yielding more informative slices across various data types.
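For background, a plain Monte-Carlo sliced Wasserstein distance with random slices looks like the sketch below; in the constrained approach described above, the projection directions `theta` would instead be learnable parameters trained with the primal-dual scheme, which is not reproduced here.

```python
import torch

def sliced_wasserstein(x, y, n_slices=64):
    """Monte-Carlo sliced Wasserstein-2 distance between two point clouds.

    x: (n, d), y: (n, d) samples from two distributions (equal sample
    counts for simplicity). Random unit directions are used here; a
    learned slicer would replace `theta` with trainable parameters.
    """
    d = x.shape[1]
    theta = torch.randn(d, n_slices)
    theta = theta / theta.norm(dim=0, keepdim=True)   # unit directions

    # 1D optimal transport reduces to matching sorted projections.
    x_proj = (x @ theta).sort(dim=0).values
    y_proj = (y @ theta).sort(dim=0).values
    return ((x_proj - y_proj) ** 2).mean().sqrt()
```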
Authors:Michael Li, Nishant Subramani
Abstract:
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how 25 models - from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) - represent lexical identity and inflectional morphology across six typologically diverse languages. Using linear and nonlinear classifiers trained on hidden activations, we predict word lemmas and inflectional features layer by layer. We find that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout. Additional experiments probe the nature of these encodings: attention and residual analyses examine where within layers information can be recovered, steering vector experiments test what information can be functionally manipulated, and intrinsic dimensionality analyses explore how the representational structure evolves across layers. Remarkably, these encoding patterns emerge across all models we test, despite differences in architecture, size, and training regime (pretrained and instruction-tuned variants). This suggests that, even with substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties are important for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
中文: 研究表明,无论架构如何差异,Transformer语言模型都一致地在早期层线性编码词汇信息,在后期层非线性编码,同时保持屈折形态信息在各层中均匀可访问且线性可分。
English: This study reveals that transformer language models consistently encode lexical information linearly in early layers and nonlinearly in later layers, while maintaining inflectional morphology as uniformly accessible linear representations across all layers, regardless of architectural differences.
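A minimal sketch of the layer-wise probing setup, assuming hidden activations have already been extracted per layer; the scikit-learn probe and the train/test split are generic choices, not the study's exact protocol.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(layer_activations, labels):
    """Fit one linear probe per layer and report held-out accuracy.

    layer_activations: list of (n_tokens, d) arrays, one per layer.
    labels: (n_tokens,) array of lemma or inflection labels.
    Comparing these linear-probe accuracies with a nonlinear probe
    (e.g., an MLP) is what reveals the linear-vs-nonlinear pattern
    reported above.
    """
    accs = []
    for acts in layer_activations:
        x_tr, x_te, y_tr, y_te = train_test_split(
            acts, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        accs.append(clf.score(x_te, y_te))
    return accs
```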
Authors:Xuefeng Jiang, Tian Wen, Zhiqin Yang, Lvhua Wu, Yufeng Chen, Sheng Sun, Yuwei Wang, Min Liu
Abstract:
In recent years, federated learning (FL) has made significant advances in privacy-sensitive applications. However, it can be hard to ensure that FL participants provide well-annotated data for training. The corresponding annotations from different clients often contain complex label noise at varying levels. This label noise issue has a substantial impact on the performance of the trained models, and clients with higher noise levels are largely responsible for this degradation. To this end, it is necessary to develop an effective optimization strategy to alleviate the adverse effects of these noisy clients. In this study, we present a two-stage optimization framework, MaskedOptim, to address this intricate label noise problem. The first stage is designed to facilitate the detection of noisy clients with higher label noise rates. The second stage focuses on rectifying the labels of the noisy clients' data through an end-to-end label correction mechanism, aiming to mitigate the negative impacts caused by misinformation within datasets. This is achieved by learning the potential ground-truth labels of the noisy clients' datasets via backpropagation. To further enhance the training robustness, we apply geometric median based model aggregation instead of the commonly-used vanilla averaged model aggregation. We implement sixteen related methods and conduct evaluations on three image datasets and one text dataset with diverse label noise patterns for a comprehensive comparison. Extensive experimental results indicate that our proposed framework shows its robustness in different scenarios. Additionally, our label correction framework effectively enhances the data quality of the detected noisy clients' local datasets. Our codes are available via https://github.com/Sprinter1999/MaskedOptim.
Chinese: 联邦学习面临来自不同客户端的复杂标签噪声问题,这会降低模型性能,本研究提出了MaskedOptim框架,通过检测噪声客户端并利用反向传播纠正其标签,有效提升了训练鲁棒性和数据质量。
English: Federated learning faces challenges from complex label noise across clients, which degrades model performance, and this study introduces MaskedOptim, a two-stage framework that detects noisy clients and corrects their labels via backpropagation to improve robustness and data quality.
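The geometric-median aggregation step can be sketched with standard Weiszfeld iterations over flattened client models; this is a generic implementation of the aggregation rule named in the abstract, not code from the MaskedOptim repository.

```python
import torch

def geometric_median(client_params, n_iters=50, eps=1e-8):
    """Weiszfeld iterations for geometric-median model aggregation.

    client_params: (n_clients, n_params) flattened client model vectors.
    Returns a single aggregated vector that is more robust to a minority
    of noisy clients than the plain mean used in vanilla averaging.
    """
    median = client_params.mean(dim=0)
    for _ in range(n_iters):
        dists = (client_params - median).norm(dim=1).clamp_min(eps)
        weights = 1.0 / dists
        median = (weights[:, None] * client_params).sum(dim=0) / weights.sum()
    return median
```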
Authors:Xu Zhang, Haoye Qiu, Weixuan Liang, Hui Liu, Junhui Hou, Yuheng Jia
Abstract:
Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive generalization error and excess risk bounds, both converging at a rate of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than $\log n$, i.e., $m,n\to \infty, m\gg \log n$, ensemble clustering is consistent. Furthermore, since $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to finite clusterings, we minimize the error between the empirical average clusterings and their expectation. From this, we theoretically demonstrate that to achieve better clustering performance, we should minimize the deviation (bias) of base clustering from its expectation and maximize the differences (diversity) among various base clusterings. Additionally, we derive that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at https://github.com/xuz2019/GPEC.
中文: 本文通过推导泛化误差界并证明特定条件下的聚类一致性,为集成聚类建立了理论基础,同时提出新算法在多个数据集上实现了6%以上的性能提升。
English: This paper establishes theoretical foundations for ensemble clustering by deriving generalization error bounds and proving consistency under certain conditions, while proposing a new algorithm that achieves state-of-the-art performance improvements across multiple datasets.
Authors:Shuo Yan, Yuliang Yan, Bin Ma, Chenao Li, Haochun Tang, Jiahua Lu, Minhua Lin, Yuyuan Feng, Hui Xiong, Enyan Dai
Abstract:
Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain-specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce $\textbf{Protap}$, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme-catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain-specific models and general architectures under multiple pretraining settings. Our empirical studies imply that: (i) Though large-scale pretraining encoders achieve great results, they often underperform supervised encoders trained on small downstream training sets. (ii) Incorporating structural information during downstream fine-tuning can match or even outperform protein language models pretrained on large-scale sequence corpora. (iii) Domain-specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at https://github.com/Trust-App-AI-Lab/protap.
中文:本文介绍了Protap基准测试,通过系统评估多种蛋白质建模方法,发现在特定任务中监督式编码器和结构信息整合能超越大规模预训练模型的性能。
English: This paper introduces Protap, a benchmark that systematically evaluates various protein modeling approaches, revealing that supervised encoders and structural information integration can outperform large-scale pretrained models in specific tasks.
Authors:Qingyu Xiao, Yuanlin Chang, Youtian Du
Abstract:
Effective agent exploration remains a core challenge in reinforcement learning (RL) for complex discrete state-space environments, particularly under partial observability. This paper presents a decoupled hierarchical RL framework integrating state abstraction (DcHRL-SA) to address this issue. The proposed method employs a dual-level architecture, consisting of a high-level RL-based actor and a low-level rule-based policy, to promote effective exploration. Additionally, a state abstraction method is incorporated to cluster discrete states, effectively lowering the state dimensionality. Experiments conducted in two customized discrete grid environments demonstrate that the proposed approach consistently outperforms PPO in terms of exploration efficiency, convergence speed, cumulative reward, and policy stability. These results demonstrate a practical approach for integrating decoupled hierarchical policies and state abstraction in discrete grids with large exploration spaces. Code will be available at https://github.com/XQY169/DcHRL-SA.
中文: 本文提出了一种结合状态抽象的分离式分层强化学习框架(DcHRL-SA),通过高层强化学习策略与底层规则策略的双层架构及状态聚类降维,在复杂离散网格环境中显著提升了探索效率、收敛速度和策略稳定性,实验证明其性能全面优于PPO算法。
English: This paper introduces a decoupled hierarchical reinforcement learning framework with state abstraction (DcHRL-SA) that enhances exploration efficiency in complex discrete environments by combining high-level RL policies with low-level rule-based actions and reducing state dimensionality, demonstrating superior performance over PPO in grid world experiments.
Authors:Xinxu Wei, Kanhao Zhao, Yong Jiao, Lifang He, Yu Zhang
Abstract:
As large language models (LLMs) continue to revolutionize AI research, there is a growing interest in building large-scale brain foundation models to advance neuroscience. While most existing brain foundation models are pre-trained on time-series signals or connectome features, we propose a novel graph-based pre-training paradigm for constructing a brain graph foundation model. In this paper, we introduce the Brain Graph Foundation Model, termed BrainGFM, a unified framework that leverages graph contrastive learning and graph masked autoencoders for large-scale fMRI-based pre-training. BrainGFM is pre-trained on a diverse mixture of brain atlases with varying parcellations, significantly expanding the pre-training corpus and enhancing the model's ability to generalize across heterogeneous fMRI-derived brain representations. To support efficient and versatile downstream transfer, we integrate both graph prompts and language prompts into the model design, enabling BrainGFM to flexibly adapt to a wide range of atlases, neurological and psychiatric disorders, and task settings. Furthermore, we employ meta-learning to optimize the graph prompts, facilitating strong generalization to previously unseen disorders under both few-shot and zero-shot learning conditions via language-guided prompting. BrainGFM is pre-trained on 27 neuroimaging datasets spanning 25 common neurological and psychiatric disorders, encompassing 2 types of brain atlases (functional and anatomical) across 8 widely-used parcellations, and covering over 25,000 subjects, 60,000 fMRI scans, and a total of 400,000 graph samples aggregated across all atlases and parcellations. The code is available at: https://github.com/weixinxu666/BrainGFM
中文: 本文提出BrainGFM这一基于图结构的脑科学基础模型,通过图对比学习和掩码自编码器进行大规模fMRI预训练,结合图提示与语言提示实现跨多种脑疾病与任务的灵活适配。
English: This paper introduces BrainGFM, a novel graph-based foundation model for neuroscience that uses graph contrastive learning and masked autoencoders for large-scale fMRI pre-training, enabling flexible adaptation to various brain disorders and tasks through integrated graph and language prompts.
Authors:Youze Xue, Dian Li, Gang Liu
Abstract:
With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on https://github.com/QQ-MM/QQMM-embed.
Chinese: 本研究提出显式梯度放大器来增强对比学习中困难负样本的梯度,基于LLaVA-OneVision-7B架构和自研QQMM多模态大模型,在MMEB基准测试中实现了最优性能。
English: This study introduces an Explicit Gradient Amplifier to enhance hard negative sample gradients in contrastive learning, achieving state-of-the-art performance on the MMEB benchmark with both the LLaVA-OneVision-7B architecture and a proprietary MLLM called QQMM.
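To illustrate one way gradients of hard negatives can be amplified without changing the forward loss, the sketch below uses a straight-through-style scaling inside an info-NCE loss; the weighting rule is an assumption for illustration, not the paper's exact Explicit Gradient Amplifier.

```python
import torch
import torch.nn.functional as F

def info_nce_with_amplified_hard_negatives(q, pos, negs, tau=0.07, gamma=1.0):
    """Info-NCE loss that scales gradients flowing through hard negatives.

    q: (d,), pos: (d,), negs: (n, d), all L2-normalized embeddings.
    The straight-through-style term keeps the forward loss unchanged while
    multiplying the gradient of each negative logit by (1 + gamma * w),
    where w is the softmax weight of that negative (a proxy for hardness).
    """
    pos_logit = (q @ pos) / tau                # scalar
    neg_logits = (negs @ q) / tau              # (n,)

    with torch.no_grad():
        all_logits = torch.cat([pos_logit.reshape(1), neg_logits])
        w = torch.softmax(all_logits, dim=0)[1:]   # hardness of each negative

    # Forward value equals neg_logits; only the backward pass is scaled.
    amplified = neg_logits + gamma * w * (neg_logits - neg_logits.detach())

    logits = torch.cat([pos_logit.reshape(1), amplified])
    return F.cross_entropy(logits.view(1, -1), torch.zeros(1, dtype=torch.long))
```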
Authors:Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong
Abstract:
Efficient inference of Multi-Head Latent Attention (MLA) is challenging when deploying the DeepSeek-R1 671B model on a single multi-GPU server. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the $M$-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE ($1.25 \times 10^{-5}$) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
中文: 本文提出的FlashMLA-ETAP框架通过高效转置注意力管道重新配置注意力计算,在单台NVIDIA H20 GPU服务器上显著提升了DeepSeek-R1模型的多头潜在注意力推理效率,实现了最高2.78倍的加速比,同时保持了数值稳定性。
English: This paper introduces FlashMLA-ETAP, a novel framework that significantly accelerates Multi-Head Latent Attention inference for the DeepSeek-R1 model on single NVIDIA H20 GPU servers by reducing redundant computations through its Efficient Transpose Attention Pipeline, achieving up to 2.78x speedup while maintaining numerical stability.
Authors:Jennifer Chen, Aidar Myrzakhan, Yaxin Luo, Hassaan Muhammad Khan, Sondos Mahmoud Bsharat, Zhiqiang Shen
Abstract:
Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content. In this work, we introduce $\texttt{DRAG}$, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model's predictions with a structured knowledge graph and ranked evidence, $\texttt{DRAG}$ effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms prior competitive RAG methods such as MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With $\texttt{DRAG}$, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.
中文: DRAG框架通过证据和知识图谱将大型语言模型的知识提炼到小型模型中,在显著降低计算成本的同时,有效提升事实准确性并减少幻觉生成。
English: The DRAG framework effectively distills knowledge from large to small language models using evidence and knowledge graphs, significantly reducing computational costs while enhancing factual accuracy and mitigating hallucinations.
Authors:Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, Daniel Cohen-Or
Abstract:
Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.
Chinese: 本文提出了一种阶段感知提示分解框架,利用大型语言模型解决文本到图像扩散模型中的上下文矛盾问题,通过将代理提示与去噪过程对齐来提升语义准确性。
English: This paper introduces a stage-aware prompt decomposition framework that uses large language models to resolve contextual contradictions in text-to-image diffusion models, improving semantic accuracy by aligning proxy prompts with the denoising process.
Authors:Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat
Abstract:
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [http://s-sahoo.github.io/Eso-LMs](http://s-sahoo.github.io/Eso-LMs)
Authors:Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, Cheng Zhang
Abstract:
We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels -- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Project page: https://amink8.github.io/TaxaDiffusion/
Authors:Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe
Abstract:
Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.
中文: 离散语音单元通过自监督模型提取,适用于高效流式语音应用,但传统方法计算成本高昂;本研究通过缩小注意力窗口和模型规模,在保持性能的同时将浮点运算减少50%,显著提升了资源受限环境下的实时处理潜力。
English: Discrete speech units derived from self-supervised models offer efficient streaming speech applications but face impractical computational demands, which this work addresses by reducing attention windows and model size to achieve 50% fewer FLOPs with minimal performance loss, enhancing real-time processing in resource-limited settings.
Authors:Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene
Abstract:
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.
中文: SmolVLA是一种紧凑高效的视觉语言动作模型,在显著降低计算成本的同时保持竞争力,支持单GPU训练并可在消费级硬件上部署。
English: SmolVLA is a compact and efficient vision-language-action model that significantly reduces computational costs while maintaining competitive performance, enabling training on a single GPU and deployment on consumer hardware.
Authors:Zhao Yang, Jiwei Zhu, Bing Su
Abstract:
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our $\textbf{S}$pecies-$\textbf{P}$rofile $\textbf{A}$daptive $\textbf{C}$ollaborative $\textbf{E}$xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.
中文摘要:与无监督DNA序列预训练相比,基于基因组谱监督训练的SPACE模型通过混合专家机制捕捉跨物种多谱系关联,实现了最优性能表现。
English Summary: Supervised training for genomic profile prediction is more effective than unsupervised DNA sequence pre-training, and the proposed SPACE model with Mixture of Experts achieves state-of-the-art performance by capturing cross-species and multi-profile relationships.
Authors:Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong
Abstract:
We study how training data contributes to the emergence of toxic behaviors in large-language models. Most prior work on reducing model toxicity adopts $reactive$ approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a $proactive$ approach$-$IF-Guide$-$which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity$-$by up to 10$\times$ compared to uncensored models, and up to 3$\times$ compared to baseline alignment methods, e.g., DPO and RAD$-$across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is $not$ $necessary$ for computing influence scores; a million-parameter model$-$with 7.5$\times$ fewer parameters$-$can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
Chinese: 本文提出IF-Guide方法,通过影响函数主动识别并抑制训练数据中的有害标记,无需依赖人工偏好数据即可显著降低模型毒性,其效果优于现有对齐技术。
English: This paper introduces IF-Guide, a proactive method that uses influence functions to detect and suppress harmful tokens in training data, significantly reducing model toxicity without relying on human-preference data and outperforming existing alignment techniques.
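As a simplified stand-in for the influence-function machinery, the sketch below scores training tokens by TracIn-style gradient dot products against a toxicity reference loss; `loss_fn` returning per-token losses is an assumption of this illustration, and the paper's actual adaptation uses a more careful and far cheaper approximation.

```python
import torch

def token_influence_scores(model, train_batch, toxic_ref_batch, loss_fn):
    """Rough gradient-dot-product proxy for token-level influence on toxicity.

    Computes the inner product between the gradient of a toxicity reference
    loss and per-token gradients of the training loss. Tokens with large
    positive scores are candidates for suppression during training.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    ref_loss = loss_fn(model, toxic_ref_batch).mean()
    ref_grads = torch.autograd.grad(ref_loss, params)

    token_losses = loss_fn(model, train_batch)        # (n_tokens,)
    scores = []
    for tl in token_losses:
        g = torch.autograd.grad(tl, params, retain_graph=True)
        scores.append(sum((a * b).sum() for a, b in zip(ref_grads, g)))
    return torch.stack(scores)
```

Looping over tokens is deliberately naive here; it only serves to show which quantity is being measured, not how to compute it at scale.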
Authors:Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
Abstract:
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
中文: 本立场文件提出DataRubrics框架,通过基于量规的指标和LLM驱动的评估,系统性地评判数据集质量,以解决数据研究中原创性、透明度和可复现性不足的问题。
English: This position paper proposes DataRubrics, a structured framework using rubric-based metrics and LLM-powered evaluation to systematically assess dataset quality, addressing gaps in originality, transparency, and reproducibility in data-centric research.
Authors:Yafei Yang, Zihui Zhang, Bo Yang
Abstract:
We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.
中文: 本文提出unMORE这一新型两阶段无监督方法,通过习得物体中心化表征并采用无需网络结构的推理模块,在复杂真实图像上实现了最先进的多目标分割效果,显著超越了现有方法。
English: This paper introduces unMORE, a novel two-stage unsupervised method that learns object-centric representations and uses a network-free reasoning module to achieve state-of-the-art multi-object segmentation on complex real-world images, significantly outperforming existing approaches.
Authors:Zeming Wei, Chengcan Wu, Meng Sun
Abstract:
Large Language Models (LLMs) have achieved significant success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks in generating harmful content and vulnerability to jailbreaking attacks. To analyze and monitor machine learning models, model-based analysis has demonstrated notable potential in stateful deep neural networks, yet suffers from scalability issues when extending to LLMs due to their vast feature spaces. In this paper, we propose ReGA, a model-based analysis framework with representation-guided abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are low-dimensional directions emerging in hidden states that indicate safety-related concepts, ReGA effectively addresses the scalability issue when constructing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety. Our code is available at https://github.com/weizeming/ReGA.
中文摘要:本文提出ReGA框架,通过表征引导的抽象方法构建可扩展的安全模型,有效提升大语言模型对恶意内容的识别与防御能力,在安全性和可解释性方面表现优异。
English Summary: This paper introduces ReGA, a representation-guided abstraction framework that enhances Large Language Model safety by effectively identifying and blocking harmful content through scalable model-based analysis.
Authors:Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil
Abstract:
Efficiently compiling quantum operations remains a major bottleneck in scaling quantum computing. Today's state-of-the-art methods achieve low compilation error by combining search algorithms with gradient-based parameter optimization, but they incur long runtimes and require multiple calls to quantum hardware or expensive classical simulations, making their scaling prohibitive. Recently, machine-learning models have emerged as an alternative, though they are currently restricted to discrete gate sets. Here, we introduce a multimodal denoising diffusion model that simultaneously generates a circuit's structure and its continuous parameters for compiling a target unitary. It leverages two independent diffusion processes, one for discrete gate selection and one for parameter prediction. We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts, circuit depths, and proportions of parameterized gates. Finally, by exploiting its rapid circuit generation, we create large datasets of circuits for particular operations and use these to extract valuable heuristics that can help us discover new insights into quantum circuit synthesis.
中文摘要:本文提出一种多模态去噪扩散模型,能同时生成量子电路结构和连续参数来高效编译目标酉矩阵,突破了现有方法的局限,并实现快速电路生成以构建数据集和提取启发式规则。
English Summary: This paper introduces a multimodal denoising diffusion model that simultaneously generates quantum circuit structures and continuous parameters to efficiently compile target unitaries, overcoming limitations of current methods while enabling rapid circuit generation for dataset creation and heuristic extraction.
Authors:Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai. -Doss
Abstract:
Automatic speech recognition (ASR) systems struggle with dysarthric speech due to high inter-speaker variability and slow speaking rates. To address this, we explore dysarthric-to-healthy speech conversion for improved ASR performance. Our approach extends the Rhythm and Voice (RnV) conversion framework by introducing a syllable-based rhythm modeling method suited for dysarthric speech. We assess its impact on ASR by training LF-MMI models and fine-tuning Whisper on converted speech. Experiments on the Torgo corpus reveal that LF-MMI achieves significant word error rate reductions, especially for more severe cases of dysarthria, while fine-tuning Whisper on converted data has minimal effect on its performance. These results highlight the potential of unsupervised rhythm and voice conversion for dysarthric ASR. Code available at: https://github.com/idiap/RnV
中文: 本研究通过在RnV转换框架中引入音节节奏建模方法,显著改善了针对构音障碍语音的自动语音识别性能,其中LF-MMI模型在严重病例上大幅降低了词错率,而微调Whisper模型效果甚微。
English: This study enhances ASR performance for dysarthric speech by introducing a syllable-based rhythm modeling method within the RnV conversion framework, achieving significant word error rate reductions in severe cases through LF-MMI models while showing minimal impact when fine-tuning Whisper.
Authors:Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, Alexander Mathis
Abstract:
Understanding behavior requires datasets that capture humans while they carry out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens, from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen
中文: EPFL-Smart-Kitchen-30数据集通过同步多模态传感器捕捉烹饪时的人类行为,并设立四项基准测试以推动生态效度行为研究的发展。
English: The EPFL-Smart-Kitchen-30 dataset captures multi-modal human movements during cooking tasks using synchronized sensors to advance behavior understanding through four proposed benchmarks.
Authors:Roman Plaud, Alexandre Perez-Lebel, Matthieu Labeau, Antoine Saillenfest, Thomas Bonald
Abstract:
Hierarchical classification offers an approach to incorporate the concept of mistake severity by leveraging a structured, labeled hierarchy. However, decoding in such settings frequently relies on heuristic decision rules, which may not align with task-specific evaluation metrics. In this work, we propose a framework for the optimal decoding of an output probability distribution with respect to a target metric. We derive optimal decision rules for increasingly complex prediction settings, providing universal algorithms when candidates are limited to the set of nodes. In the most general case of predicting a subset of nodes, we focus on rules dedicated to the hierarchical $hF_β$ scores, tailored to hierarchical settings. To demonstrate the practical utility of our approach, we conduct extensive empirical evaluations, showcasing the superiority of our proposed optimal strategies, particularly in underdetermined scenarios. These results highlight the potential of our methods to enhance the performance and reliability of hierarchical classifiers in real-world applications. The code is available at https://github.com/RomanPlaud/hierarchical_decision_rules
中文摘要:本文提出了一种针对层次分类的最优解码框架,通过将决策规则与任务特定指标对齐,在实证评估中展现出优越性能。
English Summary: This paper introduces an optimal decoding framework for hierarchical classification that aligns decision rules with task-specific metrics, demonstrating improved performance in empirical evaluations.
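As a concrete illustration of the hierarchical metrics these decision rules target, the sketch below computes hierarchical precision, recall, and the $hF_β$ score for a predicted node set using the standard ancestor-augmented definitions; the toy hierarchy and node names are illustrative, not taken from the paper.

```python
# Minimal sketch of the hierarchical hF_beta score (standard ancestor-augmented
# definition); the toy hierarchy below is illustrative, not from the paper.

def ancestors(node, parent):
    """Return the node together with all of its ancestors."""
    out = {node}
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def h_fbeta(pred_nodes, true_nodes, parent, beta=1.0):
    P = set().union(*(ancestors(n, parent) for n in pred_nodes))
    T = set().union(*(ancestors(n, parent) for n in true_nodes))
    overlap = len(P & T)
    hp, hr = overlap / len(P), overlap / len(T)   # hierarchical precision / recall
    if hp == 0 and hr == 0:
        return 0.0
    return (1 + beta**2) * hp * hr / (beta**2 * hp + hr)

# Toy hierarchy: root -> animal -> {dog, cat}
parent = {"animal": "root", "dog": "animal", "cat": "animal"}
print(h_fbeta({"dog"}, {"cat"}, parent, beta=1.0))  # shares 'animal' and 'root'
```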
Authors:Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Abstract:
Stage lighting plays an essential role in live music performances, influencing the engaging experience of both musicians and audiences. Given the high costs associated with hiring or training professional lighting engineers, Automatic Stage Lighting Control (ASLC) has gained increasing attention. However, most existing approaches only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this issue, this paper presents an end-to-end solution that directly learns from experienced lighting engineers -- Skip-BART. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method modifies the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame grid. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. Specifically, our method yields a p-value of 0.72 in a statistical comparison based on human evaluations with human lighting engineers, suggesting that the proposed approach closely matches human lighting engineering performance. To support further research, we have made our self-collected dataset, code, and trained model parameters available at https://github.com/RS2002/Skip-BART.
中文: 本文提出Skip-BART端到端生成模型,通过向专业灯光师学习直接将音频音乐转化为灯光效果,在性能上超越传统规则方法并接近人类灯光师水平。
English: This paper introduces Skip-BART, an end-to-end generative model that directly converts audio music into lighting cues by learning from professional lighting engineers, outperforming traditional rule-based methods and closely matching human performance.
Authors:Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Abstract:
Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF. Code and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
中文: 现有大语言模型在处理包含多重约束的复杂指令时存在困难,而提出的RAIF方法通过基于可验证奖励的强化学习来增强推理能力,显著提升了小规模模型的性能表现。
English: Current large language models struggle with complex instructions containing multiple constraints, but the proposed RAIF method uses reinforcement learning with verifiable rewards to enhance reasoning capabilities, significantly improving performance even in smaller models.
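To make the idea of rule-centric verifiable rewards concrete, here is a minimal sketch (not the paper's implementation): each constraint of a complex instruction is expressed as a checker function, and the reward is the fraction of checkers the response satisfies. The constraint set and checker functions below are hypothetical.

```python
# Hypothetical sketch of a rule-centric verifiable reward: each constraint is a
# programmatic checker, and the reward is the fraction of satisfied constraints.
import json

def check_word_limit(resp, limit=50):
    return len(resp.split()) <= limit

def check_contains_keyword(resp, keyword="summary"):
    return keyword.lower() in resp.lower()

def check_is_json(resp):
    try:
        json.loads(resp)
        return True
    except json.JSONDecodeError:
        return False

CHECKERS = [check_word_limit, check_contains_keyword, check_is_json]

def verifiable_reward(response):
    passed = sum(bool(c(response)) for c in CHECKERS)
    return passed / len(CHECKERS)

print(verifiable_reward('{"summary": "short json answer"}'))  # 1.0
```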
Authors:Xiang Zhao, Ruijie Li, Qiao Ning, Shikai Guo, Hui Li, Qian Ma
Abstract:
The identification of drug-target interactions (DTI) is critical for drug discovery and repositioning, as it reveals potential therapeutic uses of existing drugs, accelerating development and reducing costs. However, most existing models focus only on direct similarity in homogeneous graphs, failing to exploit the rich similarity in heterogeneous graphs. To address this gap, inspired by real-world social interaction behaviors, we propose SOC-DGL, which comprises two specialized modules: the Affinity-Driven Graph Learning (ADGL) module, learning global similarity through an affinity-enhanced drug-target graph, and the Equilibrium-Driven Graph Learning (EDGL) module, capturing higher-order similarity by amplifying the influence of even-hop neighbors using an even-polynomial graph filter based on balance theory. This dual approach enables SOC-DGL to effectively capture similarity information across multiple interaction scales within affinity and association matrices. To address the issue of imbalance in DTI datasets, we propose an adjustable imbalance loss function that adjusts the weight of negative samples via a tunable parameter. Extensive experiments on four benchmark datasets demonstrate that SOC-DGL consistently outperforms existing state-of-the-art methods across both balanced and imbalanced scenarios. Moreover, SOC-DGL successfully predicts the top 9 drugs known to bind ABL1, and further analyzes the 10th drug, which has not been experimentally confirmed to interact with ABL1, providing supporting evidence for its potential binding.
中文: 该研究提出了SOC-DGL模型,通过亲和力与均衡驱动的双模块捕捉药物-靶点相互作用中的多尺度相似性,并采用可调节损失函数解决数据不平衡问题,在实验中展现出优越性能及精准预测能力。
English: The study introduces SOC-DGL, a novel model that captures multi-scale similarities in drug-target interactions using affinity and equilibrium-driven modules, and it addresses dataset imbalance with an adjustable loss function, demonstrating superior performance and accurate predictions in validation experiments.
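The adjustable imbalance loss is described only at a high level in the abstract; a plausible minimal form is a binary cross-entropy in which negative pairs are reweighted by a single parameter, as sketched below (an assumption for illustration, not the paper's exact formulation).

```python
# Sketch of an imbalance-adjustable binary cross-entropy: negative drug-target
# pairs are weighted by a tunable parameter `neg_weight` (assumed form).
import numpy as np

def imbalance_bce(y_true, y_prob, neg_weight=0.3, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    pos_term = y_true * np.log(y_prob)
    neg_term = (1 - y_true) * np.log(1 - y_prob)
    return -np.mean(pos_term + neg_weight * neg_term)

print(imbalance_bce([1, 0, 0, 0], [0.9, 0.2, 0.1, 0.05], neg_weight=0.3))
```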
Authors:Minghao Xu, Jiaze Song, Keming Wu, Xiangxin Zhou, Bin Cui, Wentao Zhang
Abstract:
Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while they neglected the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. We fill this gap by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture interactions ranging from the local atomic level to the global monosaccharide level. To further enhance model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset, deriving the PreGlycanAA model. We design a multi-scale mask prediction algorithm to endow the model with knowledge of different levels of dependencies in a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA. We maintain all resources at https://github.com/kasawa1234/GlycanAA
中文: GlycanAA模型通过将糖链表示为包含单糖骨架结构和原子级细节的异质图,采用分层信息传递和预训练技术,显著提升了糖链性质预测的性能。
English: The GlycanAA model introduces an all-atom approach to glycan modeling by representing glycans as heterogeneous graphs that capture both monosaccharide-level backbone structures and atomic-level details, achieving superior performance through hierarchical message passing and pre-training enhancements.
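The hierarchical message passing can be pictured as a local atom-level pass, pooling into monosaccharide nodes, and a global pass over the backbone graph; the NumPy sketch below is a simplified two-level illustration (mean pooling, tanh updates), not GlycanAA's actual architecture.

```python
# Simplified two-level (atom -> monosaccharide) message passing; illustrative only.
import numpy as np

def hierarchical_mp(atom_x, atom_adj, membership, mono_adj, rng=np.random.default_rng(0)):
    d = atom_x.shape[1]
    W_local, W_global = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    # 1) local pass: atoms aggregate their atomic neighbours
    h_atom = np.tanh(atom_adj @ atom_x @ W_local)
    # 2) pool atoms into the monosaccharide they belong to (mean pooling)
    counts = membership.sum(axis=1, keepdims=True)
    h_mono = (membership @ h_atom) / np.maximum(counts, 1)
    # 3) global pass over the monosaccharide backbone graph
    return np.tanh(mono_adj @ h_mono @ W_global)

atom_x = np.random.rand(6, 4)                 # 6 atoms, 4 features
atom_adj = np.eye(6)                          # trivial atom graph for the demo
membership = np.array([[1, 1, 1, 0, 0, 0],    # monosaccharide A owns atoms 0-2
                       [0, 0, 0, 1, 1, 1]])   # monosaccharide B owns atoms 3-5
mono_adj = np.array([[1, 1], [1, 1]])         # two linked monosaccharides
print(hierarchical_mp(atom_x, atom_adj, membership, mono_adj).shape)  # (2, 4)
```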
Authors:Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An
Abstract:
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance during inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating its capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling. Our code is available at https://github.com/mansicer/self-verification.
中文: 大语言模型因特定后训练生成器与通用奖励模型间的分布差异导致扩展收益有限,我们提出的自验证框架通过单一强化学习过程统一答案生成与验证,无需外部验证器即可实现有效的测试时性能扩展。
English: Large Language Models achieve limited gains from post-training scaling due to distribution mismatches with external reward models, but our proposed self-verification framework unifies generation and verification in a single RL process to enable effective test-time scaling without external verifiers.
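A minimal picture of self-verified test-time scaling, under the assumption that the same model exposes a generation call and a verification score: sample several candidate solutions and keep the one the model itself rates most likely to be correct. The function names below are hypothetical placeholders, not the released API.

```python
# Hypothetical sketch of self-verified best-of-N selection at inference time.
# `generate_solution` and `self_verify_score` stand in for calls to the same
# trained model; they are placeholders, not the repository's actual API.
import random

def best_of_n(question, generate_solution, self_verify_score, n=8):
    candidates = [generate_solution(question) for _ in range(n)]
    scored = [(self_verify_score(question, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy stand-ins so the sketch runs end to end.
def generate_solution(q):
    return f"answer={random.randint(0, 9)}"

def self_verify_score(q, answer):
    return 1.0 if answer.endswith("7") else random.random() * 0.5

print(best_of_n("toy question", generate_solution, self_verify_score))
```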
Authors:Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin
Abstract:
Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultimately leading to enhanced performance across diverse tasks. However, there are numerous ways to model RDBs as graphs, and performance varies significantly depending on the chosen graph model. In our analysis, applying a common heuristic rule for graph modeling leads to up to a 10% drop in performance compared to the best-performing graph model, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 9 automatic RDB-to-graph modeling methods on the 12 tasks over 600x faster than on-the-fly evaluation, which requires repeated model training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.
中文: 近期研究将关系数据库建模为图以捕捉跨表依赖关系,但不同图模型的性能差异显著,为此我们推出了首个基准框架RDB2G-Bench,用于高效可复现地评估这些方法。
English: Recent research models relational databases as graphs to capture cross-table dependencies, but performance varies significantly with different graph models, prompting the introduction of RDB2G-Bench as the first benchmark framework for efficient and reproducible evaluation of these methods.
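For readers unfamiliar with RDB-to-graph modeling, the sketch below builds one of many possible graph views of a toy two-table database: rows become nodes and foreign-key references become edges. The tables and keys are invented for illustration.

```python
# Toy RDB-to-graph construction: rows -> nodes, foreign keys -> edges.
# The two tables and their foreign key are invented for illustration.

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [
    {"id": 10, "customer_id": 1, "total": 30.0},
    {"id": 11, "customer_id": 1, "total": 12.5},
    {"id": 12, "customer_id": 2, "total": 99.0},
]

nodes, edges = {}, []
for row in customers:
    nodes[("customers", row["id"])] = row
for row in orders:
    nodes[("orders", row["id"])] = row
    # foreign key orders.customer_id -> customers.id becomes an edge
    edges.append((("orders", row["id"]), ("customers", row["customer_id"])))

print(len(nodes), "nodes,", len(edges), "edges")  # 5 nodes, 3 edges
```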
Authors:Haoyu Li, Xiangru Zhong, Bin Hu, Huan Zhang
Abstract:
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their framework. In this work, we propose a novel two-stage training framework to jointly synthesize the controller and Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training data sampling strategy and a domain updating mechanism that significantly reduce the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend the state-of-the-art neural network verifier $\alpha,\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield regions of attraction with volumes $5$ to $1.5\cdot 10^{5}$ times larger than the baselines, and our verification on continuous systems can be up to $40$ to $10000$ times faster than the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
中文: 本文提出了一种两阶段训练框架,通过联合合成控制器与李雅普诺夫函数,在连续时间系统中实现了比现有方法大得多的吸引域估计和快数千倍的验证速度。
English: This paper introduces a two-stage training framework that synthesizes neural network controllers and Lyapunov functions for continuous-time systems, achieving significantly larger regions of attraction and faster verification compared to existing methods.
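For context, the conditions that the jointly learned Lyapunov function $V$ must certify for the closed-loop system $\dot{x} = f(x, \pi(x))$ with equilibrium $x^{*}$ are, in their standard textbook form (not the paper's exact formulation):

```latex
% Standard Lyapunov conditions for the closed-loop system \dot{x} = f(x, \pi(x)).
\begin{align}
V(x^{*}) &= 0, \qquad V(x) > 0 \quad \text{for } x \neq x^{*}, \\
\dot{V}(x) &= \nabla V(x)^{\top} f\big(x, \pi(x)\big) < 0 \quad \text{for } x \neq x^{*},
\end{align}
% so that a sublevel set \{x : V(x) \le c\} contained in the verified region
% serves as an estimate of the region of attraction.
```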
Authors:Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
Abstract:
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to $256$), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@$1$ but degrades performance at higher $k$, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
中文: 带有可验证奖励的强化学习(RLVR)通过惩罚错误答案来有效训练语言模型在推理任务上的表现,这种方法能抑制错误生成并重新分配概率给其他合理选项,在多项评估指标上常达到或超越传统方法的效果。
English: Reinforcement learning with verifiable rewards (RLVR) effectively trains language models on reasoning tasks by penalizing incorrect responses, which suppresses wrong answers and redistributes probability to other plausible options, often matching or surpassing traditional methods while improving performance across various evaluation metrics.
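A simplified REINFORCE-style view of the decomposition (not the exact PPO/GRPO objectives used in the paper): with rewards of +1 for correct and -1 for incorrect samples, separate coefficients on the two groups recover PSR-only, NSR-only, or the upweighted-NSR variant.

```python
# Simplified REINFORCE-style sketch of PSR/NSR reweighting (illustrative only):
# lambda_pos scales reinforcement of correct samples, lambda_neg scales the
# penalty on incorrect ones; (1, 0) ~ PSR-only, (0, 1) ~ NSR-only.
import numpy as np

def psr_nsr_loss(logprob_sums, correct, lambda_pos=1.0, lambda_neg=1.0):
    logprob_sums = np.asarray(logprob_sums, dtype=float)  # sum of token log-probs per sample
    correct = np.asarray(correct, dtype=bool)
    signed_reward = np.where(correct, lambda_pos, -lambda_neg)
    # In an autodiff framework, minimizing this quantity pushes up the
    # log-probability of correct samples and pushes down incorrect ones.
    return -np.mean(signed_reward * logprob_sums)

print(psr_nsr_loss([-3.2, -7.5, -4.1], [True, False, True], lambda_pos=0.5, lambda_neg=1.0))
```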
Authors:Zeming Li, Xiangyue Liu, Xiangyu Zhang, Ping Tan, Heung-Yeung Shum
Abstract:
Diffusion models have emerged as powerful generative frameworks, creating data samples by progressively denoising an initial random state. Traditionally, this initial state is sampled from a simple, fixed distribution like isotropic Gaussian, inherently lacking structure and a direct mechanism for external control. While recent efforts have explored ways to introduce controllability into the diffusion process, particularly at the initialization stage, they often rely on deterministic or heuristic approaches. These methods can be suboptimal, lack expressiveness, and are difficult to scale or integrate into more sophisticated optimization frameworks. In this paper, we introduce NoiseAR, a novel method for AutoRegressive Initial Noise Prior for Diffusion Models. Instead of a static, unstructured source, NoiseAR learns to generate a dynamic and controllable prior distribution for the initial noise. We formulate the generation of the initial noise prior's parameters as an autoregressive probabilistic modeling task over spatial patches or tokens. This approach enables NoiseAR to capture complex spatial dependencies and introduce learned structure into the initial state. Crucially, NoiseAR is designed to be conditional, allowing text prompts to directly influence the learned prior, thereby achieving fine-grained control over the diffusion initialization. Our experiments demonstrate that NoiseAR can generate initial noise priors that lead to improved sample quality and enhanced consistency with conditional inputs, offering a powerful, learned alternative to traditional random initialization. A key advantage of NoiseAR is its probabilistic formulation, which naturally supports seamless integration into probabilistic frameworks like Markov Decision Processes and Reinforcement Learning. Our code will be available at https://github.com/HKUST-SAIL/NoiseAR/
中文摘要:NoiseAR提出一种自回归方法,为扩散模型学习动态且可控的初始噪声先验,实现文本条件控制并通过结构化初始化提升样本质量。
English Summary: NoiseAR introduces an autoregressive method to learn dynamic and controllable initial noise priors for diffusion models, enabling text-conditioned control and improved sample quality through structured initialization.
Authors:Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng
Abstract:
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms. Code: https://github.com/fudan-birlab/LSHN.
中文:潜在结构Hopfield网络(LSHN)是一种受生物学启发的模型,将Hopfield吸引子动力学融入自编码器,通过动态绑定语义元素形成情景记忆,在多个数据集上展现出对受损输入的卓越回忆性能。
English: The Latent Structured Hopfield Network (LSHN) is a biologically inspired model that integrates Hopfield attractor dynamics into an autoencoder to dynamically bind semantic elements into episodic memories, demonstrating superior performance in recalling corrupted inputs across multiple datasets.
Authors:Jinmei Liu, Fuhong Liu, Jianye Hao, Bo Wang, Huaxiong Li, Chunlin Chen, Zhi Wang
Abstract:
Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Because decision domains involve more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (SICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL
中文摘要:本文提出SICQL框架,通过结合动态规划和世界建模,有效提升了上下文强化学习的性能与泛化能力,尤其在从次优轨迹中学习时表现突出。
English Summary: The paper introduces SICQL, a scalable in-context Q-learning framework that combines dynamic programming and world modeling to enhance reinforcement learning performance and generalization, particularly when learning from suboptimal trajectories.
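Two standard ingredients referenced in the abstract, expectile regression for the value function and advantage-weighted regression for policy extraction, can be written compactly as below. These are generic IQL-style forms, not SICQL's exact losses.

```python
# Generic IQL-style building blocks (not SICQL's exact losses):
# expectile loss for fitting V to an upper expectile of Q, and
# advantage-weighted regression weights for policy extraction.
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    diff = np.asarray(q_values, float) - np.asarray(v_values, float)
    weight = np.where(diff > 0, tau, 1 - tau)   # asymmetric squared error
    return np.mean(weight * diff ** 2)

def awr_weights(q_values, v_values, beta=3.0, clip=100.0):
    advantage = np.asarray(q_values, float) - np.asarray(v_values, float)
    return np.minimum(np.exp(beta * advantage), clip)

q = np.array([1.0, 0.2, -0.5])
v = np.array([0.4, 0.4, 0.4])
print(expectile_loss(q, v, tau=0.7), awr_weights(q, v, beta=3.0))
```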
Authors:Yudong Lu, Yazhe Niu, Shuai Hu, Haolin Wang
Abstract:
CleanS2S is a framework for human-like speech-to-speech interaction that advances conversational AI through single-file implementation and proactive dialogue capabilities. Our system integrates automatic speech recognition, large language models, and text-to-speech synthesis into a unified pipeline with real-time interruption handling, achieving low transition latency through full-duplex websocket connections and non-blocking I/O. Beyond conventional chatbot paradigms, we pioneer a proactive interaction mechanism, which combines memory systems with a Subjective Action Judgement module, enabling five human-like response strategies: interruption, refusal, deflection, silence, and standard response. The memory module dynamically aggregates historical and contextual data to inform interaction decisions. This approach breaks the rigid turn-based convention by allowing system-initiated dialog control and context-aware response selection. We also propose Action Judgement SFT, which assesses input streams to select response strategies. The framework's single-file implementation with atomic configurations offers researchers unprecedented transparency and extensibility for interaction agents. The code of CleanS2S is released at https://github.com/opendilab/CleanS2S.
Chinese: CleanS2S是一个类人语音交互框架,通过统一流程集成实时打断处理和主动对话机制,实现灵活响应策略,并为对话AI提供透明可扩展的单文件实现方案。
English: CleanS2S is a human-like speech-to-speech interaction framework that integrates real-time interruption handling and proactive dialogue mechanisms through a unified pipeline, enabling flexible response strategies and offering transparent, extensible implementation for conversational AI.
Authors:Majdi Hassan, Cristian Gabellini, Hatem Helal, Dominique Beaini, Kirill Neklyudov
Abstract:
Density Functional Theory (DFT) allows for predicting all the chemical and physical properties of molecular systems from first principles by finding an approximate solution to the many-body Schrödinger equation. However, the cost of these predictions becomes infeasible when increasing the scale of the energy evaluations, e.g., when calculating the ground-state energy for simulating molecular dynamics. Recent works have demonstrated that, for substantially large datasets of molecular conformations, Deep Learning-based models can predict the outputs of the classical DFT solvers by amortizing the corresponding optimization problems. In this paper, we propose a novel method that reduces the dependency of amortized DFT solvers on large pre-collected datasets by introducing a self-refining training strategy. Namely, we propose an efficient method that simultaneously trains a deep-learning model to predict the DFT outputs and samples molecular conformations that are used as training data for the model. We derive our method as a minimization of the variational upper bound on the KL-divergence measuring the discrepancy between the generated samples and the target Boltzmann distribution defined by the ground state energy. To demonstrate the utility of the proposed scheme, we perform an extensive empirical study comparing it with the models trained on the pre-collected datasets. Finally, we open-source our implementation of the proposed algorithm, optimized with asynchronous training and sampling stages, which enables simultaneous sampling and training. Code is available at https://github.com/majhas/self-refining-dft.
中文: 本文提出了一种自优化训练方法,通过同时训练模型和采样分子构象,降低了基于深度学习的DFT求解器对大型预收集数据集的依赖,并通过广泛实证研究验证了其有效性。
English: This paper introduces a self-refining training method that reduces the reliance of deep learning-based DFT solvers on large pre-collected datasets by simultaneously training a model and sampling molecular conformations, validated through extensive empirical comparisons.
Authors:Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch
Abstract:
Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
Chinese: 本文提出了一种改进的稀疏自编码器架构,能够学习概念间的语义层次结构,从而提升大型语言模型的重构能力、可解释性及计算效率。
English: This paper introduces a modified sparse autoencoder architecture that learns a semantic hierarchy of concepts, improving reconstruction, interpretability, and computational efficiency in large language models.
Authors:Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi
Abstract:
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
中文: 本文提出了一种双重稳健偏好优化算法,用于基于人类反馈的强化学习,该算法在偏好模型或参考策略任一正确设定时均能保持一致性,相比现有方法展现出更优越的稳健性和性能表现。
English: This paper introduces a doubly robust preference optimization algorithm for reinforcement learning from human feedback that maintains consistency when either the preference model or reference policy is correctly specified, showing superior robustness and performance compared to existing methods.
Authors:Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose ZapFormat, a novel dynamic pruning strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only consistently maintains high-precision compliant outputs but also achieves significant improvements in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
中文摘要:ZapFormat基于Earley算法提出动态剪枝策略,有效降低约束解码的计算负担,使Formatron引擎在保持高合规性的同时,将推理速度提升高达两倍,并适用于多种大语言模型架构。
English Summary: ZapFormat introduces a dynamic pruning strategy based on the Earley algorithm to reduce computational overhead in constrained decoding, enabling the Formatron engine to maintain high compliance while doubling inference speed across various LLM architectures.
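The core mechanism that grammar-constrained decoding engines like Formatron build on, a per-step logits mask over the vocabulary, is easy to show in isolation; the sketch below covers only that generic masking step with a made-up vocabulary, not the Earley-based pruning itself.

```python
# Generic per-step logits masking for constrained decoding: tokens that the
# grammar does not allow at this step get -inf before the softmax.
# The vocabulary and allowed set are made up; Earley-based pruning is not shown.
import numpy as np

def masked_next_token_probs(logits, allowed_token_ids):
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    exp = np.exp(masked - masked[allowed_token_ids].max())
    return exp / exp.sum()

vocab = ["{", "}", '"', "name", ":", "hello", "<eos>"]
logits = np.array([1.2, 0.1, 0.8, 2.0, 0.3, 1.5, 0.2])
allowed = [0]                        # e.g. a JSON grammar forces "{" first
probs = masked_next_token_probs(logits, allowed)
print(vocab[int(np.argmax(probs))])  # {
```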
Authors:Yavuz Bakman, Duygu Nur Yaldiz, Sungmin Kang, Tuo Zhang, Baturalp Buyukates, Salman Avestimehr, Sai Praneeth Karimireddy
Abstract:
Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: https://github.com/duygunuryldz/uncertainty_in_the_wild.
中文: 本研究系统评估了19种大语言模型不确定性估计方法在四个实际部署挑战中的表现,发现它们对阈值选择敏感、易受对抗性提示影响,同时通过集成策略展现出显著改进潜力。
English: This study systematically evaluates 19 LLM uncertainty estimation methods across four practical deployment challenges, revealing their sensitivity to threshold selection, vulnerability to adversarial prompts, and potential improvements through ensembling strategies.
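The ensembling strategy the study finds helpful can be illustrated with a simple recipe (one reasonable choice, not necessarily the paper's exact strategy): min-max normalize each uncertainty score on a calibration set so they share a scale, then average them at test time.

```python
# One simple way to ensemble several uncertainty-estimation (UE) scores:
# min-max normalize each score using calibration data, then average.
# This is an illustrative recipe, not necessarily the paper's exact strategy.
import numpy as np

def fit_minmax(calibration_scores):            # per-method (min, max) on calibration set
    arr = np.asarray(calibration_scores, float)
    return arr.min(axis=0), arr.max(axis=0)

def ensemble_ue(test_scores, mins, maxs, eps=1e-12):
    arr = np.asarray(test_scores, float)
    normalized = (arr - mins) / (maxs - mins + eps)
    return normalized.mean(axis=-1)            # one combined score per query

calib = [[0.1, 5.0], [0.9, 20.0], [0.4, 12.0]]   # 3 queries x 2 UE methods
mins, maxs = fit_minmax(calib)
print(ensemble_ue([[0.5, 15.0]], mins, maxs))    # combined score for a new query
```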
Authors:Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
Abstract:
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
中文摘要:zip2zip框架让大型语言模型能够在推理时动态调整词汇表,通过实时将令牌压缩为可复用的超令牌,使序列长度减少20-60%,并显著提升推理速度。
English Summary: The zip2zip framework enables large language models to dynamically adjust their token vocabulary during inference, reducing sequence length by 20-60% and significantly improving latency through real-time token compression into reusable hypertokens.
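The LZW step at the heart of zip2zip can be shown in isolation: repeated token-ID patterns are incrementally merged into new codes ("hypertokens" in the paper's terminology). The sketch below is plain LZW over a list of token IDs, not the full zip2zip tokenizer or its embedding layer.

```python
# Plain LZW over a sequence of token IDs: repeated patterns get new codes
# (the paper's "hypertokens"). This is only the compression step, not the
# zip2zip embedding layer or training procedure.

def lzw_compress(token_ids, base_vocab_size):
    table = {(i,): i for i in range(base_vocab_size)}  # seed with base tokens
    next_code = base_vocab_size
    out, current = [], (token_ids[0],)
    for tok in token_ids[1:]:
        candidate = current + (tok,)
        if candidate in table:
            current = candidate                 # keep extending a known pattern
        else:
            out.append(table[current])
            table[candidate] = next_code        # register a new hypertoken
            next_code += 1
            current = (tok,)
    out.append(table[current])
    return out, table

ids = [3, 7, 3, 7, 3, 7, 3, 7]
codes, table = lzw_compress(ids, base_vocab_size=100)
print(len(ids), "->", len(codes), "codes:", codes)   # 8 -> 5 codes
```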
Authors:Attila Szász, Balázs Bánhelyi, Márk Jelasity
Abstract:
The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.
中文: 现有的神经网络验证器尽管具备理论上的严密性,但在实际应用中无法确保安全性,因为它们容易受到针对部署环境(如浮点运算特性)设计的对抗性攻击的破坏。
English: Current neural network verifiers fail to ensure practical safety despite theoretical soundness, as demonstrated by their vulnerability to deployment-specific adversarial attacks that exploit environmental features like floating-point operations.
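The gap between theoretical and practical soundness hinges on the fact that floating-point arithmetic depends on details of the deployment environment such as evaluation order; a two-line illustration of that order dependence (not one of the paper's adversarial networks):

```python
# Floating-point addition is not associative, so results depend on the order
# of operations chosen by the deployment environment. Illustration only; the
# paper's adversarial networks exploit such effects in far subtler ways.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)          # False
print(left_to_right, right_to_left)            # 0.6000000000000001 0.6
```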
Authors:Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
Abstract:
Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers, and offers consistent gains and faster convergence over baselines, with various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.
中文: 本文提出SGG优化器包装器,通过动态梯度分组和簇特定缩放改进自适应学习率估计,在不同模型规模下均能提升训练稳定性、加速收敛并增强与参数高效微调技术的兼容性。
English: This paper introduces SGG, an optimizer wrapper that enhances adaptive learning rate estimation through dynamic gradient grouping and cluster-specific scaling, leading to improved training stability, faster convergence, and better compatibility with parameter-efficient fine-tuning across various model sizes.
Authors:Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Abstract:
Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at https://github.com/tmlr-group/DecoupledVP.
Chinese: 本文提出了一种解耦与重加权框架,通过分组描述并使用概率重加权矩阵整合解耦视觉提示(DVP)的输出,解决了现有视觉重编程方法的局限性,在11个数据集上提升了性能并增强了可解释性。
English: This paper introduces a decoupling-and-reweighting framework with decoupled visual prompts (DVP) that address limitations in existing visual reprogramming methods by grouping descriptions and integrating outputs via a probabilistic reweighting matrix, improving performance and interpretability across 11 datasets.
Authors:Qiao Xiao, Boqian Wu, Andrey Poddubnyy, Elena Mocanu, Phuong H. Nguyen, Mykola Pechenizkiy, Decebal Constantin Mocanu
Abstract:
Federated learning (FL) enables collaborative model training across decentralized clients while preserving data privacy, leveraging aggregated updates to build robust global models. However, this training paradigm faces significant challenges due to data heterogeneity and limited local datasets, which often impede effective collaboration. In such scenarios, we identify the Layer-wise Inertia Phenomenon in FL, wherein the middle layers of global model undergo minimal updates after early communication rounds, ultimately limiting the effectiveness of global aggregation. We demonstrate the presence of this phenomenon across a wide range of federated settings, spanning diverse datasets and architectures. To address this issue, we propose LIPS (Layer-wise Inertia Phenomenon with Sparsity), a simple yet effective method that periodically introduces transient sparsity to stimulate meaningful updates and empower global aggregation. Experiments demonstrate that LIPS effectively mitigates layer-wise inertia, enhances aggregation effectiveness, and improves overall performance in various FL scenarios. This work not only deepens the understanding of layer-wise learning dynamics in FL but also paves the way for more effective collaboration strategies in resource-constrained environments. Our code is publicly available at: https://github.com/QiaoXiao7282/LIPS.
中文摘要:本研究揭示了联邦学习中的层间惯性现象,即模型中间层在早期训练后更新停滞,并提出LIPS方法通过周期性稀疏化激活层间更新,有效提升了不同联邦场景下的模型聚合性能。
English Summary: This study identifies the Layer-wise Inertia Phenomenon in federated learning where middle layers stagnate after initial training rounds, and proposes LIPS—a sparsity-based method that reactivates these layers to enhance global model performance across diverse federated scenarios.
Authors:Lennart Bramlage, Cristóbal Curio
Abstract:
Uncertainty quantification is critical in safety-sensitive applications but is often omitted from off-the-shelf neural networks due to adverse effects on predictive performance. Retrofitting uncertainty estimates post-hoc typically requires access to model parameters or gradients, limiting feasibility in practice. We propose a theoretically grounded framework for post-hoc uncertainty estimation in regression tasks by fitting an auxiliary model to both original inputs and frozen model outputs. Drawing from principles of maximum likelihood estimation and sequential parameter fitting, we formalize an exact post-hoc optimization objective that recovers the canonical MLE of Gaussian parameters, without requiring sampling or approximation at inference. While prior work has used model outputs to estimate uncertainty, we explicitly characterize the conditions under which this is valid and demonstrate the extent to which structured outputs can support quasi-epistemic inference. We find that using diverse auxiliary data, such as augmented subsets of the original training data, significantly enhances OOD detection and metric performance. Our hypothesis that frozen model outputs contain generalizable latent information about model error and predictive uncertainty is tested and confirmed. Finally, we ensure that our method maintains proper estimation of input-dependent uncertainty without relying exclusively on base model forecasts. These findings are demonstrated in toy problems and adapted to both UCI and depth regression benchmarks. Code: https://github.com/biggzlar/IO-CUE.
Chinese: 本文提出了一种基于理论的后验不确定性估计框架,通过将辅助模型拟合到原始输入和冻结模型输出中,有效提升了分布外检测性能,并在不依赖基础模型参数的情况下保持了不确定性量化的准确性。
English: This paper introduces a theoretically grounded framework for post-hoc uncertainty estimation in regression by fitting an auxiliary model to inputs and frozen model outputs, which enhances out-of-distribution detection and maintains accurate uncertainty quantification without relying on base model parameters.
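The post-hoc objective recovering the canonical Gaussian MLE corresponds, up to an additive constant, to the Gaussian negative log-likelihood of the residuals; the sketch below shows that loss for an auxiliary head predicting a mean and a log-variance (a generic form, not the authors' full IO-CUE pipeline).

```python
# Generic Gaussian negative log-likelihood for an auxiliary uncertainty head
# that predicts a mean and log-variance per input (not the full IO-CUE method).
import numpy as np

def gaussian_nll(y, mu, log_var):
    y, mu, log_var = (np.asarray(a, float) for a in (y, mu, log_var))
    return 0.5 * np.mean(log_var + (y - mu) ** 2 / np.exp(log_var))

y_true = [1.0, 2.0, 3.0]
mu_pred = [1.1, 1.8, 3.4]          # e.g. frozen base-model outputs
log_var_pred = [-2.0, -1.5, 0.0]   # auxiliary model's predicted log-variance
print(gaussian_nll(y_true, mu_pred, log_var_pred))
```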
Authors:Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang
Abstract:
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
中文: 该摘要介绍了CODEMENV这一评估大语言模型在代码迁移任务中表现的新基准,结果显示平均通过率为26.50%,并揭示了模型对新版本函数的熟练度以及偶尔出现的逻辑不一致问题。
English: This abstract introduces CODEMENV, a new benchmark for evaluating large language models' performance in code migration tasks, revealing an average pass rate of 26.50% and highlighting both their proficiency with newer function versions and occasional logical inconsistencies.
Authors:Nidhi Kowtal, Raviraj Joshi
Abstract:
Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and emotion labeled via a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at https://github.com/l3cube-pune/MarathiNLP
中文摘要:本研究推出L3Cube-MahaEmotions马拉地语情感数据集,通过合成标注与人工验证相结合的方式,证明通用大语言模型(如GPT-4)在低资源情感识别任务中优于微调的BERT模型。
English Summary: This study introduces L3Cube-MahaEmotions, a Marathi emotion dataset combining synthetic LLM annotations with human-verified labels, demonstrating that general-purpose LLMs like GPT-4 outperform fine-tuned BERT models in low-resource emotion recognition tasks.
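A Chain-of-Translation (CoTR) prompt combines translation and labeling in a single request; the template below is an illustrative reconstruction with a placeholder label set, not the exact prompt or labels used for L3Cube-MahaEmotions.

```python
# Illustrative Chain-of-Translation (CoTR) prompt template: translate the
# Marathi sentence to English, then assign one emotion label in one request.
# Reconstruction for illustration only; the label set below is a placeholder.
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust",
            "trust", "anticipation", "love", "neutral", "other"]

def cotr_prompt(marathi_sentence):
    return (
        "Step 1: Translate the following Marathi sentence into English.\n"
        "Step 2: Based on the translation, choose exactly one emotion label "
        f"from this list: {', '.join(EMOTIONS)}.\n"
        f"Marathi sentence: {marathi_sentence}\n"
        "Answer with the translation and the final label."
    )

print(cotr_prompt("<Marathi text here>"))
```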
Authors:Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, Tailin Wu
Abstract:
Modeling complex fluid systems, especially turbulence governed by partial differential equations (PDEs), remains a fundamental challenge in science and engineering. Recently, diffusion-based generative models have gained attention as a powerful approach for these tasks, owing to their capacity to capture long-range dependencies and recover hierarchical structures. However, we present both empirical and theoretical evidence showing that generative models struggle with significant spectral bias and common-mode noise when generating high-fidelity turbulent flows. Here we propose FourierFlow, a novel generative turbulence modeling framework that enhances the frequency-aware learning by both implicitly and explicitly mitigating spectral bias and common-mode noise. FourierFlow comprises three key innovations. Firstly, we adopt a dual-branch backbone architecture, consisting of a salient flow attention branch with local-global awareness to focus on sensitive turbulence areas. Secondly, we introduce a frequency-guided Fourier mixing branch, which is integrated via an adaptive fusion strategy to explicitly mitigate spectral bias in the generative model. Thirdly, we leverage the high-frequency modeling capabilities of the masked auto-encoder pre-training and implicitly align the features of the generative model toward high-frequency components. We validate the effectiveness of FourierFlow on three canonical turbulent flow scenarios, demonstrating superior performance compared to state-of-the-art methods. Furthermore, we show that our model exhibits strong generalization capabilities in challenging settings such as out-of-distribution domains, long-term temporal extrapolation, and robustness to noisy inputs. The code can be found at https://github.com/AI4Science-WestlakeU/FourierFlow.
中文: FourierFlow是一种创新的生成式湍流建模框架,通过双分支架构和频率感知学习解决频谱偏差和共模噪声问题,在湍流场景中展现出卓越性能和强大泛化能力。
English: FourierFlow is a novel generative turbulence modeling framework that addresses spectral bias and common-mode noise through dual-branch architecture and frequency-aware learning, demonstrating superior performance and strong generalization in turbulent flow scenarios.
Authors:Keyuan Cheng, Zijian Kan, Zhixian He, Zhuoran Zhang, Muhammad Asif Ali, Ke Xu, Lijie Hu, Di Wang
Abstract:
Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning, involving one-to-many relationships or multi-step logical intersections. To fill in this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.
中文: 本文提出了COMPKE新基准,通过复杂的现实问题评估大语言模型的知识编辑能力,发现不同模型和方法的效果存在显著差异。
English: This paper introduces COMPKE, a new benchmark designed to evaluate knowledge editing in large language models through complex, real-life questions, revealing significant performance variations across different models and methods.
Authors:Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
Abstract:
Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we state that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.
中文: 近期研究提出低秩信息稀疏微调方法LIFT,通过低秩近似识别并仅更新前5%的主权重,在推理任务中表现优于全参数微调,同时保持高内存效率并显著保留源领域知识。
English: Recent research introduces Low-rank Informed Sparse Fine-Tuning (LIFT), a method that updates only the top 5% of principal weights identified through low-rank approximation, achieving superior reasoning performance and memory efficiency compared to full fine-tuning while better preserving source-domain knowledge.
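The "Principal Weights" selection described in the abstract can be sketched directly: take a low-rank approximation of a weight matrix, then keep the top 5% of entries by magnitude of that approximation as the trainable mask. The NumPy sketch below follows that description; the rank and sparsity values are illustrative.

```python
# Sketch of Principal-Weight selection as described in the abstract: rank-r
# approximation of a weight matrix, then a mask over the top 5% entries by
# magnitude of the approximation. Rank and sparsity here are illustrative.
import numpy as np

def principal_weight_mask(W, rank=8, keep_fraction=0.05):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_lowrank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]
    threshold = np.quantile(np.abs(W_lowrank), 1 - keep_fraction)
    return np.abs(W_lowrank) >= threshold      # boolean mask of weights to update

W = np.random.randn(64, 64)
mask = principal_weight_mask(W, rank=8, keep_fraction=0.05)
print(mask.mean())                             # ~0.05 of entries selected
```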
Authors:Tianze Yang, Tyson Jordan, Ninghao Liu, Jin Sun
Abstract:
We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through a multimodal large language model assessment. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) training context classifiers that effectively determine whether existing objects belong in their context; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics. Our code and data are at: https://github.com/YangTianze009/COinCO.
中文摘要:COinCO数据集通过系统替换图像中的对象创建了97,722张包含上下文一致与不一致场景的图像,为推进计算机视觉中的上下文感知理解和图像取证研究提供了重要基础。
English Summary: The COinCO dataset introduces 97,722 images with systematically replaced objects to study contextual coherence, enabling advancements in context-aware computer vision tasks and image forensics.
Authors:Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang
Abstract:
Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
中文: QoQ-Med-7B/32B 是首个开放通用的临床基础模型,能够联合推理医学图像、时间序列信号和文本报告,并通过创新的领域感知相对策略优化(DRPO)显著提升诊断性能,有效应对临床数据分布不均的问题。
English: QoQ-Med-7B/32B is the first open generalist clinical foundation model that integrates reasoning across medical images, time-series signals, and text reports, utilizing a novel Domain-aware Relative Policy Optimization (DRPO) to enhance diagnostic performance and address data imbalance across clinical specialties.
Authors:Valter Hudovernik, Minkai Xu, Juntong Shi, Lovro Šubelj, Stefano Ermon, Erik Štrumbelj, Jure Leskovec
Abstract:
Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a $2K+$SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, both of which are crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at https://github.com/ValterH/RelDiff.
中文摘要:RelDiff是一种新颖的扩散模型,通过显式建模外键图结构来合成完整的关系数据库,结合属性合成与图生成技术,在生成真实合成数据方面优于现有方法。
English Summary: RelDiff is a novel diffusion model that synthesizes complete relational databases by explicitly modeling foreign key graph structures, combining attribute synthesis with graph generation to outperform existing methods in producing realistic synthetic data.
Authors:Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Abstract:
Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGMs transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX.
Chinese: 本文提出CORTEX框架,通过样本级和码本级的解释方法识别概念相关的令牌组合来解读矢量量化生成模型,有效提升模型透明度并支持图像编辑和特征检测等应用。
English: This paper introduces CORTEX, a novel framework that interprets Vector-Quantized Generative Models by identifying concept-specific token combinations through sample-level and codebook-level methods, enhancing model transparency and enabling applications like image editing and feature detection.
Authors:Li Zhang, Morgan Gray, Jaromir Savelka, Kevin D. Ashley
Abstract:
Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 & 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Link: https://lizhang-aiandlaw.github.io/An-Automated-Pipeline-for-Evaluating-LLM-Generated-3-ply-Case-Based-Legal-Arguments/
Authors:Saad Hossain, Samanvay Vajpayee, Sirisha Rambhatla
Abstract:
As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, the number of approaches and their recent increase have resulted in diverse evaluations-varied datasets, metrics, and inconsistent threat settings-making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of multiple fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks, and allows for the generation of harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python-first, dataclass-driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine-tuning regime, defense method, and metric suite, while ensuring end-to-end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine-tuning. Code is available at: https://github.com/criticalml-uw/SafeTuneBed
Chinese: 针对大语言模型参数高效微调与安全防御方法评估标准不统一的问题,SafeTuneBed工具包通过整合多样化数据集、集成先进防御机制、提供标准化评估指标,建立了首个专注于安全微调的可复现基准框架,有力推动该领域研究的规范化和可比性。
English: The proliferation of diverse evaluation methods for parameter-efficient fine-tuning and safety defenses in large language models has created challenges in fair comparison, prompting the introduction of SafeTuneBed—a unified benchmark and toolkit that standardizes datasets, defenses, and metrics to enable rigorous and reproducible safety research.
Authors:Leila Mahmoodi, Peyman Moghadam, Munawar Hayat, Christian Simon, Mehrtash Harandi
Abstract:
We introduce Flashback Learning (FL), a novel method designed to harmonize the stability and plasticity of models in Continual Learning (CL). Unlike prior approaches that primarily focus on regularizing model updates to preserve old information while learning new concepts, FL explicitly balances this trade-off through a bidirectional form of regularization. This approach effectively guides the model to swiftly incorporate new knowledge while actively retaining its old knowledge. FL operates through a two-phase training process and can be seamlessly integrated into various CL methods, including replay, parameter regularization, distillation, and dynamic architecture techniques. In designing FL, we use two distinct knowledge bases: one to enhance plasticity and another to improve stability. FL ensures a more balanced model by utilizing both knowledge bases to regularize model updates. Theoretically, we analyze how the FL mechanism enhances the stability-plasticity balance. Empirically, FL demonstrates tangible improvements over baseline methods within the same training budget. By integrating FL into at least one representative baseline from each CL category, we observed an average accuracy improvement of up to 4.91% in Class-Incremental and 3.51% in Task-Incremental settings on standard image classification benchmarks. Additionally, measurements of the stability-to-plasticity ratio confirm that FL effectively enhances this balance. FL also outperforms state-of-the-art CL methods on more challenging datasets like ImageNet.
中文摘要:Flashback Learning(FL)是一种新颖的持续学习方法,通过双向正则化有效平衡模型的稳定性与可塑性,在多个基准测试中显著提升准确率并优化了稳定性与可塑性的平衡。
English Summary: Flashback Learning (FL) is a novel continual learning method that balances stability and plasticity through bidirectional regularization, achieving significant accuracy improvements and better stability-plasticity balance across various benchmarks.
Authors:Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu
Abstract:
Recent advances in audio generation led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested ``in the wild''. Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
中文:当前音频伪造检测器在受控环境中准确率接近完美,但在跨领域场景下表现不佳,凸显了开发能跨语言、说话人和生成方法泛化的鲁棒模型的必要性。
English: Recent audio deepfake detectors achieve near-perfect accuracy in controlled settings but perform poorly in cross-domain scenarios, highlighting the need for more robust models that generalize across languages, speakers, and generation methods.
Authors:Junwoo Park, Hyuck Lee, Dohyun Lee, Daehoon Gwak, Jaegul Choo
Abstract:
Large Language Models (LLMs) have shown remarkable performance across diverse tasks without domain-specific training, fueling interest in their potential for time-series forecasting. While LLMs have shown potential in zero-shot forecasting through prompting alone, recent studies suggest that LLMs lack inherent effectiveness in forecasting. Given these conflicting findings, a rigorous validation is essential for drawing reliable conclusions. In this paper, we evaluate the effectiveness of LLMs as zero-shot forecasters compared to state-of-the-art domain-specific models. Our experiments show that LLM-based zero-shot forecasters often struggle to achieve high accuracy due to their sensitivity to noise, underperforming even simple domain-specific models. We have explored solutions to reduce LLMs' sensitivity to noise in the zero-shot setting, but improving their robustness remains a significant challenge. Our findings suggest that rather than emphasizing zero-shot forecasting, a more promising direction would be to focus on fine-tuning LLMs to better process numerical sequences. Our experimental code is available at https://github.com/junwoopark92/revisiting-LLMs-zeroshot-forecaster.
Chinese: 大语言模型在零样本时间序列预测中因对噪声敏感而表现不佳,相比追求零样本能力,对其进行微调以更好地处理数值序列是更有前景的方向。
English: Large Language Models (LLMs) underperform in zero-shot time-series forecasting due to noise sensitivity, and fine-tuning them for numerical data processing is a more viable approach than pursuing zero-shot capabilities.
Authors:Hao Li, Hao Wan, Yuzhou Chen, Dongsheng Ye, Yulia Gel, Hao Jiang
Abstract:
Dynamic graphs evolve continuously, presenting challenges for traditional graph learning due to their changing structures and temporal dependencies. Recent advancements have shown potential in addressing these challenges by developing suitable meta-learning-based dynamic graph neural network models. However, most meta-learning approaches for dynamic graphs rely on fixed weight update parameters, neglecting the essential intrinsic complex high-order topological information of dynamically evolving graphs. We have designed Dowker Zigzag Persistence (DZP), an efficient and stable dynamic graph persistent homology representation method based on Dowker complex and zigzag persistence, to capture the high-order features of dynamic graphs. Armed with the DZP ideas, we propose TMetaNet, a new meta-learning parameter update model based on dynamic topological features. By utilizing the distances between high-order topological features, TMetaNet enables more effective adaptation across snapshots. Experiments on real-world datasets demonstrate TMetaNet's state-of-the-art performance and resilience to graph noise, illustrating its high potential for meta-learning and dynamic graph analysis. Our code is available at https://github.com/Lihaogx/TMetaNet.
Chinese: 本文提出TMetaNet,一种基于Dowker Zigzag持久性捕捉动态图高阶拓扑特征的新型元学习模型,实验证明其具有卓越性能和抗噪能力。
English: This paper introduces TMetaNet, a novel meta-learning model that leverages Dowker Zigzag Persistence to capture high-order topological features of dynamic graphs, achieving superior performance and noise resilience in experiments.
Authors:Seunghan Lee, Taeyoung Park, Kibok Lee
Abstract:
Channel identifiability (CID) refers to the ability to distinguish between individual channels in time series (TS) modeling. The absence of CID often results in producing identical outputs for identical inputs, disregarding channel-specific characteristics. In this paper, we highlight the importance of CID and propose Channel Normalization (CN), a simple yet effective normalization strategy that enhances CID by assigning distinct affine transformation parameters to each channel. We further extend CN in two ways: 1) Adaptive CN (ACN) dynamically adjusts parameters based on the input TS, improving adaptability in TS models, and 2) Prototypical CN (PCN) introduces a set of learnable prototypes instead of per-channel parameters, enabling applicability to datasets with unknown or varying number of channels and facilitating use in TS foundation models. We demonstrate the effectiveness of CN and its variants by applying them to various TS models, achieving significant performance gains for both non-CID and CID models. In addition, we analyze the success of our approach from an information theory perspective. Code is available at https://github.com/seunghan96/CN.
Chinese: 本文提出了通道归一化(CN)及其自适应与原型变体,旨在增强时间序列建模中的通道可识别性,显著提升了多种模型的性能表现。
English: This paper introduces Channel Normalization (CN) and its adaptive and prototypical variants to enhance channel identifiability in time series modeling, significantly improving model performance across various applications.
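A minimal PyTorch sketch of the core CN idea, per-channel affine parameters on top of a shared normalization, is given below; module name, shapes, and hyperparameters are assumptions for illustration, and the ACN/PCN variants are omitted.

```python
import torch
import torch.nn as nn

class ChannelNorm(nn.Module):
    """Illustrative sketch of Channel Normalization (CN): each channel gets its
    own affine parameters, so identical inputs on different channels no longer
    map to identical outputs."""
    def __init__(self, num_channels: int, d_model: int, eps: float = 1e-5):
        super().__init__()
        # one (gamma, beta) pair per channel, broadcast over the feature dim
        self.gamma = nn.Parameter(torch.ones(num_channels, d_model))
        self.beta = nn.Parameter(torch.zeros(num_channels, d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_channels, d_model)
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return x_hat * self.gamma + self.beta  # channel-specific affine

x = torch.randn(8, 7, 64)           # 7 channels of a multivariate series
print(ChannelNorm(7, 64)(x).shape)  # torch.Size([8, 7, 64])
```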
Authors:Ziwen Wang
Abstract:
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular processes by enabling gene expression analysis at the individual cell level. Clustering allows for the identification of cell types and the further discovery of intrinsic patterns in single-cell data. However, the high dimensionality and sparsity of scRNA-seq data continue to challenge existing clustering models. In this paper, we introduce JojoSCL, a novel self-supervised contrastive learning framework for scRNA-seq clustering. By incorporating a shrinkage estimator based on hierarchical Bayesian estimation, which adjusts gene expression estimates towards more reliable cluster centroids to reduce intra-cluster dispersion, and optimized using Stein's Unbiased Risk Estimate (SURE), JojoSCL refines both instance-level and cluster-level contrastive learning. Experiments on ten scRNA-seq datasets substantiate that JojoSCL consistently outperforms prevalent clustering methods, with further validation of its practicality through robustness analysis and ablation studies. JojoSCL's code is available at: https://github.com/ziwenwang28/JojoSCL.
中文: JojoSCL是一种新颖的自监督对比学习框架,通过基于分层贝叶斯估计的收缩估计器和Stein无偏风险估计优化,有效降低单细胞RNA测序数据聚类中的簇内离散度,在多个数据集上均优于现有聚类方法。
English: JojoSCL is a novel self-supervised contrastive learning framework that enhances scRNA-seq clustering by reducing intra-cluster dispersion through hierarchical Bayesian estimation and Stein's Unbiased Risk Estimate, consistently outperforming existing methods across multiple datasets.
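The shrinkage step described in the abstract, pulling each cell's expression estimate toward its cluster centroid to reduce intra-cluster dispersion, can be sketched as follows. The fixed shrinkage weight `lam` is a placeholder assumption; the paper tunes the strength via Stein's Unbiased Risk Estimate.

```python
import numpy as np

def shrink_toward_centroids(X, labels, lam=0.3):
    """Illustrative shrinkage: move each cell's expression vector a fraction
    `lam` of the way toward its current cluster centroid, reducing
    intra-cluster dispersion. The shrinkage strength is a fixed placeholder
    here, whereas the paper selects it via SURE."""
    X_shrunk = X.copy()
    for k in np.unique(labels):
        idx = labels == k
        centroid = X[idx].mean(axis=0)
        X_shrunk[idx] = (1 - lam) * X[idx] + lam * centroid
    return X_shrunk

X = np.random.rand(100, 2000)               # 100 cells x 2000 genes
labels = np.random.randint(0, 5, size=100)  # provisional cluster assignments
print(shrink_toward_centroids(X, labels).shape)
```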
Authors:Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Abstract:
Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec.
中文:MagiCodec是一种创新的流式音频编解码器,在保持高重建质量的同时增强了编码的语义表达能力,在音频保真度和下游任务性能上均优于现有模型。
English: MagiCodec is a novel streaming audio codec that enhances semantic expressiveness in token representations while maintaining high reconstruction quality, outperforming existing models in both audio fidelity and downstream task performance.
Authors:Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo
Abstract:
While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available $\href{https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark}{here}$.
中文: 该摘要介绍了AVROBUSTBENCH这一基准测试,用于评估视听模型在双模态同时受损时的鲁棒性,发现现有模型和适应方法对此类变化表现不佳,并提出了一种新的方法显示出改进效果。
English: The abstract introduces AVROBUSTBENCH, a benchmark for evaluating audio-visual models' robustness to simultaneous corruptions in both modalities, revealing that current models and adaptation methods struggle with such shifts, while proposing a new approach that shows improvement.
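AV2C is described only at a high level (on-the-fly cross-modal fusion that penalizes high-entropy samples). A hedged test-time sketch consistent with that description, with the exponential entropy weighting chosen as an assumption, might look like:

```python
import torch
import torch.nn.functional as F

def entropy(p):  # Shannon entropy of a softmax distribution, per sample
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def fuse_av_logits(audio_logits, video_logits):
    """Entropy-weighted fusion sketch: samples and modalities with more
    confident, lower-entropy predictions contribute more to the fused logits."""
    pa, pv = F.softmax(audio_logits, dim=-1), F.softmax(video_logits, dim=-1)
    wa = torch.exp(-entropy(pa)).unsqueeze(-1)   # down-weight high entropy
    wv = torch.exp(-entropy(pv)).unsqueeze(-1)
    return (wa * audio_logits + wv * video_logits) / (wa + wv)

a, v = torch.randn(4, 309), torch.randn(4, 309)  # e.g. VGGSound class logits
print(fuse_av_logits(a, v).shape)
```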
Authors:Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, Prashant J. Nair
Abstract:
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves significant end-to-end speedups while maintaining video quality. The source code of Foresight is available at \href{https://github.com/STAR-Laboratory/foresight}{https://github.com/STAR-Laboratory/foresight}.
Chinese: Foresight 提出了一种自适应层重用技术,在保持视频生成质量的同时,有效降低了扩散变换器的计算成本,显著提升了 OpenSora 等模型的运行速度。
English: Foresight introduces an adaptive layer-reuse technique that reduces computational costs in Diffusion Transformers for video generation while maintaining quality, achieving significant speed improvements in models like OpenSora.
Authors:Sofiane Mahiou, Amir Dizche, Reza Nazari, Xinmin Wu, Ralph Abbey, Jorge Silva, Georgi Ganev
Abstract:
We propose dpmm, an open-source library for synthetic data generation with Differentially Private (DP) guarantees. It includes three popular marginal models -- PrivBayes, MST, and AIM -- that achieve superior utility and offer richer functionality compared to alternative implementations. Additionally, we adopt best practices to provide end-to-end DP guarantees and address well-known DP-related vulnerabilities. Our goal is to accommodate a wide audience with easy-to-install, highly customizable, and robust model implementations.
Our codebase is available from https://github.com/sassoftware/dpmm.
中文: 我们推出dpmm开源库,用于生成具有差分隐私保障的合成数据,包含三种高效边际模型,并提供端到端的隐私保护功能。
English: We introduce dpmm, an open-source library for generating synthetic data with differential privacy guarantees, featuring three high-utility marginal models and robust end-to-end privacy protections.
Authors:Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Abstract:
Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at https://github.com/KurbanIntelligenceLab/MultiCrystalSpectrumSet.
中文摘要:本文提出的MCS-Set框架通过整合原子结构、二维投影和文本标注,构建多模态材料科学数据集,支持属性预测与晶体生成等任务,为多模态模型基准测试和材料数据分析提供新基础。
English Summary: This paper introduces MCS-Set, a multimodal framework that enhances materials science datasets by integrating atomic structures with visual projections and textual annotations to enable advanced machine learning applications like property prediction and crystal generation.
Authors:Boshra Khajehpiri, Eric Granger, Massimiliano de Zambotti, Fiona C. Baker, Mohamad Forouzanfar
Abstract:
Despite extensive research on the relationship between sleep and cognition, the connection between sleep microstructure and human performance across specific cognitive domains remains underexplored. This study investigates whether deep learning models can predict executive functions, particularly cognitive adaptability and conceptual reasoning from physiological processes during a night's sleep. To address this, we introduce CogPSGFormer, a multi-scale convolutional-transformer model designed to process multi-modal polysomnographic data. This model integrates one-channel ECG and EEG signals along with extracted features, including EEG power bands and heart rate variability parameters, to capture complementary information across modalities. A thorough evaluation of the CogPSGFormer architecture was conducted to optimize the processing of extended sleep signals and identify the most effective configuration. The proposed framework was evaluated on 817 individuals from the STAGES dataset using cross-validation. The model achieved 80.3\% accuracy in classifying individuals into low vs. high cognitive performance groups on unseen data based on Penn Conditional Exclusion Test (PCET) scores. These findings highlight the effectiveness of our multi-scale feature extraction and multi-modal learning approach in leveraging sleep-derived signals for cognitive performance prediction. To facilitate reproducibility, our code is publicly accessible (https://github.com/boshrakh95/CogPSGFormer.git).
中文: 本研究提出的CogPSGFormer深度学习模型通过分析多模态睡眠数据,能以80.3%的准确率预测认知表现,揭示了睡眠微结构在评估执行功能方面的重要价值。
English: This study introduces CogPSGFormer, a deep learning model that analyzes multi-modal sleep data to predict cognitive performance with 80.3% accuracy, demonstrating the potential of sleep microstructure for assessing executive functions.
Authors:Dang Nguyen, Ali Payani, Baharan Mirzasoleiman
Abstract:
Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address these limitations, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at https://github.com/BigML-CS-UCLA/SNNE.
中文摘要:本文提出了一种新的黑盒不确定性量化方法,通过考虑簇内和簇间相似性改进了语义熵,在多种大语言模型和文本生成任务中展现出更优性能。
English Summary: This paper introduces a new black-box uncertainty quantification method that improves upon semantic entropy by accounting for intra-cluster and inter-cluster similarities, demonstrating superior performance across multiple LLMs and text generation tasks.
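As a reference point for the nearest-neighbor view of uncertainty, the sketch below computes a simplified Kozachenko-Leonenko k-NN entropy estimate over embedded responses (Chebyshev distances so the ball-volume term is exact). It is a generic estimator for illustration, not the paper's SNNE objective, which further accounts for intra-/inter-cluster similarity and token probabilities.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import digamma

def knn_entropy(X, k=1):
    """Simplified Kozachenko-Leonenko k-NN entropy estimate over embedded
    responses: wider spread between sampled answers implies higher estimated
    uncertainty. Chebyshev distances make the (2*eps)^d volume term exact."""
    n, d = X.shape
    dists = cdist(X, X, metric="chebyshev")
    np.fill_diagonal(dists, np.inf)
    eps = np.sort(dists, axis=1)[:, k - 1]   # distance to the k-th neighbor
    return digamma(n) - digamma(k) + d * np.mean(np.log(2.0 * eps + 1e-12))

embeddings = np.random.randn(10, 384)  # e.g. 10 sampled answers, embedded
print(knn_entropy(embeddings))
```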
Authors:Rebekah A. Gelpí, Yibing Ju, Ethan C. Jackson, Yikai Tang, Shon Verch, Claas Voelcker, William A. Cunningham
Abstract:
We introduce Sorrel (https://github.com/social-ai-uoft/sorrel), a simple Python interface for generating and testing new multi-agent reinforcement learning environments. This interface places a high degree of emphasis on simplicity and accessibility, and uses a more psychologically intuitive structure for the basic agent-environment loop, making it a useful tool for social scientists to investigate how learning and social interaction leads to the development and change of group dynamics. In this short paper, we outline the basic design philosophy and features of Sorrel.
中文: Sorrel是一个简洁的Python接口,专注于生成和测试多智能体强化学习环境,其设计强调易用性和心理直观性,帮助社会科学家通过学习和社交互动研究群体动态的发展与变化。
English: Sorrel is a user-friendly Python interface designed for creating and evaluating multi-agent reinforcement learning environments, emphasizing simplicity and a psychologically intuitive structure to aid social scientists in studying group dynamics through learning and social interactions.
Authors:Edward Fish, Richard Bowden
Abstract:
Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fréchet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
中文: 本研究提出Geo-Sign方法,通过将骨骼特征投影到双曲空间来增强手语翻译中的骨架表示,从而更好地捕捉层次化运动结构,在提升精度的同时提高了计算效率。
English: This work introduces Geo-Sign, a method that enhances skeletal representations for sign language translation by projecting features into hyperbolic space to better capture hierarchical kinematic structures, improving both accuracy and computational efficiency over existing approaches.
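Projecting Euclidean skeletal features into the Poincaré ball is typically done with the exponential map at the origin; a standalone sketch of that map is below. The curvature value and clamping constant are assumptions, and the paper's projection layer, Fréchet-mean aggregation, and contrastive loss are omitted.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincaré ball with curvature c:
    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    Maps Euclidean feature vectors into the open unit ball (for c = 1)."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

feats = torch.randn(16, 128)            # e.g. ST-GCN skeletal features
ball = expmap0(feats)
print(ball.norm(dim=-1).max() < 1.0)    # all points lie inside the unit ball
```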
Authors:Dipam Goswami, Liying Wang, Bartłomiej Twardowski, Joost van de Weijer
Abstract:
Text embedding models enable semantic search, powering several NLP applications like Retrieval Augmented Generation by efficient information retrieval (IR). However, text embedding models are commonly studied in scenarios where the training data is static, thus limiting its applications to dynamic scenarios where new training data emerges over time. IR methods generally encode a huge corpus of documents to low-dimensional embeddings and store them in a database index. During retrieval, a semantic search over the corpus is performed and the document whose embedding is most similar to the query embedding is returned. When updating an embedding model with new training data, using the already indexed corpus is suboptimal due to the non-compatibility issue, since the model which was used to obtain the embeddings of the corpus has changed. While re-indexing of old corpus documents using the updated model enables compatibility, it requires much higher computation and time. Thus, it is critical to study how the already indexed corpus can still be effectively used without the need of re-indexing. In this work, we establish a continual learning benchmark with large-scale datasets and continually train dense retrieval embedding models on query-document pairs from new datasets in each task and observe forgetting on old tasks due to significant drift of embeddings. We employ embedding distillation on both query and document embeddings to maintain stability and propose a novel query drift compensation method during retrieval to project new model query embeddings to the old embedding space. This enables compatibility with previously indexed corpus embeddings extracted using the old model and thus reduces the forgetting. We show that the proposed method significantly improves performance without any re-indexing. Code is available at https://github.com/dipamgoswami/QDC.
中文: 本研究针对动态数据场景下嵌入模型不兼容的问题,提出查询漂移补偿方法,将新查询嵌入映射至旧模型空间,无需重新索引即可保持检索性能,显著提升系统效率。
English: This study addresses the challenge of embedding model incompatibility in dynamic data scenarios by proposing a query drift compensation method that projects new query embeddings into the old model's space, eliminating the need for costly re-indexing while maintaining retrieval performance.
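One natural way to realize the described projection of new-model query embeddings into the old embedding space is a least-squares linear map fitted on a small calibration set of queries embedded by both models. This is an assumption-level sketch, not necessarily the paper's exact QDC formulation.

```python
import numpy as np

def fit_drift_projection(Q_new, Q_old):
    """Least-squares map W such that Q_new @ W ~= Q_old. Applying W to queries
    from the updated model projects them into the old embedding space, so the
    already-indexed corpus embeddings can be searched without re-indexing.
    Illustrative only; the paper's QDC method may differ."""
    W, *_ = np.linalg.lstsq(Q_new, Q_old, rcond=None)
    return W

d = 768
Q_new = np.random.randn(512, d)                  # queries under the new model
Q_old = (Q_new @ np.random.randn(d, d)) * 0.05   # same queries, old model (toy)
W = fit_drift_projection(Q_new, Q_old)
compat_query = np.random.randn(1, d) @ W         # ready to search the old index
print(compat_query.shape)
```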
Authors:Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
Abstract:
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates at most 40.0% by Browser-Use Openai-o3, far below human-level performance, 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
Chinese Summary: Open CaptchaWorld 是首个基于网络的基准测试平台,用于评估多模态大语言模型代理解决各类验证码的能力,实验显示人类表现接近完美,而顶尖模型成功率远低于人类水平,揭示了当前交互推理能力的重大不足。
English Summary: Open CaptchaWorld is introduced as the first web-based benchmark to test multimodal LLM agents' capabilities in solving diverse CAPTCHA puzzles, revealing that while humans achieve near-perfect scores, current state-of-the-art agents significantly underperform, highlighting critical gaps in interactive reasoning.
Authors:Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
Abstract:
Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples -- valuable, yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, We propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.
中文: 本文提出强化蒸馏(REDI)框架,通过两阶段学习同时利用正负推理轨迹来增强模型推理能力,在数学任务上使用公开数据训练的1.5B模型取得了最优性能。
English: This paper introduces Reinforcement Distillation (REDI), a two-stage framework that leverages both positive and negative reasoning traces to enhance model reasoning, achieving state-of-the-art performance for 1.5B models on mathematical tasks with openly available data.
Authors:Wanyun Xie, Francesco Tonin, Volkan Cevher
Abstract:
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over uniform mixture. Our code is available at https://github.com/LIONS-EPFL/Chameleon.
Chinese: Chameleon框架通过利用嵌入空间中的杠杆分数动态调整训练领域权重,无需昂贵重训练即可适应新数据,在预训练、迁移学习和微调场景中持续提升语言模型性能。
English: The Chameleon framework efficiently improves language model performance by using leverage scores to dynamically reweight training domains in embedding space, enabling seamless adaptation to new data without costly retraining across pretraining, transfer learning, and fine-tuning scenarios.
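Leverage scores over a domain affinity matrix are a standard quantity; the sketch below computes ridge leverage scores of K = E E^T and normalizes them into a mixture. The affinity construction and ridge parameter `lam` are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def ridge_leverage_scores(E, lam=1.0):
    """Ridge leverage scores over the domain affinity matrix K = E E^T, where
    E is (n_domains x d) of row-normalized domain embeddings:
    score_i = [K (K + lam*I)^{-1}]_{ii}. Domains whose embeddings share
    directions with others receive correlated scores."""
    K = E @ E.T
    H = K @ np.linalg.inv(K + lam * np.eye(K.shape[0]))
    return np.diag(H)

E = np.random.randn(8, 256)                    # 8 training domains embedded
E /= np.linalg.norm(E, axis=1, keepdims=True)  # row-normalize the embeddings
scores = ridge_leverage_scores(E)
mixture = scores / scores.sum()                # illustrative domain weights
print(mixture.round(3))
```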
Authors:Fuyuan Lyu, Linfeng Du, Yunpeng Weng, Qiufang Ying, Zhiyan Xu, Wen Zou, Haolun Wu, Xiuqiang He, Xing Tang
Abstract:
Fund allocation has been an increasingly important problem in the financial domain. In reality, we aim to allocate the funds to buy certain assets within a certain future period. Naive solutions such as prediction-only or Predict-then-Optimize approaches suffer from goal mismatch. Additionally, the introduction of the SOTA time series forecasting model inevitably introduces additional uncertainty in the predicted result. To solve both problems mentioned above, we introduce a Risk-aware Time-Series Predict-and-Allocate (RTS-PnO) framework, which holds no prior assumption on the forecasting models. Such a framework contains three features: (i) end-to-end training with objective alignment measurement, (ii) adaptive forecasting uncertainty calibration, and (iii) agnostic towards forecasting models. The evaluation of RTS-PnO is conducted over both online and offline experiments. For offline experiments, eight datasets from three categories of financial applications are used: Currency, Stock, and Cryptos. RTS-PnO consistently outperforms other competitive baselines. The online experiment is conducted on the Cross-Border Payment business at FiT, Tencent, and an 8.4\% decrease in regret is witnessed when compared with the product-line approach. The code for the offline experiment is available at https://github.com/fuyuanlyu/RTS-PnO.
中文: RTS-PnO框架通过端到端训练、自适应预测不确定性校准和模型无关特性,解决了资金分配中的目标不匹配和预测不确定性问题,在离线和在线金融实验中均表现出优越性能。
English: The RTS-PnO framework addresses goal mismatch and forecasting uncertainty in fund allocation by integrating end-to-end training, adaptive uncertainty calibration, and model-agnostic features, demonstrating superior performance in both offline and online financial experiments.
Authors:Marc González, Rachid Guerraoui, Rafael Pinot, Geovani Rizk, John Stephan, François Taïani
Abstract:
We present ByzFL, an open-source Python library for developing and benchmarking robust federated learning (FL) algorithms. ByzFL provides a unified and extensible framework that includes implementations of state-of-the-art robust aggregators, a suite of configurable attacks, and tools for simulating a variety of FL scenarios, including heterogeneous data distributions, multiple training algorithms, and adversarial threat models. The library enables systematic experimentation via a single JSON-based configuration file and includes built-in utilities for result visualization. Compatible with PyTorch tensors and NumPy arrays, ByzFL is designed to facilitate reproducible research and rapid prototyping of robust FL solutions. ByzFL is available at https://byzfl.epfl.ch/, with source code hosted on GitHub: https://github.com/LPD-EPFL/byzfl.
中文: ByzFL 是一个开源的 Python 库,为开发和评估鲁棒的联邦学习算法提供了统一框架,包含可配置攻击、模拟场景及可视化工具,以支持可重复性研究。
English: ByzFL is an open-source Python library offering a unified framework for developing and benchmarking robust federated learning algorithms, featuring configurable attacks, simulations, and visualization tools to support reproducible research.
Authors:Yidong Luo, Chenguang Wang, Jiahao Yang, Fanzeng Xia, Tianshu Yu
Abstract:
Mixed-Integer Linear Programming (MILP) is fundamental to solving complex decision-making problems. The proliferation of MILP instance generation methods, driven by machine learning's demand for diverse optimization datasets and the limitations of static benchmarks, has significantly outpaced standardized evaluation techniques. Consequently, assessing the fidelity and utility of synthetic MILP instances remains a critical, multifaceted challenge. This paper introduces a comprehensive benchmark framework designed for the systematic and objective evaluation of MILP instance generation methods. Our framework provides a unified and extensible methodology, assessing instance quality across crucial dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream machine learning tasks. A key innovation is its in-depth analysis of solver-internal features -- particularly by comparing distributions of key solver outputs including root node gap, heuristic success rates, and cut plane usage -- leveraging the solver's dynamic solution behavior as an `expert assessment' to reveal nuanced computational resemblances. By offering a structured approach with clearly defined solver-independent and solver-dependent metrics, our benchmark aims to facilitate robust comparisons among diverse generation techniques, spur the development of higher-quality instance generators, and ultimately enhance the reliability of research reliant on synthetic MILP data. The framework's effectiveness in systematically comparing the fidelity of instance sets is demonstrated using contemporary generative models.
中文摘要:本文提出一个综合性基准框架,通过数学有效性、结构相似性、计算复杂度和下游应用价值四个维度,结合求解器内部特征分析,系统评估混合整数线性规划实例生成方法的质量。
English Summary: This paper introduces a comprehensive benchmark framework for systematically evaluating Mixed-Integer Linear Programming instance generation methods, assessing mathematical validity, structural similarity, computational hardness, and downstream utility through novel solver-internal feature analysis.
Authors:Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
Abstract:
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
中文: Reasoning Gym是一个强化学习推理环境库,通过100多个可验证奖励的领域生成器实现无限可调难度的训练数据,有效支持推理模型的评估与训练。
English: Reasoning Gym is a reinforcement learning library featuring over 100 procedurally generated environments with verifiable rewards across multiple domains, enabling infinite adjustable-difficulty training data for effective model evaluation and training.
Authors:Benjamin Holzschuh, Qiang Liu, Georg Kohl, Nils Thuerey
Abstract:
We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific for large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. This helps to maintain a consistent information density of tokens when learning multiple types of PDEs simultaneously. We demonstrate that our pre-trained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations.
中文: PDE-Transformer是一种改进的Transformer架构,在规则网格上模拟多种偏微分方程时优于现有最佳模型,并通过预训练在下游任务中表现出色。
English: PDE-Transformer is an enhanced transformer architecture that outperforms state-of-the-art models in simulating various PDEs on regular grids and excels in downstream tasks through pre-training.
Authors:Masahiro Negishi, Thomas Gärtner, Pascal Welke
Abstract:
We investigate the distance function learned by message passing neural networks (MPNNs) in specific tasks, aiming to capture the functional distance between prediction targets that MPNNs implicitly learn. This contrasts with previous work, which links MPNN distances on arbitrary tasks to structural distances on graphs that ignore task-specific information. To address this gap, we distill the distance between MPNN embeddings into an interpretable graph distance. Our method uses optimal transport on the Weisfeiler Leman Labeling Tree (WILT), where the edge weights reveal subgraphs that strongly influence the distance between embeddings. This approach generalizes two well-known graph kernels and can be computed in linear time. Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain.
Chinese: 本研究探讨了消息传递神经网络如何学习预测目标间的功能距离,通过基于Weisfeiler Leman标记树的最优传输方法将其提炼为可解释的图距离,从而高效识别关键子图结构。
English: This study explores how message passing neural networks (MPNNs) learn functional distances between prediction targets, distilling these into an interpretable graph distance using optimal transport on the Weisfeiler Leman Labeling Tree to identify influential subgraphs efficiently.
Authors:Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia
Abstract:
The performance of existing audio deepfake detection frameworks degrades when confronted with new deepfake attacks. Rehearsal-based continual learning (CL), which updates models using a limited set of old data samples, helps preserve prior knowledge while incorporating new information. However, existing rehearsal techniques don't effectively capture the diversity of audio characteristics, introducing bias and increasing the risk of forgetting. To address this challenge, we propose Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based CL approach for audio deepfake detection. RAIS employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer. Extensive experiments show RAIS outperforms state-of-the-art methods, achieving an average Equal Error Rate (EER) of 1.953 % across five experiences. The code is available at: https://github.com/falihgoz/RAIS.
中文: 提出的RAIS方法通过辅助标签选择多样化样本进行排练,提升了音频深度伪造检测性能,以1.953%的平均等错误率超越现有技术。
English: The proposed RAIS method enhances audio deepfake detection by using auxiliary labels to select diverse samples for rehearsal, outperforming existing techniques with a 1.953% average EER.
Authors:Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Abstract:
In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.
中文摘要:本文提出了一种系统性方法,通过平衡探索与稳定性的增强训练策略,有效提升了代码集成推理中工具增强强化学习的训练效果,在多个数学推理基准测试中实现了显著性能提升,并揭示了代码集成扩展模型能力边界的关键机制。
English Summary: This paper introduces a systematic approach to enhance the training stability and effectiveness of tool-augmented reinforcement learning for code-integrated reasoning, demonstrating significant performance gains across mathematical benchmarks through improved strategies that balance exploration and capability development.
Authors:Jingyao Li, Senqiao Yang, Sitong Wu, Han Shi, Chuanyang Zheng, Hong Xu, Jiaya Jia
Abstract:
In recent years, developing compact and efficient large language models (LLMs) has emerged as a thriving area of research. Traditional Supervised Fine-Tuning (SFT), which relies on singular ground truth labels, often fails to capture token-level dependencies and linguistic diversity. To address these limitations, we propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation. Our approach constructs enriched training targets by combining teacher logits with ground truth labels, preserving both correctness and linguistic diversity. This ensures more reliable and effective training. We constructed a large-scale 1.2M logits dataset and trained a series of science-focused models. Experimental results demonstrate that our method achieves significant improvements, with accuracy gains of 18% on Mawps and 22.7% on TabMWP. Across nine widely used mathematical benchmarks, our method consistently outperforms prior SFT models, achieving an average improvement of 7.28%. Codes are available at https://github.com/dvlab-research/Logits-Based-Finetuning.
中文: 本研究提出了一种基于逻辑值的微调框架,将监督学习与知识蒸馏相结合,利用教师逻辑值和真实标签优化训练,在科学领域基准测试中取得了显著的准确率提升。
English: This study introduces a logits-based fine-tuning framework that merges supervised learning with knowledge distillation, using teacher logits and ground truth labels to enhance training, resulting in significant accuracy gains on science-focused benchmarks.
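A common way to combine teacher logits with ground-truth labels is to mix a cross-entropy term with a temperature-scaled KL distillation term; the sketch below shows that generic combination. The mixing weight `alpha` and temperature `T` are assumed hyperparameters, and the paper's exact target construction may differ.

```python
import torch
import torch.nn.functional as F

def logits_finetune_loss(student_logits, teacher_logits, labels,
                         alpha=0.5, T=2.0):
    """Generic logits-based fine-tuning objective: cross-entropy against the
    ground-truth next tokens plus a temperature-scaled KL term toward the
    teacher's token distribution, preserving correctness and diversity."""
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kl

s = torch.randn(2, 16, 32000)            # (batch, seq, vocab) student logits
t = torch.randn(2, 16, 32000)            # teacher logits
y = torch.randint(0, 32000, (2, 16))     # ground-truth token ids
print(logits_finetune_loss(s, t, y))
```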
Authors:Heejo Kong, Sung-Jin Kim, Gunho Jung, Seong-Whan Lee
Abstract:
Conventional semi-supervised learning (SSL) ideally assumes that labeled and unlabeled data share an identical class distribution; however, in practice, this assumption is easily violated, as unlabeled data often includes unknown class data, i.e., outliers. The outliers are treated as noise, considerably degrading the performance of SSL models. To address this drawback, we propose a novel framework, Diversify and Conquer (DAC), to enhance SSL robustness in the context of open-set semi-supervised learning. In particular, we note that existing open-set SSL methods rely on prediction discrepancies between inliers and outliers from a single model trained on labeled data. This approach can easily fail when the labeled data is insufficient, leading to performance degradation that is worse than that of naive SSL methods that do not account for outliers. In contrast, our approach exploits prediction disagreements among multiple models that are differently biased towards the unlabeled distribution. By leveraging the discrepancies arising from training on unlabeled data, our method enables robust outlier detection even when the labeled data is underspecified. Our key contribution is constructing a collection of differently biased models through a single training process. By encouraging divergent heads to be differently biased towards outliers while making consistent predictions for inliers, we exploit the disagreement among these heads as a measure to identify unknown concepts. Our code is available at https://github.com/heejokong/DivCon.
中文: 提出的“多样化与征服”(DAC)框架通过训练多个具有不同偏差的模型,利用其预测差异来检测异常数据,有效提升了半监督学习在开放集场景下的鲁棒性,克服了现有方法在标注数据不足时性能下降的问题。
English: The proposed Diversify and Conquer (DAC) framework enhances semi-supervised learning robustness by training multiple divergent models that detect outliers through their prediction disagreements, overcoming limitations of existing methods that fail with insufficient labeled data.
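Using disagreement among differently biased heads as an outlier score can be sketched as below; the number of heads, the variance-based score, and the median threshold are illustrative assumptions consistent with the abstract's description.

```python
import torch
import torch.nn.functional as F

def disagreement_score(head_logits: torch.Tensor) -> torch.Tensor:
    """head_logits: (num_heads, batch, num_classes). Heads are trained to stay
    consistent on inliers but diverge on outliers, so the variance of their
    class distributions serves as an unknown-class (outlier) score."""
    probs = F.softmax(head_logits, dim=-1)
    return probs.var(dim=0).sum(dim=-1)   # per-sample disagreement

logits = torch.randn(4, 32, 10)           # 4 divergent heads, 32 samples
scores = disagreement_score(logits)
is_outlier = scores > scores.median()     # illustrative threshold
print(is_outlier.sum().item(), "samples flagged as possible outliers")
```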
Authors:Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Xiaokang Yang
Abstract:
Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the Key-Value (KV) cache. This makes KV cache compression an essential step toward enabling efficient long-context reasoning. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. To address these challenges, we propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache. We develop distinct compression strategies for Keys and Values based on their different roles and varying importance in the attention mechanism. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters similar heads and applies grouped SVD to the key projection matrix, reducing additional computation while preserving accuracy. For Values, we propose Offline Calibration and Matrix Fusion (OCMF) to preserve accuracy without extra computational overhead. Experiments show that ReCalKV outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. The code and models will be available at: https://github.com/XIANGLONGYAN/ReCalKV.
中文: ReCalKV提出了一种新颖的KV缓存低秩压缩方法,通过键向量的头部相似性重排序和值向量的离线校准策略,在保持高压缩比的同时显著优于现有方法且性能损失最小。
English: ReCalKV introduces a novel low-rank compression method for KV cache by employing head-wise similarity reordering for Keys and offline value calibration for Values, significantly outperforming existing techniques with minimal performance loss at high compression ratios.
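The key-side idea, reducing the hidden dimension of cached keys by factoring the key projection, can be illustrated for a single head group with a truncated SVD; head clustering (HSR), calibration, and the value-side OCMF step are omitted, and all shapes are assumptions.

```python
import torch

def low_rank_factor(W_k: torch.Tensor, rank: int):
    """Truncated SVD of a key projection W_k (d_model x d_head) into
    A (d_model x r) and B (r x d_head). Caching x @ A stores r-dim keys;
    B is folded back in at attention time, trading a small reconstruction
    error for a smaller KV cache."""
    U, S, Vh = torch.linalg.svd(W_k, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (d_model, r)
    B = Vh[:rank]                       # (r, d_head)
    return A, B

d_model, d_head, r = 4096, 128, 32
W_k = torch.randn(d_model, d_head)
A, B = low_rank_factor(W_k, r)
x = torch.randn(1, 10, d_model)                  # a short token sequence
compressed_keys = x @ A                          # cached: 32 dims instead of 128
print(torch.dist(compressed_keys @ B, x @ W_k))  # reconstruction error
```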
Authors:Gilles Quentin Hacheme, Girmaw Abebe Tadesse, Caleb Robinson, Akram Zaytar, Rahul Dodhia, Juan M. Lavista Ferres
Abstract:
Classifying geospatial imagery remains a major bottleneck for applications such as disaster response and land-use monitoring, particularly in regions where annotated data is scarce or unavailable. Existing tools (e.g., RS-CLIP) that claim zero-shot classification capabilities for satellite imagery nonetheless rely on task-specific pretraining and adaptation to reach competitive performance. We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework: a vision Large Language Model (vLLM) generates rich, human-readable image descriptions, which are then mapped to user-defined classes by a conventional Large Language Model (LLM). This modular and interpretable pipeline enables flexible image classification for a large range of use cases. We evaluated GVL across three benchmarks: SpaceNet v7, UC Merced, and RESISC45. It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7. For complex multi-class classification tasks (UC Merced, RESISC45), we implemented a recursive LLM-driven clustering to form meta-classes at successive depths, followed by hierarchical classification (first resolving coarse groups, then finer distinctions) to deliver competitive zero-shot performance. GVL is open-sourced at https://github.com/microsoft/geo-vision-labeler to catalyze adoption in real-world geospatial workflows.
Chinese: GeoVision标注器(GVL)是一种严格零样本分类框架,通过视觉大语言模型生成图像描述并由常规大语言模型将其映射至用户定义类别,无需任务特定训练即可实现高精度地理空间影像分类。
English: The GeoVision Labeler (GVL) is a strictly zero-shot classification framework that uses a vision Large Language Model to generate image descriptions and a conventional LLM to map them to user-defined classes, achieving high accuracy in geospatial imagery classification without task-specific training.
Authors:Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho
Abstract:
Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog sounds and 16 kHz or 22.05 kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion roars, birdsongs) and designed voices (synthetic growls). To accommodate the generation of diverse non-speech sounds and 44.1 kHz high-quality audio transformation, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results show that the proposed method outperforms baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres. Demo samples are available at https://nc-ai.github.io/speech/publications/nonhuman-vc/
Authors:Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Abstract:
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77$\times$ training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Chinese: AReaL是一种异步强化学习系统,通过解耦生成与训练来提高GPU利用率并加速大语言模型训练,在推理任务上实现了最高2.77倍的加速,同时保持或提升了性能。
English: AReaL is an asynchronous reinforcement learning system that decouples generation from training to enhance GPU utilization and accelerate LLM training, achieving up to 2.77× speedup with stable or improved performance on reasoning tasks.
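The core system idea, decoupling rollout generation from training through a bounded buffer that also caps data staleness, can be sketched as a producer-consumer loop. The sketch below is a toy illustration under stated assumptions; `generate_rollout`, the version counter, and the staleness filter are placeholders, not AReaL's actual interfaces.

```python
# Toy sketch of asynchronous RL: rollout workers keep producing without waiting,
# the trainer consumes batches and advances the model version; a bounded queue and
# a staleness filter stand in for AReaL's workload-balancing and staleness control.
import queue, threading, time, random

rollout_queue = queue.Queue(maxsize=64)   # bounded buffer limits data staleness
model_version = 0

def generate_rollout(version):
    time.sleep(random.uniform(0.01, 0.05))          # stand-in for LLM generation latency
    return {"tokens": [random.randint(0, 9) for _ in range(8)], "version": version}

def rollout_worker(stop):
    while not stop.is_set():
        rollout_queue.put(generate_rollout(model_version))   # never waits for the trainer

def trainer(stop, batch_size=8, max_staleness=2, steps=20):
    global model_version
    for _ in range(steps):
        batch = [rollout_queue.get() for _ in range(batch_size)]
        # Drop samples generated by a model that is too many versions old.
        batch = [b for b in batch if model_version - b["version"] <= max_staleness]
        model_version += 1                  # stand-in for a (staleness-aware) PPO update
    stop.set()

stop = threading.Event()
threading.Thread(target=rollout_worker, args=(stop,), daemon=True).start()
trainer(stop)
print("finished at model version", model_version)
```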
Authors:James R. Golden
Abstract:
We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.
中文: 研究表明,多种大型语言模型的推理过程无需修改权重即可精确映射为线性系统,通过奇异值分解发现这些模型在低维空间中运行,其中主要向量对应预测词汇相关的语义概念。
English: This study shows that the inference processes of various large language models can be accurately represented as linear systems without changing model weights, revealing through singular value decomposition that these models operate in low-dimensional spaces where dominant vectors correspond to semantic concepts related to predicted tokens.
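A toy way to see the kind of local linearity being claimed: for a bias-free ReLU network, the Jacobian at an input exactly reproduces the forward pass via J(x) @ x, and its SVD exposes the low-dimensional subspace the local map uses. This sketch uses a tiny MLP rather than an LLM and only illustrates the detached-Jacobian intuition, not the paper's procedure.

```python
# Minimal demonstration: a bias-free ReLU network is exactly linear around each input,
# so the Jacobian reproduces the forward pass and its SVD reveals the active subspace.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(16, 32, bias=False), torch.nn.ReLU(),
    torch.nn.Linear(32, 8, bias=False),
)

x = torch.randn(16)
J = torch.autograd.functional.jacobian(net, x)     # shape (8, 16)
print(torch.allclose(net(x), J @ x, atol=1e-5))    # True: the linear system matches
U, S, Vt = torch.linalg.svd(J)                     # singular values show effective rank
print(S)
```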
Authors:Lan-Cuong Nguyen, Quan Nguyen-Tri, Bang Tran Khanh, Dung D. Le, Long Tran-Thanh, Khoat Than
Abstract:
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
Chinese Summary: 本研究提出了一种理论框架和算法,通过优化的原型学习有效弥合了少样本图像分类中真实与合成数据间的分布差距,在多个数据集上超越了现有最优方法的性能表现。
English Summary: This study introduces a theoretical framework and algorithm that effectively bridge the distribution gap between real and synthetic data in few-shot image classification, achieving superior performance over state-of-the-art methods through optimized prototype learning.
Authors:Katherine Tieu, Dongqi Fu, Jun Wu, Jingrui He
Abstract:
In the era of foundation models, Out-of-Distribution (OOD) problems, i.e., the data discrepancy between training environments and testing environments, hinder AI generalization. Relational data like graphs violate the Independent and Identically Distributed (IID) condition, which makes the problem more challenging, and it becomes harder still when time is involved. Motivated by this, to realize robust invariant learning over temporal graphs, we investigate which components of temporal graphs are most invariant and representative with respect to labels. Using the Information Bottleneck (IB) method, we propose an error-bounded Invariant Link Selector that distinguishes invariant components from variant components during training, making the deep learning model generalizable to different testing scenarios. Besides deriving a series of rigorous, generalization-oriented optimization objectives, we also equip the training with task-specific loss functions, e.g., temporal link prediction, so that pretrained models can solve real-world application tasks like citation recommendation and merchandise recommendation, as demonstrated in our experiments against state-of-the-art (SOTA) methods. Our code is available at https://github.com/kthrn22/OOD-Linker.
中文: 针对时序图中的分布外问题,本研究基于信息瓶颈方法提出了一种误差有界的恒定链接选择器,能区分恒定与变异组件,从而提升模型在引文推荐和商品推荐等实际任务中的泛化能力。
English: In response to Out-of-Distribution challenges in temporal graphs, this study introduces an error-bounded Invariant Link Selector using the Information Bottleneck method to identify invariant components, enhancing model generalization for tasks like citation and merchandise recommendation.
Authors:Vishal Dey, Xiao Hu, Xia Ning
Abstract:
In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO-C.
中文摘要:本研究提出了首个针对多属性分子优化的指令调优数据集C-MuMOInstruct,并开发了GeLLMO-Cs模型系列,该模型在保持特定属性的同时显著提升其他分子性能,成功率达基线126%以上,且对新任务展现出卓越的零样本泛化能力。
English Summary: This study introduces C-MuMOInstruct, the first instruction-tuning dataset for multi-property molecular optimization, and develops GeLLMO-Cs models that significantly outperform baselines with up to 126% higher success rates while demonstrating strong generalization to novel tasks.
Authors:Aurosweta Mahapatra, Ismail Rasim Ulgen, Abinay Reddy Naini, Carlos Busso, Berrak Sisman
Abstract:
Traditional anti-spoofing focuses on models and datasets built on synthetic speech in a mostly neutral state, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a corpus of emotional text-to-speech samples. Our analysis shows existing anti-spoofing models struggle with emotional synthetic speech, exposing risks of emotion-targeted attacks. Even when trained on emotional data, the models underperform due to their limited focus on the emotional aspect and show performance disparities across emotions. This highlights the need for an emotion-focused anti-spoofing paradigm in both dataset and methodology. We propose GEM, a gated ensemble of emotion-specialized models with a speech emotion recognition gating network. GEM performs effectively across all emotions and the neutral state, improving defenses against spoofing attacks. We release the EmoSpoof-TTS Dataset: https://emospoof-tts.github.io/Dataset/
Authors:Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Abstract:
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
Authors:Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Abstract:
Long-term forecasting of chaotic systems from short-term observations remains a fundamental and underexplored challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Existing approaches often rely on long-term training data or focus on short-term sequence correlations, struggling to maintain predictive stability and dynamical coherence over extended horizons. We propose PhyxMamba, a novel framework that integrates a Mamba-based state-space model with physics-informed principles to capture the underlying dynamics of chaotic systems. By reconstructing the attractor manifold from brief observations using time-delay embeddings, PhyxMamba extracts global dynamical features essential for accurate forecasting. Our generative training scheme enables Mamba to replicate the physical process, augmented by multi-token prediction and attractor geometry regularization for physical constraints, enhancing prediction accuracy and preserving key statistical invariants. Extensive evaluations on diverse simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior long-term forecasting and faithfully captures essential dynamical invariants from short-term data. This framework opens new avenues for reliably predicting chaotic systems under observation-scarce conditions, with broad implications across climate science, neuroscience, epidemiology, and beyond. Our code is open-source at https://github.com/tsinghua-fib-lab/PhyxMamba.
中文摘要:PhyxMamba是一种新颖的物理信息Mamba框架,通过重构吸引子流形并保持动力学不变量,能够基于短期观测数据实现对混沌系统的精准长期预测。
English Summary: PhyxMamba is a novel physics-informed Mamba framework that accurately forecasts chaotic systems long-term from short observations by reconstructing attractor manifolds and preserving dynamical invariants.
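The attractor-reconstruction step rests on a standard time-delay (Takens) embedding, sketched below on a synthetic signal; PhyxMamba's state-space model, multi-token prediction, and geometry regularizer are not shown.

```python
# Time-delay embedding: map a scalar series into delay vectors that trace out a
# reconstructed attractor manifold, the input representation described above.
import numpy as np

def delay_embed(x, dim=3, tau=5):
    """Return vectors (x[t], x[t+tau], ..., x[t+(dim-1)*tau]) for all valid t."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

t = np.linspace(0, 40, 2000)
x = np.sin(t) + 0.5 * np.sin(2.3 * t)      # stand-in for a short observed trajectory
embedded = delay_embed(x, dim=3, tau=20)    # points on the reconstructed manifold
print(embedded.shape)                        # (1960, 3)
```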
Authors:Zheng Gong, Ziyi Jiang, Weihao Gao, Deng Zhuo, Lan Ma
Abstract:
mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of optimization variables considered (multi-objective capability). Furthermore, existing methods often fall short of comprehensively incorporating the factors related to the mRNA lifecycle and translation process, including intrinsic mRNA sequence properties, secondary structure, translation elongation kinetics, and tRNA availability. To address these limitations, we introduce RNop, a novel deep learning-based method for mRNA optimization. We collect a large-scale dataset containing over 3 million sequences and design four specialized loss functions, GPLoss, CAILoss, tAILoss, and MFELoss, which simultaneously enable explicit control over sequence fidelity while optimizing species-specific codon adaptation, tRNA availability, and desirable mRNA secondary structure features. We then demonstrate RNop's effectiveness through extensive in silico and in vivo experiments. RNop ensures high sequence fidelity, achieves a computational throughput of up to 47.32 sequences/s, and yields optimized mRNA sequences that significantly increase protein expression for functional proteins compared to controls. RNop surpasses current methodologies in both quantitative metrics and experimental validation, opening a new avenue toward efficient and effective mRNA design. Code and models will be available at https://github.com/HudenJear/RPLoss.
Chinese: RNop是一种新型的深度学习方法,通过确保高序列保真度、计算效率及对密码子适应性、tRNA可用性和mRNA结构的多目标控制,克服了当前mRNA优化的局限,显著提高了蛋白质表达水平。
English: RNop is a novel deep learning method that overcomes current limitations in mRNA optimization by ensuring high sequence fidelity, computational efficiency, and multi-objective control over codon adaptation, tRNA availability, and mRNA structure, resulting in significantly enhanced protein expression.
Authors:Renye Zhang, Mengyun Yang, Qichang Zhao, Jianxin Wang
Abstract:
Drug repositioning aims to identify potential new indications for existing drugs to reduce the time and financial costs associated with developing new drugs. Most existing deep learning-based drug repositioning methods predominantly utilize graph-based representations. However, graph-based drug repositioning methods struggle to perform effective inference in cold-start scenarios involving novel drugs because such drugs lack association information with diseases. Unlike traditional graph-based approaches, we propose a bidirectional behavior learning strategy for drug repositioning, known as BiBLDR. This innovative framework redefines drug repositioning as a behavior sequential learning task to capture drug-disease interaction patterns. First, we construct bidirectional behavioral sequences based on the drug and disease sides. The consideration of bidirectional information ensures a more meticulous and rigorous characterization of the behavioral sequences. Subsequently, we propose a two-stage strategy for drug repositioning. In the first stage, we construct prototype spaces to characterize the representational attributes of drugs and diseases. In the second stage, these refined prototypes and bidirectional behavior sequence data are leveraged to predict potential drug-disease associations. Based on this learning approach, the model can more robustly and precisely capture the interactive relationships between drug and disease features from bidirectional behavioral sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on benchmark datasets. Meanwhile, BiBLDR demonstrates significantly superior performance compared to previous methods in cold-start scenarios. Our code is published at https://github.com/Renyeeah/BiBLDR.
中文摘要:本文提出的BiBLDR框架将药物重定位重新定义为双向行为学习任务,通过构建原型空间和利用双向行为序列数据,有效解决了传统图方法在冷启动场景下对新药关联预测的局限性。
English Summary: The proposed BiBLDR framework redefines drug repositioning as a bidirectional behavior learning task to overcome cold-start limitations of graph-based methods by capturing drug-disease interaction patterns through prototype spaces and sequential data analysis.
Authors:Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh
Abstract:
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient ($\approx 120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
中文摘要:本文提出OMNIGUARD方法,通过利用大语言模型中跨语言和跨模态对齐的内部表征来检测有害提示,在多语言和跨模态场景下显著提升了分类准确率与检测效率。
English Summary: The paper introduces OMNIGUARD, a method that detects harmful prompts across languages and modalities by leveraging aligned internal representations of LLMs, significantly improving classification accuracy and efficiency over existing baselines.
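The underlying recipe, fitting one lightweight classifier on internal representations that are (assumed to be) aligned across languages and modalities, can be sketched as follows. The random arrays below are placeholders for hidden states extracted during generation, not OmniGuard's actual features or model.

```python
# Sketch of a language-agnostic harmful-prompt detector: a simple classifier trained
# on hidden-state embeddings that live in one shared space across languages/modalities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholders for aligned LLM hidden states of harmful vs. benign prompts.
H_harmful = rng.normal(loc=0.5, size=(200, 64))
H_benign  = rng.normal(loc=-0.5, size=(200, 64))
X = np.vstack([H_harmful, H_benign])
y = np.array([1] * 200 + [0] * 200)

clf = LogisticRegression(max_iter=1000).fit(X, y)
# At inference, embeddings are already computed during generation, so scoring is cheap.
print(clf.predict(rng.normal(loc=0.5, size=(1, 64))))   # likely [1] (flagged as harmful)
```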
Authors:Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens
Abstract:
Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-$N$ sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
中文: 当前大型语言模型在可控性方面存在不足,表现出覆盖率低、校准偏差和副作用持续等问题,现有对齐策略虽效果各异但仍显不足。
English: Current large language models struggle with steerability, exhibiting issues like poor coverage, miscalibration, and persistent side effects, and existing alignment strategies remain insufficient despite varying effectiveness.
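The goal-space framing reduces steerability failures to vector arithmetic over text attributes, as in the small sketch below. The attribute names and values are illustrative, not the benchmark's actual dimensions or measurements.

```python
# Sketch of the goal-space view: goals and outputs are attribute vectors, so
# miscalibration (overshoot on requested dimensions) and side effects (drift on
# untouched dimensions) are per-dimension differences.
import numpy as np

attributes = ["reading_difficulty", "formality", "length"]
source = np.array([0.40, 0.50, 0.30])    # attributes of the original text
goal   = np.array([0.70, 0.50, 0.30])    # user only asks to raise reading difficulty
output = np.array([0.85, 0.65, 0.45])    # attributes measured on the rewrite

requested = goal != source
miscalibration = output[requested] - goal[requested]        # overshoot on requested dims
side_effects   = output[~requested] - source[~requested]    # drift on untouched dims
print(dict(zip(np.array(attributes)[requested], miscalibration)))
print(dict(zip(np.array(attributes)[~requested], side_effects)))
```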
Authors:Amber Yijia Zheng, Cedar Site Bai, Brian Bullins, Raymond A. Yeh
Abstract:
Model immunization aims to pre-train models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, the key understanding of when immunization is possible and a precise definition of an immunized model remain unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization. The code is available at https://github.com/amberyzheng/model-immunization-cond-num.
Chinese: 本研究提出了一个基于Hessian矩阵条件数的框架来分析模型免疫,并通过设计控制条件数的正则化算法,在线性模型和非线性深度网络中成功实现了对抗有害微调的模型免疫保护。
English: This study introduces a framework using the Hessian matrix's condition number to analyze and achieve model immunization, proposing an algorithm that effectively immunizes both linear and non-linear models against harmful fine-tuning while preserving utility.
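For a linear model with squared loss, the Hessian is X^T X / n, so the quantity the framework analyzes is straightforward to compute. The sketch below is a toy illustration of that quantity only, not the authors' immunization algorithm or regularizers.

```python
# Toy illustration: condition number of the loss Hessian for a linear model.
# An ill-conditioned Hessian on the "harmful" task means gradient descent makes
# slow progress there, which is the handle the immunization framework exploits.
import numpy as np

def hessian_condition_number(X):
    H = X.T @ X / X.shape[0]       # Hessian of 0.5 * mean squared error for a linear model
    return np.linalg.cond(H)

rng = np.random.default_rng(0)
X_harmful = rng.standard_normal((500, 20)) @ np.diag(np.logspace(0, -4, 20))  # ill-conditioned
X_benign  = rng.standard_normal((500, 20))                                     # well-conditioned
print(f"harmful task cond: {hessian_condition_number(X_harmful):.2e}")  # huge -> slow fine-tuning
print(f"benign task cond:  {hessian_condition_number(X_benign):.2e}")   # modest -> utility kept
```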
Authors:Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
Authors:Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink
Abstract:
Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code is available at https://github.com/EPFL-IMOS/TrustVLM.
Chinese: TrustVLM是一种无需训练的框架,通过基于图像嵌入的新型置信度评分函数提升视觉语言模型的可靠性,在多个数据集和架构上实现了误分类检测的最先进性能。
English: TrustVLM is a training-free framework that enhances the reliability of Vision-Language Models by introducing a novel confidence-scoring function based on image embeddings, achieving state-of-the-art performance in misclassification detection across multiple datasets and architectures.
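One way to picture the confidence score: compare the image embedding against class prototypes built in the image embedding space and use the margin over the runner-up class. This is a simplified stand-in for TrustVLM's actual scoring function; the prototypes, dimensions, and margin rule below are illustrative assumptions.

```python
# Sketch of an image-space confidence score: how strongly does the image embedding
# prefer the predicted class's prototype over its closest competitor?
import numpy as np

def l2_normalize(v, axis=-1):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + 1e-8)

def image_space_confidence(img_emb, predicted_class, image_prototypes):
    """image_prototypes: (num_classes, d) class centroids built from a small image set."""
    sims = l2_normalize(image_prototypes) @ l2_normalize(img_emb)
    competitor = np.max(np.delete(sims, predicted_class))   # strongest other class
    return sims[predicted_class] - competitor               # large margin -> trust prediction

rng = np.random.default_rng(0)
protos = l2_normalize(rng.standard_normal((10, 512)))
query = protos[3] + 0.1 * rng.standard_normal(512)          # an image close to class 3
print(image_space_confidence(query, predicted_class=3, image_prototypes=protos))
```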
Authors:Juncheol Shin, Minsang Seok, Seonggon Kim, Eunhyeok Park
Abstract:
Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, new challenges arise when merging is applied in practical scenarios, such as to quantized models. In such scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ (Hessian and distant regularizing quantization), that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study of this challenge, and extensive experiments confirm its effectiveness.
Chinese: 本研究提出了HDRQ,一种新颖的训练后量化方法,通过最小化与源模型的偏差并平滑损失表面,解决了量化带来的挑战,实现了多目标领域自适应中的有效模型融合。
English: This study introduces HDRQ, a novel post-training quantization method that minimizes deviation from the source model and flattens the loss surface to enable effective model merging for multi-target domain adaptation, addressing challenges posed by quantization.
Authors:Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
Abstract:
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
中文: Muddit作为一种统一的离散扩散变换器,通过将预训练骨干网络的强大视觉先验与轻量级文本解码器相结合,实现了快速、并行且高质量的多模态生成,在效率和性能上均优于更大的自回归模型。
English: Muddit is a unified discrete diffusion transformer that enables fast, parallel, and high-quality multimodal generation by integrating strong visual priors from a pretrained backbone with a lightweight text decoder, outperforming larger autoregressive models in efficiency and quality.
Authors:Xi Chen, Soham Jana, Christopher A. Metzler, Arian Maleki, Shirin Jalali
Abstract:
Multilook coherent imaging is a widely used technique in applications such as digital holography, ultrasound imaging, and synthetic aperture radar. A central challenge in these systems is the presence of multiplicative noise, commonly known as speckle, which degrades image quality. Despite the widespread use of coherent imaging systems, their theoretical foundations remain relatively underexplored. In this paper, we study both the theoretical and algorithmic aspects of likelihood-based approaches for multilook coherent imaging, providing a rigorous framework for analysis and method development. Our theoretical contributions include establishing the first theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our results capture the dependence of MSE on the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we employ projected gradient descent (PGD) as an efficient method for computing the maximum likelihood solution. Furthermore, we introduce two key ideas to enhance the practical performance of PGD. First, we incorporate the Newton-Schulz algorithm to compute matrix inverses within the PGD iterations, significantly reducing computational complexity. Second, we develop a bagging strategy to mitigate projection errors introduced during PGD updates. We demonstrate that combining these techniques with PGD yields state-of-the-art performance. Our code is available at https://github.com/Computational-Imaging-RU/Bagged-DIP-Speckle.
中文: 本文为多视相干成像建立了理论框架,首次推导出深度图像先验下最大似然估计的均方误差上界,并提出结合矩阵逆优化和误差抑制的投影梯度下降算法,实现了最先进的性能。
English: This paper establishes a theoretical framework for multilook coherent imaging, deriving the first MSE bound for maximum likelihood estimation under deep image priors and proposing an enhanced projected gradient descent algorithm with matrix inversion optimization and error mitigation for state-of-the-art performance.
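The Newton-Schulz step replaces explicit matrix inversion with a few matrix multiplications, X_{k+1} = X_k (2I - A X_k). The generic sketch below uses a classical safe initialization and is not the authors' implementation; their bagging strategy and PGD loop are not shown.

```python
# Newton-Schulz iteration for approximating a matrix inverse using only matrix
# multiplies, which is what makes it attractive inside PGD-style updates.
import numpy as np

def newton_schulz_inverse(A, num_iters=30):
    n = A.shape[0]
    # Classical initialization: guarantees convergence since ||A||_2^2 <= ||A||_1 ||A||_inf.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(num_iters):
        X = X @ (2 * I - A @ X)     # quadratic convergence toward A^{-1}
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 32)) + 32 * np.eye(32)    # well-conditioned test matrix
X = newton_schulz_inverse(A)
print(np.max(np.abs(A @ X - np.eye(32))))              # tiny residual, matches np.linalg.inv
```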
Authors:Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
Abstract:
Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherent transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction (where accuracy improves from 88% to 97%) and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason
中文: BioReason创新性地整合了DNA基础模型与大语言模型,实现了多步骤生物推理,在基准测试中性能提升15%,并提供可解释的逐步分析,推动基因组数据的深度理解和假设生成。
English: BioReason is a novel AI architecture that integrates a DNA foundation model with a Large Language Model, enabling advanced multi-step biological reasoning and achieving a 15% performance improvement on benchmarks while providing interpretable, step-by-step explanations.
Authors:Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
Abstract:
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide the fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
中文摘要:分段策略优化(SPO)是一种新颖的强化学习框架,通过引入分段级优势估计克服了词元级和轨迹级方法的局限,在不依赖评论家模型的情况下实现了更优的推理性能。
English Summary: Segment Policy Optimization (SPO) is a novel reinforcement learning framework that introduces segment-level advantage estimation to overcome the limitations of token-level and trajectory-level methods, achieving superior reasoning performance without requiring a critic model.
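Under the simplifying assumptions of a terminal-only reward and Monte Carlo value estimates at cutpoints, segment-level advantages reduce to value differences between consecutive cutpoints. In the sketch below, `rollout_from` is a hypothetical placeholder for sampling completions with the current policy; the partitioning and estimation strategies in SPO are richer than this.

```python
# Sketch of critic-free, segment-level advantage estimation: estimate a value at each
# cutpoint by Monte Carlo rollouts, then take differences between adjacent cutpoints.
import random

def rollout_from(prefix_tokens):
    """Placeholder: sample a completion from the current policy, return its final reward."""
    return random.random() < 0.3 + 0.05 * len(prefix_tokens)   # toy success probability

def segment_advantages(response_tokens, cutpoints, num_rollouts=8):
    values = []
    for c in cutpoints:
        prefix = response_tokens[:c]
        values.append(sum(rollout_from(prefix) for _ in range(num_rollouts)) / num_rollouts)
    # Advantage of segment k (tokens between cutpoints k and k+1) = V_{k+1} - V_k.
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]

tokens = list(range(40))                 # a toy response
cutpoints = [0, 10, 20, 30, 40]          # flexible segment partition
print(segment_advantages(tokens, cutpoints))
```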
Authors:Xu Chu, Xinrong Chen, Guanyu Wang, Zhijie Tan, Kui Huang, Wenyu Lv, Tong Mo, Weiping Li
Abstract:
Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, causing visual information to receive less attention and may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and to balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attend to visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy performance while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again
中文摘要:Qwen-LookAgain是一种新型视觉语言推理模型,通过引入视觉-文本反思过程和强化学习机制,引导模型在推理过程中重新关注视觉信息,从而有效减少幻觉现象,在多项视觉问答基准测试中取得领先准确率。
English Summary: Qwen-LookAgain is a novel vision-language reasoning model that mitigates hallucinations by incorporating a vision-text reflection process and reinforcement learning to guide the model in re-attending to visual information during reasoning, achieving leading accuracy on multiple benchmarks.
Authors:Raj Ghugare, Benjamin Eysenbach
Abstract:
Modern reinforcement learning (RL) algorithms have found success by using powerful probabilistic models, such as transformers, energy-based models, and diffusion/flow-based models. To use them, RL researchers often pay the price of accommodating these models in their algorithms -- diffusion models are expressive but computationally intensive due to their reliance on solving differential equations, while autoregressive transformer models are scalable but typically require learning discrete representations. Normalizing flows (NFs), by contrast, provide an appealing alternative, as they enable likelihoods and sampling without solving differential equations or relying on autoregressive architectures. However, their potential in RL has received limited attention, partly due to the prevailing belief that normalizing flows lack sufficient expressivity. We show that this is not the case. Building on recent work in NFs, we propose a single NF architecture which integrates seamlessly into RL algorithms, serving as a policy, Q-function, and occupancy measure. Our approach leads to much simpler algorithms and achieves higher performance in imitation learning, offline RL, goal-conditioned RL, and unsupervised RL.
Authors:Ron Shapira Weber, Shahar Ben Ishay, Andrey Lavrinenko, Shahaf E. Finder, Oren Freifeld
Abstract:
Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint
Chinese: TimePoint是一种自监督方法,通过从合成数据中学习关键点来加速动态时间规整(DTW)对齐,相比标准DTW实现了更快速度和更高精度。
English: TimePoint is a self-supervised method that accelerates Dynamic Time Warping (DTW) alignment by learning keypoints from synthetic data, achieving faster and more accurate results than standard DTW.
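The speedup comes from running the same quadratic-time DTW recursion on a handful of keypoints rather than the full signals. In the sketch below, local extrema serve as a toy stand-in for TimePoint's learned keypoints and descriptors.

```python
# Classic O(N*M) DTW, applied once to full signals and once to sparse keypoints:
# the warping table shrinks from 600x600 to a few dozen cells.
import numpy as np

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def toy_keypoints(x):
    # Local extrema as a cheap stand-in for learned keypoints.
    idx = [i for i in range(1, len(x) - 1)
           if (x[i] - x[i - 1]) * (x[i + 1] - x[i]) < 0]
    return x[idx]

t = np.linspace(0, 6 * np.pi, 600)
a, b = np.sin(t), np.sin(t ** 1.05)             # a nonlinearly warped pair
print(dtw(a, b))                                 # full signals: 600 x 600 table
print(dtw(toy_keypoints(a), toy_keypoints(b)))   # keypoints only: far smaller table
```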
Authors:Weizhe Kong, Xiao Wang, Ruichong Gao, Chenglong Li, Yu Zhang, Xing Yang, Yaowei Wang, Jin Tang
Abstract:
Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, the potential vulnerability and anti-interference ability have still not been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt the adversarial semantic and label-perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validate the effectiveness of our proposed adversarial attack and defense strategies for pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR.
中文摘要:本文针对行人属性识别首次提出对抗攻击与防御框架,通过多模态Transformer和CLIP嵌入技术开发攻击策略及语义偏移防御方法,在数字与物理场景下的实验验证了其有效性。
English Summary: This paper introduces the first adversarial attack and defense framework for pedestrian attribute recognition, utilizing multimodal Transformers and CLIP-based embeddings to develop both attack strategies and a semantic offset defense method validated across digital and physical domains.
Authors:Yunkee Chae, Kyogu Lee
Abstract:
We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.
Authors:James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng
Abstract:
Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
中文: 本研究揭示大型语言模型的生成长度与事实准确性呈负相关,主要原因是知识耗尽导致可靠信息逐渐减少,而非错误传播或长上下文问题。
English: This study reveals that longer responses from large language models exhibit lower factual precision due to facts exhaustion, where the model gradually depletes its reliable knowledge, rather than error propagation or long context issues.
Authors:Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, Min-Ling Zhang
Abstract:
Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.
中文: LADA通过向冻结的CLIP图像编码器添加标签特定记忆单元,实现判别性特征生成,并利用特征蒸馏防止灾难性遗忘,在持续学习中达到最优性能。
English: LADA introduces label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation and preventing catastrophic forgetting through feature distillation, achieving state-of-the-art continual learning performance.
Authors:Wanfu Gao, Jun Gao, Qingqi Han, Hanlin Pan, Kunpeng Liu
Abstract:
The rapid growth in feature dimension may introduce implicit associations between features and labels in multi-label datasets, making the relationships between features and labels increasingly complex. Moreover, existing methods often adopt low-dimensional linear decomposition to explore the associations between features and labels. However, linear decomposition struggles to capture complex nonlinear associations and may lead to misalignment between the feature space and the label space. To address these two critical challenges, we propose innovative solutions. First, we design a random walk graph that integrates feature-feature, label-label, and feature-label relationships to accurately capture nonlinear and implicit indirect associations, while optimizing the latent representations of associations between features and labels after low-rank decomposition. Second, we align the variable spaces by leveraging low-dimensional representation coefficients, while preserving the manifold structure between the original high-dimensional multi-label data and the low-dimensional representation space. Extensive experiments and ablation studies conducted on seven benchmark datasets and three representative datasets using various evaluation metrics demonstrate the superiority of the proposed method (code: https://github.com/Heilong623/-GRW-).
中文总结:该方法通过设计融合特征与标签关系的随机游走图来捕捉非线性关联,并利用低维表示系数对齐变量空间,有效解决了多标签数据中特征与标签的复杂关联问题。
English Summary: The proposed method addresses complex nonlinear feature-label associations in multi-label datasets by designing a random walk graph that captures implicit relationships and aligns variable spaces while preserving manifold structures.
Authors:Lifan Zhao, Yanyan Shen, Zhaoyang Liu, Xue Wang, Jiaji Deng
Abstract:
Scaling laws motivate the development of Time Series Foundation Models (TSFMs) that pre-train vast parameters and achieve remarkable zero-shot forecasting performance. Surprisingly, even after fine-tuning, TSFMs cannot consistently outperform smaller, specialized models trained on full-shot downstream data. A key question is how to realize effective adaptation of TSFMs for a target forecasting task. Through empirical studies on various TSFMs, the pre-trained models often exhibit inherent sparsity and redundancy in computation, suggesting that TSFMs have learned to activate task-relevant network substructures to accommodate diverse forecasting tasks. To preserve this valuable prior knowledge, we propose a structured pruning method to regularize the subsequent fine-tuning process by focusing it on a more relevant and compact parameter space. Extensive experiments on seven TSFMs and six benchmarks demonstrate that fine-tuning a smaller, pruned TSFM significantly improves forecasting performance compared to fine-tuning original models. This prune-then-finetune paradigm often enables TSFMs to achieve state-of-the-art performance and surpass strong specialized baselines. Source code is made publicly available at https://github.com/SJTU-DMTai/Prune-then-Finetune.
Chinese: 尺度定律推动了时间序列基础模型的发展以实现卓越的零样本预测,但微调后仍无法持续超越小型专业模型,因此提出结构化剪枝方法,通过聚焦相关参数显著提升预测性能。
English: Scaling laws drive the development of Time Series Foundation Models (TSFMs) for superior zero-shot forecasting, but fine-tuning them fails to consistently outperform smaller specialized models, leading to a proposed structured pruning method that enhances performance by focusing on relevant parameters.
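A minimal sketch of the prune-then-finetune idea, under the assumption of a simple channel-norm criterion (the paper's actual criterion and granularity may differ): shrink a linear layer to its strongest output channels so that subsequent fine-tuning operates on a smaller, task-relevant parameter space.

```python
# Structured pruning of a linear layer by output-channel L2 norm; the pruned layer
# would then be fine-tuned on the downstream forecasting task.
import torch

def prune_output_channels(linear, keep_ratio=0.5):
    """Return a smaller nn.Linear keeping the output channels with the largest L2 norm."""
    norms = linear.weight.detach().norm(dim=1)            # one score per output channel
    k = max(1, int(keep_ratio * linear.out_features))
    keep = torch.topk(norms, k).indices.sort().values
    pruned = torch.nn.Linear(linear.in_features, k, bias=linear.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(linear.weight[keep])
        if linear.bias is not None:
            pruned.bias.copy_(linear.bias[keep])
    return pruned, keep

layer = torch.nn.Linear(512, 256)
pruned, kept = prune_output_channels(layer, keep_ratio=0.25)
print(pruned)    # Linear(in_features=512, out_features=64, bias=True)
```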
Authors:Shiwei Li, Xiandi Luo, Xing Tang, Haozhao Wang, Hao Chen, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li
Abstract:
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
中文摘要:本研究通过理论分析和实验验证,发现对LoRA双矩阵进行非零初始化能提升其对次优学习率的鲁棒性,且不影响微调效果,从而证明无需严格从预训练模型开始微调。
English Summary: This study demonstrates that initializing both LoRA matrices with non-zero values enhances robustness to suboptimal learning rates without compromising fine-tuning performance, challenging the necessity of starting strictly from pretrained weights.
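The finding is easy to restate in code: standard LoRA zeroes one factor so fine-tuning starts exactly from the pretrained weights, whereas the non-zero variant initializes both factors randomly, adding small noise at step 0 but improving robustness to small learning rates. The scaling and initialization distributions below are illustrative choices, not the paper's prescription.

```python
# Minimal LoRA layer contrasting zero init of B (standard) with non-zero init of both A and B.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank=8, alpha=16, nonzero_init=True):
        super().__init__()
        self.base = base.requires_grad_(False)      # frozen pretrained weight
        self.scale = alpha / rank
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) / rank ** 0.5)
        if nonzero_init:
            # Both factors non-zero: fine-tuning no longer starts exactly at the pretrained model.
            self.B = torch.nn.Parameter(torch.randn(base.out_features, rank) / rank ** 0.5)
        else:
            self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))   # standard LoRA

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(768, 768), rank=8, nonzero_init=True)
print(layer(torch.randn(2, 768)).shape)    # torch.Size([2, 768])
```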
Authors:Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li
Abstract:
To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Update Decomposition (MUD), Block-wise Kronecker Decomposition (BKD), and Aggregation-Aware Decomposition (AAD), each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed MUD. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods. The code is available at https://github.com/Leopold1423/fedmud-icml25.
Chinese: 本文针对联邦学习中的低秩分解方法,提出了模型更新分解、分块克罗内克分解和聚合感知分解三种互补技术,在理论保证下实现了更快的收敛速度和更高的精度。
English: This paper introduces three complementary techniques—Model Update Decomposition, Block-wise Kronecker Decomposition, and Aggregation-Aware Decomposition—to enhance low-rank decomposition in federated learning, achieving faster convergence and higher accuracy with theoretical guarantees.
Authors:Shohei Enomoto
Abstract:
Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at https://github.com/s-enmt/PseudoDomainBed.
中文摘要:本文提出的伪多源域泛化(PMDG)框架通过从单源数据生成合成多域数据集,解决了多源域泛化的实际应用限制,使MDG算法能在更现实的单源场景中有效运用。
English Summary: The proposed Pseudo Multi-source Domain Generalization (PMDG) framework addresses the limitations of multi-source domain generalization by generating synthetic multi-domain datasets from single-source data, enabling effective application of MDG algorithms in practical settings.
Authors:Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song
Abstract:
Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
中文: 大语言模型在生成正确代码方面存在挑战,因此推出了Verina基准来评估可验证代码生成,揭示了当前模型在证明和规范能力上的显著不足。
English: Large language models face challenges in generating correct code, prompting the development of Verina, a benchmark for evaluating verifiable code generation that reveals significant gaps in current models' proof and specification capabilities.
Authors:Pengfei Zhou, Yunlong Liu, Junli Liang, Qi Song, Xiangyang Li
Abstract:
Time series forecasting with exogenous variables is a critical emerging paradigm that presents unique challenges in modeling dependencies between variables. Traditional models often struggle to differentiate between endogenous and exogenous variables, leading to inefficiencies and overfitting. In this paper, we introduce CrossLinear, a novel Linear-based forecasting model that addresses these challenges by incorporating a plug-and-play cross-correlation embedding module. This lightweight module captures the dependencies between variables with minimal computational cost and seamlessly integrates into existing neural networks. Specifically, it captures time-invariant and direct variable dependencies while disregarding time-varying or indirect dependencies, thereby mitigating the risk of overfitting in dependency modeling and contributing to consistent performance improvements. Furthermore, CrossLinear employs patch-wise processing and a global linear head to effectively capture both short-term and long-term temporal dependencies, further improving its forecasting precision. Extensive experiments on 12 real-world datasets demonstrate that CrossLinear achieves superior performance in both short-term and long-term forecasting tasks. The ablation study underscores the effectiveness of the cross-correlation embedding module. Additionally, the generalizability of this module makes it a valuable plug-in for various forecasting tasks across different domains. Codes are available at https://github.com/mumiao2000/CrossLinear.
Chinese: CrossLinear是一种基于线性预测的新模型,通过轻量化的互相关嵌入模块有效捕捉变量间依赖关系并防止过拟合,在多个数据集的短期和长期预测任务中均表现出卓越性能。
English: CrossLinear is a novel linear-based forecasting model that introduces a lightweight cross-correlation embedding module to efficiently capture dependencies between variables while mitigating overfitting, achieving superior performance in both short-term and long-term forecasting tasks across multiple datasets.
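A minimal PyTorch sketch of the kind of plug-and-play cross-correlation embedding the abstract describes is given below. The module name, the Pearson-correlation weighting, and the tensor shapes are assumptions for illustration rather than CrossLinear's exact design, and the patch-wise processing with a global linear head is omitted.

```python
import torch
import torch.nn as nn

class CrossCorrEmbedding(nn.Module):
    """Hypothetical plug-in: mixes exogenous channels into the endogenous series,
    weighted by their time-invariant correlation with the target."""
    def __init__(self, n_exog: int):
        super().__init__()
        self.proj = nn.Linear(n_exog, 1, bias=False)

    def forward(self, y, x):            # y: (B, L, 1) endogenous, x: (B, L, C) exogenous
        yc = y - y.mean(dim=1, keepdim=True)
        xc = x - x.mean(dim=1, keepdim=True)
        corr = (yc * xc).mean(dim=1) / (yc.std(dim=1) * xc.std(dim=1) + 1e-8)  # (B, C)
        return y + self.proj(x * corr.unsqueeze(1))          # correlation-gated mixing
```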
Authors:Ning Liu, Yue Yu
Abstract:
Attention mechanisms have emerged as transformative tools in core AI domains such as natural language processing and computer vision. Yet, their largely untapped potential for modeling intricate physical systems presents a compelling frontier. Learning such systems often entails discovering operators that map between functional spaces using limited instances of function pairs -- a task commonly framed as a severely ill-posed inverse PDE problem. In this work, we introduce Neural Interpretable PDEs (NIPS), a novel neural operator architecture that builds upon and enhances Nonlocal Attention Operators (NAO) in both predictive accuracy and computational efficiency. NIPS employs a linear attention mechanism to enable scalable learning and integrates a learnable kernel network that acts as a channel-independent convolution in Fourier space. As a consequence, NIPS eliminates the need to explicitly compute and store large pairwise interactions, effectively amortizing the cost of handling spatial interactions into the Fourier transform. Empirical evaluations demonstrate that NIPS consistently surpasses NAO and other baselines across diverse benchmarks, heralding a substantial leap in scalable, interpretable, and efficient physics learning. Our code and data accompanying this paper are available at https://github.com/fishmoon1234/Nonlocal-Attention-Operator.
中文摘要:本文提出神经可解释偏微分方程(NIPS),这是一种通过线性注意力机制和傅里叶空间卷积增强非局部注意力算子的新架构,在物理学习基准测试中显著提升了预测精度与计算效率。
English Summary: The paper introduces Neural Interpretable PDEs (NIPS), a neural operator architecture that enhances Nonlocal Attention Operators with improved accuracy and efficiency through linear attention and Fourier-space convolution, outperforming existing methods in physics learning benchmarks.
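The channel-independent convolution in Fourier space can be sketched as a minimal layer like the one below; the number of retained modes and the initialization are assumptions, and the full NIPS architecture additionally couples such a kernel with linear attention, which is not shown.

```python
import torch
import torch.nn as nn

class FourierKernelConv(nn.Module):
    """Hypothetical channel-independent convolution in Fourier space: every channel's
    spectrum is multiplied by the same learnable kernel on the first `modes` modes."""
    def __init__(self, modes: int):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(modes, dtype=torch.cfloat))

    def forward(self, u):                          # u: (B, C, N) real-valued field
        u_hat = torch.fft.rfft(u, dim=-1)          # (B, C, N//2 + 1)
        m = min(self.weight.shape[0], u_hat.shape[-1])
        out = torch.zeros_like(u_hat)
        out[..., :m] = u_hat[..., :m] * self.weight[:m]
        return torch.fft.irfft(out, n=u.shape[-1], dim=-1)
```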
Authors:Zhen Xiang, Aliyah R. Hsu, Austin V. Zane, Aaron E. Kornblith, Margaret J. Lin-Martore, Jasmanpreet C. Kaur, Vasuda M. Dokiparthi, Bo Li, Bin Yu
Abstract:
Clinical decision-making is inherently complex and fast-paced, particularly in emergency departments (EDs) where critical, rapid and high-stakes decisions are made. Clinical Decision Rules (CDRs) are standardized evidence-based tools that combine signs, symptoms, and clinical variables into decision trees to make consistent and accurate diagnoses. CDR usage is often hindered by the clinician's cognitive load, limiting their ability to quickly recall and apply the appropriate rules. We introduce CDR-Agent, a novel LLM-based system designed to enhance ED decision-making by autonomously identifying and applying the most appropriate CDRs based on unstructured clinical notes. To validate CDR-Agent, we curated two novel ED datasets: synthetic and CDR-Bench, although CDR-Agent is applicable to non-ED clinics. CDR-Agent achieves a 56.3% (synthetic) and 8.7% (CDR-Bench) accuracy gain relative to the standalone LLM baseline in CDR selection. Moreover, CDR-Agent significantly reduces computational overhead. Using these datasets, we demonstrated that CDR-Agent not only selects relevant CDRs efficiently, but makes cautious yet effective imaging decisions by minimizing unnecessary interventions while successfully identifying most positively diagnosed cases, outperforming traditional LLM prompting approaches. Code for our work can be found at: https://github.com/zhenxianglance/medagent-cdr-agent

中文: CDR-Agent是一种基于大语言模型的新型系统,通过自主从非结构化临床记录中筛选并应用合适的临床决策规则来提升急诊科决策水平,相比基准模型显著提高了准确性并降低了计算开销。
English: CDR-Agent is a novel LLM-based system that enhances emergency department decision-making by autonomously selecting and applying appropriate Clinical Decision Rules from unstructured clinical notes, achieving significant accuracy improvements over baseline models while reducing computational overhead.
Authors:Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian
Abstract:
Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.
Chinese: 本文提出DenoiseRotator方法,通过正交变换重新分配参数重要性,增强大语言模型剪枝的鲁棒性并显著减少性能损失。
English: This paper introduces DenoiseRotator, a model-agnostic method that redistributes parameter importance through orthogonal transformations to enhance pruning robustness and reduce performance degradation in large language models.
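The idea of concentrating importance by minimizing the entropy of normalized importance scores under an orthogonal transform can be sketched as below; using squared weight magnitude as the importance proxy and a Cayley parameterization of the orthogonal matrix are illustrative assumptions, and the actual DenoiseRotator constrains its rotations so they can be merged back into the model.

```python
import torch

def importance_entropy(w: torch.Tensor) -> torch.Tensor:
    """Entropy of normalized per-weight importance (squared magnitude as a proxy)."""
    p = w.pow(2).flatten()
    p = p / p.sum()
    return -(p * (p + 1e-12).log()).sum()

def cayley_orthogonal(a: torch.Tensor) -> torch.Tensor:
    """Learnable orthogonal matrix Q = (I - S)(I + S)^-1 with S skew-symmetric."""
    s = a - a.T
    i = torch.eye(a.shape[0], dtype=a.dtype, device=a.device)
    return (i - s) @ torch.linalg.inv(i + s)

# Minimizing importance_entropy(cayley_orthogonal(a) @ w) over `a` concentrates
# importance on fewer weights, making the matrix easier to prune afterwards.
```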
Authors:Minh Nguyen Nhat To, Paul F. R. Wilson, Viet Nguyen, Mohamed Harmanani, Michael Cooper, Fahimeh Fooladgar, Purang Abolmaesumi, Parvin Mousavi, Rahul G. Krishnan
Abstract:
The subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop
中文: 针对机器学习中的子群体分布偏移问题,本文提出的多样化原型集成方法通过用混合原型分类器替代标准分类器,自适应捕捉子群体风险,在多种数据集的最差组准确率上优于现有最优方法。
English: Subpopulation shift in machine learning models is addressed by the proposed Diverse Prototypical Ensembles method, which replaces standard classifiers with a mixture of prototypical classifiers to adaptively capture subpopulation risks and outperforms prior methods in worst-group accuracy across diverse datasets.
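A minimal sketch of replacing the linear classification layer with a mixture of prototypical classifiers is shown below; the cosine-similarity scoring and the plain averaging over members are assumptions, and the diversification objective that trains members on different features and samples is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalEnsembleHead(nn.Module):
    """Replace a linear classifier with E prototype-based members whose class
    scores (cosine similarities to class prototypes) are averaged."""
    def __init__(self, n_members: int, n_classes: int, dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(0.02 * torch.randn(n_members, n_classes, dim))

    def forward(self, z):                               # z: (B, dim) extracted features
        z = F.normalize(z, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)        # (E, C, dim)
        logits = torch.einsum("bd,ecd->bec", z, p)      # per-member cosine scores
        return logits.mean(dim=1)                       # (B, C) ensemble prediction
```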
Authors:Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
Abstract:
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on $\textit{EmergentTTS}$, we introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation $\href{https://github.com/boson-ai/EmergentTTS-Eval-public}{code}$ and the $\href{https://huggingface.co/datasets/bosonai/EmergentTTS-Eval}{dataset}$.
中文:EmergentTTS-Eval基准通过自动化测试生成与评估,覆盖六种复杂TTS场景,利用大语言模型构建多样化测试用例,并采用音频大模型进行多维度评估,其评判结果与人类偏好高度一致。
English: The EmergentTTS-Eval benchmark addresses limitations in existing TTS evaluations by automating test generation and assessment across six complex scenarios, using LLMs to create diverse cases and a Large Audio Language Model for multidimensional evaluation that correlates well with human judgment.
Authors:Peixuan Han, Zijia Liu, Jiaxuan You
Abstract:
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.
Chinese: 针对当前大型语言模型在说服任务中缺乏心理理论推理能力的问题,研究者提出了ToMAP方法,通过心理理论模块和强化学习增强对对手心理状态的分析,仅用30亿参数就在多项指标上显著超越了GPT-4o等更大模型。
English: To address the limitations of current LLMs in Theory of Mind reasoning for persuasion, the researchers developed ToMAP, a method that enhances opponent awareness through specialized modules and reinforcement learning, achieving superior performance over larger models like GPT-4o with only 3B parameters.
Authors:Michael Sun, Orion Foo, Gang Liu, Wojciech Matusik, Jie Chen
Abstract:
Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data. Code is available at https://github.com/shiningsunnyday/induction.
中文: 本文提出了一种基于语法的方法,通过将有向无环图视为无歧义语法的推导,构建出原则性、紧凑的序列表示,可用于生成建模、属性预测和贝叶斯优化等应用。
English: This paper introduces a grammar-based method to create a principled, compact sequential representation of directed acyclic graphs (DAGs) by treating them as derivations from an unambiguous grammar, enabling applications in generative modeling, property prediction, and Bayesian optimization.
Authors:Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
Abstract:
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
中文: NegVQA基准测试表明视觉语言模型在理解否定语义方面存在显著困难,不仅表现大幅下降,还呈现出模型规模与性能间的U型缩放规律。
English: The NegVQA benchmark reveals that vision language models struggle significantly with understanding negation, showing performance drops and a U-shaped scaling trend with model size increases.
Authors:Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari
Abstract:
We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from a simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format delivers substantial compression of the KV cache, up to 45% of dense inference, and thereby enables longer context length and increased tokens/sec throughput of up to 2.23x compared to dense inference. Our pruning mechanism and sparse attention kernel are available at https://github.com/dhjoo98/mustafar.
中文摘要:非结构化稀疏性通过基于幅度的逐令牌剪枝和定制稀疏注意力内核,可在不损失精度的情况下将LLM的KV缓存压缩高达70%,实现2.23倍吞吐量提升并支持更长上下文。
English Summary: Unstructured sparsity enables up to 70% KV cache compression in LLMs without accuracy loss, using per-token magnitude pruning and a custom sparse attention kernel to achieve 2.23x throughput and support longer contexts.
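Per-token magnitude-based pruning of a Key or Value cache can be sketched in a few lines; the bitmap sparse format and the custom attention kernel that make this profitable at runtime are not shown.

```python
import torch

def prune_kv_per_token(cache: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude entries of each token's K (or V) vector.
    cache: (n_tokens, head_dim); sparsity in [0, 1), e.g. 0.7 keeps 30% of entries."""
    k_keep = max(1, int(cache.shape[-1] * (1.0 - sparsity)))
    thresh = cache.abs().topk(k_keep, dim=-1).values[..., -1:]   # per-token cutoff
    return torch.where(cache.abs() >= thresh, cache, torch.zeros_like(cache))
```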
Authors:Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
Abstract:
Existing protein language models (PLMs) generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at https://github.com/yinjunbo/cfpgen.
中文: CFP-Gen是一种扩散语言模型,通过整合多模态功能、序列和结构约束,能够生成具有天然蛋白质功能的新型蛋白质,并在多功能设计上实现高成功率。
English: CFP-Gen is a diffusion language model that integrates multimodal functional, sequence, and structural constraints to generate novel proteins with natural-like functionality and a high success rate in multifunctional designs.
Authors:Zhangyi Hu, Jiemin Wu, Hua Xu, Mingqian Liao, Ninghui Feng, Bo Gao, Songning Lai, Yutao Yue
Abstract:
Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the visual Mask AutoEncoder's (MAE) powerful capability for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE's capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS's superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at https://github.com/WHU-HZY/VIMTS.
中文摘要:VIMTS框架通过将不规则多元时间序列处理为等间隔特征补丁,利用跨通道依赖关系和自监督学习,有效解决了多通道信号不对齐和缺失数据带来的预测挑战,显著提升了预测性能。
English Summary: The VIMTS framework adapts the Visual Mask AutoEncoder to effectively forecast Irregular Multivariate Time Series by processing data into patches, leveraging cross-channel dependencies, and employing self-supervised learning to overcome challenges posed by missing values and unaligned signals.
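The first step, bucketing irregular observations into equal-interval feature patches, can be sketched as follows; the exact patch statistics, the cross-channel completion, and the MAE backbone are not shown, and the helper below is an illustrative assumption rather than VIMTS's implementation.

```python
import numpy as np

def patchify_irregular(times, values, channels, t_max, n_patches, n_channels):
    """Bucket irregular observations into equal-interval patches; each cell holds the
    mean of that channel's observations in the interval (NaN where nothing was seen)."""
    edges = np.linspace(0.0, t_max, n_patches + 1)
    idx = np.clip(np.searchsorted(edges, times, side="right") - 1, 0, n_patches - 1)
    patches = np.full((n_patches, n_channels), np.nan)
    for p in range(n_patches):
        for c in range(n_channels):
            mask = (idx == p) & (channels == c)
            if mask.any():
                patches[p, c] = values[mask].mean()
    return patches   # later completed via cross-channel dependencies and fed to the MAE
```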
Authors:Siddharth Ancha, Sunshine Jiang, Travis Manderson, Laura Brandt, Yilun Du, Philip R. Osteen, Nicholas Roy
Abstract:
In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique is purely test-time that can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: https://siddancha.github.io/anomalies-by-diffusion-synthesis/
Authors:Mert Onur Cakiroglu, Idil Bilge Altun, Mehmet Dalkilic, Elham Buxton, Hasan Kurban
Abstract:
Time series forecasting remains a challenging task for foundation models due to temporal heterogeneity, high dimensionality, and the lack of inherent symbolic structure. In this work, we propose DRAGON (Discrete Representation and Augmented Graph encoding Over de BruijN Graphs), a novel encoder that introduces Multivariate de Bruijn Graphs (MdBGs) to bridge the gap between symbolic representations and neural modeling. DRAGON discretizes continuous input sequences and maps them onto a fixed graph structure, enabling dynamic context recovery via graph-based attention. Integrated as an auxiliary module within a dual-branch architecture, DRAGON augments conventional CNN-based encoders with symbolic, structure-aware representations. All code developed for this study is available at: https://github.com/KurbanIntelligenceLab/MultdBG-Time-Series-Library
中文:DRAGON编码器通过多元德布鲁因图将符号表示与神经建模相结合,在双分支架构中离散化序列并利用基于图的注意力增强传统CNN编码器,以解决时间序列预测的挑战。
English: The DRAGON encoder introduces Multivariate de Bruijn Graphs to bridge symbolic representations with neural modeling for time series forecasting, discretizing sequences and enhancing CNN-based encoders with graph-based attention in a dual-branch architecture.
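One channel of a de Bruijn construction over a discretized series can be sketched as below; the discretization scheme, the multivariate extension, and the graph-based attention on top are not shown, and the function is an illustrative assumption rather than DRAGON's implementation.

```python
from collections import defaultdict

def de_bruijn_edges(symbols, k=3):
    """Nodes are (k-1)-grams of the discretized series; every consecutive
    k-gram contributes one directed edge, counted with multiplicity."""
    edges = defaultdict(int)
    for i in range(len(symbols) - k + 1):
        kmer = tuple(symbols[i:i + k])
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges

# e.g. after a SAX-style discretization of one channel into symbols "a".."d"
print(dict(de_bruijn_edges(list("abbaabba"), k=3)))
```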
Authors:Marco Colussi, Dragan Ahmetovic, Sergio Mascetti
Abstract:
This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Unlike prior works, MIAS-SAM does not require defining a threshold value to obtain the segmentation from the anomaly map. Experimental results conducted on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT), show accurate anomaly segmentation capabilities measured using the DICE score. The code is available at: https://github.com/warpcut/MIAS-SAM
中文: MIAS-SAM提出了一种基于图像块记忆库的新方法,利用SAM编码器从正常数据提取特征进行医学图像异常区域分割,无需设定阈值即可在多种成像模态中实现精准分割。
English: MIAS-SAM introduces a patch-based memory bank approach using SAM encoder features from normal data to segment anomalies in medical images without thresholding, achieving high accuracy across multiple imaging modalities.
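The memory-bank comparison and the center-of-gravity prompt can be sketched with NumPy as follows; the feature extraction with the SAM encoder and the SAM decoder call are omitted, and the nearest-neighbor distance used here is an assumption about how the anomaly map is scored.

```python
import numpy as np

def anomaly_map(patch_feats: np.ndarray, memory_bank: np.ndarray, hw: tuple) -> np.ndarray:
    """Per-patch anomaly score = distance to the nearest normal feature in the bank.
    patch_feats: (P, D) test-image patch embeddings; memory_bank: (M, D) normal features."""
    d = np.linalg.norm(patch_feats[:, None, :] - memory_bank[None, :, :], axis=-1)
    return d.min(axis=1).reshape(hw)

def center_of_gravity(amap: np.ndarray):
    """Point prompt for the SAM decoder: intensity-weighted centroid of the anomaly map."""
    ys, xs = np.indices(amap.shape)
    w = amap / (amap.sum() + 1e-8)
    return float((xs * w).sum()), float((ys * w).sum())    # (x, y) in pixel coordinates
```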
Authors:Tian Qin, Core Francisco Park, Mujin Kwun, Aaron Walsman, Eran Malach, Nikhil Anand, Hidenori Tanaka, David Alvarez-Melis
Abstract:
Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
中文摘要:强化学习方法如GRPO主要提升了大型语言模型在数学推理中的执行稳健性,但由于规划能力不足而面临“覆盖墙”的局限,不过控制实验表明通过改进探索机制可能找到突破这一障碍的潜在路径。
English Summary: Reinforcement learning methods like GRPO primarily enhance LLMs' execution robustness in mathematical reasoning but face a 'coverage wall' due to insufficient planning skills, though controlled experiments suggest potential pathways to overcome this limitation through improved exploration.
Authors:Tamas Spisak, Karl Friston
Abstract:
Attractor dynamics are a hallmark of many complex systems, including the brain. Understanding how such self-organizing dynamics emerge from first principles is crucial for advancing our understanding of neuronal computations and the design of artificial intelligence systems. Here we formalize how attractor networks emerge from the free energy principle applied to a universal partitioning of random dynamical systems. Our approach obviates the need for explicitly imposed learning and inference rules and identifies emergent, but efficient and biologically plausible inference and learning dynamics for such self-organizing systems. These result in a collective, multi-level Bayesian active inference process. Attractors on the free energy landscape encode prior beliefs; inference integrates sensory data into posterior beliefs; and learning fine-tunes couplings to minimize long-term surprise. Analytically and via simulations, we establish that the proposed networks favor approximately orthogonalized attractor representations, a consequence of simultaneously optimizing predictive accuracy and model complexity. These attractors efficiently span the input subspace, enhancing generalization and the mutual information between hidden causes and observable effects. Furthermore, while random data presentation leads to symmetric and sparse couplings, sequential data fosters asymmetric couplings and non-equilibrium steady-state dynamics, offering a natural extension to conventional Boltzmann Machines. Our findings offer a unifying theory of self-organizing attractor networks, providing novel insights for AI and neuroscience.
中文摘要:本研究基于自由能原理构建了吸引子网络的形成机制,揭示了通过吸引子编码信念并优化表征的自组织系统,能够执行贝叶斯主动推理并提升泛化能力。
English Summary: This study formalizes how attractor networks emerge from the free energy principle, revealing self-organizing systems that perform Bayesian active inference through attractors encoding beliefs and optimizing representations for enhanced generalization.
Authors:Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodolà, Simone Calderara, Angelo Porrello
Abstract:
Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint. Code is available at https://github.com/aimagelab/TransFusion.
Chinese: 本研究提出了一种无需数据的迁移方法,通过基于模型重定基原理的权重置换,将微调知识无缝转移到更新的预训练模型上,特别针对Transformer的残差连接和多头注意力层进行了优化。
English: This study introduces a data-free method to transfer fine-tuned knowledge to updated pre-trained models by applying weight permutations based on model re-basin principles, specifically tailored for Transformers to handle residual connections and multi-head attention layers.
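The head-level permutation at the core of the two-level re-basin can be sketched as below; how `perm` is chosen (the spectral matching) and the subsequent intra-head adjustment are not shown, and the weight layout is an assumption.

```python
import torch

def permute_heads(q_w, k_w, v_w, o_w, perm, n_heads: int):
    """Functionally equivalent re-parameterization that reorders attention heads.
    q_w/k_w/v_w: (d_model, d_model) with output rows grouped head-by-head;
    o_w: (d_model, d_model) with input columns grouped head-by-head."""
    d_head = q_w.shape[0] // n_heads

    def permute_rows(w):
        return w.view(n_heads, d_head, -1)[perm].reshape_as(w)

    o_perm = o_w.view(-1, n_heads, d_head)[:, perm, :].reshape_as(o_w)
    return permute_rows(q_w), permute_rows(k_w), permute_rows(v_w), o_perm
```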
Authors:Pawan Neupane, Jian Liu, Jianlin Cheng
Abstract:
Predicting protein complex structures is essential for protein function analysis, protein design, and drug discovery. While AI methods like AlphaFold can predict accurate structural models for many protein complexes, reliably estimating the quality of these predicted models (estimation of model accuracy, or EMA) for model ranking and selection remains a major challenge. A key barrier to developing effective machine learning-based EMA methods is the lack of large, diverse, and well-annotated datasets for training and evaluation. To address this gap, we introduce PSBench, a benchmark suite comprising four large-scale, labeled datasets generated during the 15th and 16th community-wide Critical Assessment of Protein Structure Prediction (CASP15 and CASP16). PSBench includes over one million structural models covering a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties. Each model is annotated with multiple complementary quality scores at the global, local, and interface levels. PSBench also provides multiple evaluation metrics and baseline EMA methods to facilitate rigorous comparisons. To demonstrate PSBench's utility, we trained and evaluated GATE, a graph transformer-based EMA method, on the CASP15 data. GATE was blindly tested in CASP16 (2024), where it ranked among the top-performing EMA methods. These results highlight PSBench as a valuable resource for advancing EMA research in protein complex modeling. PSBench is publicly available at: https://github.com/BioinfoMachineLearning/PSBench.
中文: PSBench通过提供包含超过百万标注结构模型的综合基准,解决了评估蛋白质复合物模型质量的难题,促进了如图变换器方法GATE等机器学习技术的发展,该方法在盲测中表现优异。
English: PSBench addresses the challenge of estimating protein complex model quality by providing a comprehensive benchmark with over one million annotated structural models, facilitating the development of machine learning methods like the graph transformer-based GATE, which demonstrated top performance in blind testing.
Authors:Guoxuan Chen, Lianghao Xia, Chao Huang
Abstract:
Modern recommender systems powered by Graph Neural Networks (GNNs) excel at modeling complex user-item interactions, yet increasingly face scenarios requiring selective forgetting of training data. Beyond user requests to remove specific interactions due to privacy concerns or preference changes, regulatory frameworks mandate recommender systems' ability to eliminate the influence of certain user data from models. This recommendation unlearning challenge presents unique difficulties as removing connections within interaction graphs creates ripple effects throughout the model, potentially impacting recommendations for numerous users. Traditional approaches suffer from significant drawbacks: fragmentation methods damage graph structure and diminish performance, while influence function techniques make assumptions that may not hold in complex GNNs, particularly with self-supervised or random architectures. To address these limitations, we propose a novel model-agnostic pre-training paradigm UnlearnRec that prepares systems for efficient unlearning operations. Our Influence Encoder takes unlearning requests together with existing model parameters and directly produces updated parameters of unlearned model with little fine-tuning, avoiding complete retraining while preserving model performance characteristics. Extensive evaluation on public benchmarks demonstrates that our method delivers exceptional unlearning effectiveness while providing more than 10x speedup compared to retraining approaches. We release our method implementation at: https://github.com/HKUDS/UnlearnRec.
Chinese: 提出的UnlearnRec框架通过影响编码器直接更新模型参数,实现了推荐系统的高效数据遗忘,相比重新训练方法在保持性能的同时获得了超过10倍的速度提升。
English: The proposed UnlearnRec framework enables efficient data removal in recommender systems by using an Influence Encoder to directly update model parameters, achieving over 10x faster unlearning while maintaining performance compared to retraining.
Authors:Yu Zhang, Yuqi Xie, Huihan Liu, Rutav Shah, Michael Wan, Linxi Fan, Yuke Zhu
Abstract:
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, large-scale datasets used for policy training often introduce substantial variability in quality, which can negatively impact performance. As a result, automatically curating datasets by filtering low-quality samples to improve quality becomes essential. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies. SCIZOR targets two complementary sources of low-quality data: suboptimal data, which hinders learning with undesirable actions, and redundant data, which dilutes training with repetitive patterns. SCIZOR leverages a self-supervised task progress predictor for suboptimal data to remove samples lacking task progression, and a deduplication module operating on joint state-action representation for samples with redundant patterns. Empirically, we show that SCIZOR enables imitation learning policies to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks. More information is available at: https://ut-austin-rpl.github.io/SCIZOR/
中文:SCIZOR是一种自监督数据筛选框架,通过剔除次优和冗余的状态-动作对来优化模仿学习数据集,使策略性能平均提升15.4%且所需数据更少。
English: SCIZOR is a self-supervised framework that curates imitation learning datasets by filtering out both suboptimal and redundant state-action pairs, enhancing policy performance by 15.4% on average with less data.
Authors:Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu
Abstract:
Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? What is the relationship between this organizational paradigm and feature disentanglement, as well as reconstruction performance? To address these questions, we propose SAEMA, which validates the stratified structure of the representation by observing how the rank of the symmetric semi-positive definite (SSPD) matrix, corresponding to the modal tensor unfolded along the latent tensor, varies with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations, demonstrating that they amplify inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene on the global representation from an optimization perspective, proving a significant causal relationship between its separability and the reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry and demonstrates the impact of changes in representational structure on reconstruction performance. In particular, it emphasizes the necessity of understanding representations and incorporating representational constraints, providing empirical references for developing new interpretability tools and improving SAEs. The code is available at https://github.com/wenjie1835/SAERepGeo.
中文摘要:本研究提出SAEMA分析稀疏自编码器如何组织语言模型表征,发现稀疏编码通过几何重构增强特征区分度,并直接影响重建性能。
English Summary: This study introduces SAEMA to analyze how sparse autoencoders organize language model representations, revealing that sparse encoding enhances feature distinctiveness through geometric restructuring and directly impacts reconstruction performance.
Authors:Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu
Abstract:
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
Authors:Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Abstract:
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3 %$\rightarrow$72.9 % on MathVista, 62.9 %$\rightarrow$68.7 % on We-Math), using standard dataset without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
中文: MM-UPT提出了一种基于GRPO和无监督自奖励机制的后训练框架,无需外部监督即可显著提升多模态大语言模型的推理能力,其效果甚至接近有监督方法。
English: MM-UPT introduces an unsupervised post-training framework using GRPO with a self-rewarding mechanism, significantly enhancing MLLMs' reasoning without external supervision and even approaching supervised method performance.
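The self-rewarding signal via majority voting can be sketched in a few lines; exact string matching of answers is an illustrative simplification of how agreement might be checked before the rewards are fed to the GRPO update.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Label-free self-reward: responses agreeing with the majority answer get 1.0,
    the rest 0.0 (standing in for a ground-truth check in the GRPO update)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Six sampled responses to the same question:
print(majority_vote_rewards(["12", "12", "13", "12", "7", "12"]))   # [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```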
Authors:Mattie Fellows, Clarisse Wibault, Uljad Berdica, Johannes Forkel, Michael A. Osborne, Jakob N. Foerster
Abstract:
Sample efficiency remains a major obstacle for real world adoption of reinforcement learning (RL): success has been limited to settings where simulators provide access to essentially unlimited environment interactions, which in reality are typically costly or dangerous to obtain. Offline RL in principle offers a solution by exploiting offline data to learn a near-optimal policy before deployment. In practice, however, current offline RL methods rely on extensive online interactions for hyperparameter tuning, and have no reliable bound on their initial online performance. To address these two issues, we introduce two algorithms. Firstly, SOReL: an algorithm for safe offline reinforcement learning. Using only offline data, our Bayesian approach infers a posterior over environment dynamics to obtain a reliable estimate of the online performance via the posterior predictive uncertainty. Crucially, all hyperparameters are also tuned fully offline. Secondly, we introduce TOReL: a tuning for offline reinforcement learning algorithm that extends our information rate based offline hyperparameter tuning methods to general offline RL approaches. Our empirical evaluation confirms SOReL's ability to accurately estimate regret in the Bayesian setting whilst TOReL's offline hyperparameter tuning achieves competitive performance with the best online hyperparameter tuning methods using only offline data. Thus, SOReL and TOReL make a significant step towards safe and reliable offline RL, unlocking the potential for RL in the real world. Our implementations are publicly available: https://github.com/CWibault/sorel_torel.
Chinese: 本文提出了SOReL和TOReL两种算法,通过实现安全的性能评估和完全离线的超参数调优,解决了离线强化学习中的关键限制,从而提升了强化学习在现实世界应用中的可行性。
English: The paper introduces two algorithms, SOReL and TOReL, which address key limitations in offline reinforcement learning by enabling safe performance estimation and fully offline hyperparameter tuning, thereby enhancing the practicality of RL for real-world applications.
Authors:Václav Voráček, Francesco Orabona
Abstract:
The construction of confidence intervals for the mean of a bounded random variable is a classical problem in statistics with numerous applications in machine learning and virtually all scientific fields. In particular, obtaining the tightest possible confidence intervals is vital every time the sampling of the random variables is expensive. The current state-of-the-art method to construct confidence intervals is by using betting algorithms. This is a very successful approach for deriving optimal confidence sequences, even matching the rate of the law of iterated logarithms. However, in the fixed horizon setting, these approaches are either sub-optimal or based on heuristic solutions with strong empirical performance but without a finite-time guarantee. Hence, no betting-based algorithm guaranteeing the optimal $\mathcal{O}\left(\sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{n}}\right)$ width of the confidence intervals is known. This work bridges this gap. We propose a betting-based algorithm to compute confidence intervals that empirically outperforms the competitors. Our betting strategy uses the optimal strategy in every step (in a certain sense), whereas the standard betting methods choose a constant strategy in advance. Leveraging this fact results in strict improvements even for classical concentration inequalities, such as the ones of Hoeffding or Bernstein. Moreover, we also prove that the width of our confidence intervals is optimal up to a $1+o(1)$ factor diminishing with $n$. The code is available at https://github.com/vvoracek/STaR-bets-confidence-interval.
Chinese: 本研究提出了一种基于博弈策略的新算法,用于构建有界随机变量的置信区间,在保证理论最优性的同时,其实际表现也优于现有方法。
English: This study introduces a novel betting-based algorithm that constructs optimal confidence intervals for bounded random variables, achieving theoretically guaranteed tightness and empirical superiority over existing methods.
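A generic betting-based confidence interval (wealth process plus Ville's inequality) can be sketched as below; the fixed, clipped bet fraction is deliberately naive, whereas the paper's contribution is a per-step optimal betting strategy with an optimality guarantee that this sketch does not implement.

```python
import numpy as np

def betting_ci(x: np.ndarray, delta: float = 0.05, lam: float = 0.5):
    """Confidence interval for the mean of [0, 1]-valued data from a betting wealth
    process and Ville's inequality, using a crude constant (clipped) bet fraction."""
    grid = np.linspace(1e-3, 1 - 1e-3, 999)     # candidate means m
    lw_up = np.zeros_like(grid)                 # wealth betting the mean exceeds m
    lw_dn = np.zeros_like(grid)                 # wealth betting the mean is below m
    for xi in x:
        b_up = np.minimum(lam, 0.5 / grid)          # keeps 1 + b*(x - m) > 0 on [0, 1]
        b_dn = np.minimum(lam, 0.5 / (1 - grid))    # keeps 1 - b*(x - m) > 0 on [0, 1]
        lw_up += np.log1p(b_up * (xi - grid))
        lw_dn += np.log1p(-b_dn * (xi - grid))
    keep = np.maximum(lw_up, lw_dn) < np.log(2.0 / delta)   # candidates not rejected
    return grid[keep].min(), grid[keep].max()

rng = np.random.default_rng(0)
print(betting_ci(rng.beta(2, 5, size=500)))     # an interval around the true mean 2/7
```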
Authors:Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Abstract:
Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model's activation space and the issue can be eased by intervening the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noises introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.
中文: 本文提出流形导向方法,通过将激活干预投影到低维流形上减少大型推理模型的计算开销,在数学和编程任务中最多减少71%的生成标记数,同时保持甚至提高准确性。
English: This paper introduces Manifold Steering, a method that reduces computational overhead in Large Reasoning Models by projecting activation interventions onto a low-dimensional manifold, achieving up to 71% fewer tokens while maintaining or improving accuracy across mathematical and coding tasks.
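The flavor of the approach can be sketched as follows: extract a mean-difference "overthinking" direction and project it onto a low-dimensional principal subspace of the activations before steering. The PCA basis, the number of components, and the steering coefficient are illustrative assumptions rather than the paper's manifold estimate.

```python
import numpy as np

def manifold_projected_steering(acts_overthink, acts_concise, k=8):
    """Project the mean-difference 'overthinking' direction onto the top-k principal
    subspace of the activations, so steering stays on a low-dimensional manifold."""
    direction = acts_overthink.mean(axis=0) - acts_concise.mean(axis=0)
    stacked = np.concatenate([acts_overthink, acts_concise], axis=0)
    stacked = stacked - stacked.mean(axis=0)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = vt[:k]                                   # (k, d) principal directions
    return basis.T @ (basis @ direction)             # steering vector on the manifold

# At inference one would subtract alpha * this vector from the residual-stream
# activations to damp overthinking; alpha is a tuning knob, not taken from the paper.
```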
Authors:Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, Hinrich Schütze
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making them the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model's representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, and instruction fine-tuning tasks, as well as 11 adversarial test sets, validate our theories. We hope that these results spark further research beyond the realms of well-established PEFT. The source code is available at https://github.com/misonsky/PEFTEval.
Chinese: 参数高效微调方法虽然在计算上更经济,但由于参数空间受限导致表示能力和鲁棒性不足,在复杂任务中表现不及全参数微调,理论和实验均证实了这一点。
English: Parameter-Efficient Fine-Tuning (PEFT) methods, while computationally economical, fall short of Full Fine-Tuning (FFT) in complex tasks due to limited representational capacity and robustness, as proven theoretically and validated across diverse datasets.
Authors:Shriram M S, Xinyue Hao, Shihao Hou, Yang Lu, Laura Sevilla-Lara, Anurag Arnab, Shreyank N Gowda
Abstract:
The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, simple enough to implement and use that they could become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. These savings do not come at any cost in accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision
Chinese: 本文提出的渐进式数据丢弃训练方法,在无需改变模型结构或优化器的前提下,显著减少了所需训练轮次,同时保持甚至提升了模型精度。
English: The paper introduces Progressive Data Dropout, a training method that significantly reduces the number of epochs needed while maintaining or even improving accuracy, without requiring changes to model architecture or optimizer.
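A hypothetical instance of progressive data dropout is sketched below, where after each epoch only a fraction of the hardest remaining samples is kept for the next pass; the keep fraction and the hardness criterion are assumptions, not the paper's schedule.

```python
import numpy as np

def progressive_pools(losses_by_epoch, keep_frac=0.6):
    """After each epoch, keep only the `keep_frac` hardest (highest-loss) samples
    among those still in the pool, so later epochs see progressively less data."""
    pool = np.arange(len(losses_by_epoch[0]))
    pools = [pool]
    for losses in losses_by_epoch:
        k = max(1, int(len(pool) * keep_frac))
        pool = pool[np.argsort(losses[pool])[-k:]]   # hardest k survive to the next epoch
        pools.append(pool)
    return pools
```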
Authors:Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Abstract:
Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
中文: 本研究表明多模态大语言模型中的自我修正模式在强化学习训练前就已存在,并提出结合监督微调与强化学习的两阶段方法,在推理基准测试中实现了最优性能。
English: This study demonstrates that self-correction patterns in multimodal LLMs exist before RL training and proposes a two-stage approach combining supervised fine-tuning and reinforcement learning, achieving state-of-the-art performance on reasoning benchmarks.
Authors:Haosheng Zou, Xiaowei Lv, Shousheng Jia, Xiangzheng Zhang
Abstract:
Adding sequence parallelism to LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and is used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies' training frameworks. This technical report delves deeper into the different sequence-parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
Chinese: 我们开源了集成序列并行技术的360-LLaMA-Factory,该框架已在多个模型及企业训练系统中获得广泛应用。
English: We have open-sourced 360-LLaMA-Factory with sequence parallelism, which has been widely adopted in various models and corporate training frameworks.
Authors:Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
Abstract:
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
中文总结:推测解码与量化相结合可加速大语言模型推理,但实验发现4位量化的内存优势会被推测解码的计算负载抵消,因此提出分层框架,通过小模型将树状草案转为序列草案,显著提升量化模型性能。
English Summary: Speculative decoding and quantization are combined to accelerate large language model inference, but their integration reveals that 4-bit quantization's memory benefits are offset by computational overhead, leading to a new hierarchical framework that achieves significant speedup improvements.
Authors:Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Abstract:
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to \textit{redundant} attention computations: while attention weights are often \textit{sparse}, all tokens consume \textit{equal} computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a \textit{supervised learning task}, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a \textit{group coding strategy}, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose \textit{Dynamic Group Attention} (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
中文: Transformer架构的大型语言模型因冗余注意力计算存在效率问题,本文提出的动态分组注意力(DGA)方法通过聚合次要标记显著降低计算成本,同时保持模型性能竞争力。
English: Transformer-based LLMs face computational inefficiency from redundant attention computations, which the proposed Dynamic Group Attention (DGA) method addresses by aggregating less important tokens to reduce costs while maintaining performance.
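As a rough illustration of the token-aggregation idea described above (not the paper's actual DGA algorithm), the hedged sketch below keeps only the most-attended keys exactly and merges the remaining keys and values into a single averaged group token before attention. The importance score and the fixed `keep` budget are simplifying assumptions.

```python
# Toy sketch of group aggregation: keys/values that receive little attention
# are merged into one "group" token, shrinking the effective sequence length.
import torch

def grouped_attention(q, k, v, keep=16):
    # q, k, v: (seq, dim)
    d = q.shape[-1]
    scores = q @ k.t() / d ** 0.5                     # (seq_q, seq_k)
    importance = scores.softmax(dim=-1).mean(dim=0)   # how much each key is attended to
    top = torch.topk(importance, keep).indices
    rest = torch.ones(k.shape[0], dtype=torch.bool)
    rest[top] = False

    k_small = torch.cat([k[top], k[rest].mean(dim=0, keepdim=True)], dim=0)
    v_small = torch.cat([v[top], v[rest].mean(dim=0, keepdim=True)], dim=0)

    attn = (q @ k_small.t() / d ** 0.5).softmax(dim=-1)
    return attn @ v_small

q, k, v = (torch.randn(128, 64) for _ in range(3))
print(grouped_attention(q, k, v).shape)   # torch.Size([128, 64])
```

The final attention operates over `keep + 1` keys instead of the full sequence, which is where the computational saving comes from in this simplified picture.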
Authors:Shun Sato, Issei Sato
Abstract:
Symbolic regression aims to discover mathematical equations that fit given numerical data. It has been applied in various fields of scientific research, such as producing human-readable expressions that explain physical phenomena. Recently, Neural symbolic regression (NSR) methods that involve Transformers pre-trained on large-scale synthetic datasets have gained attention. While these methods offer advantages such as short inference time, they suffer from low performance, particularly when the number of input variables is large. In this study, we hypothesized that this limitation stems from the memorization bias of Transformers in symbolic regression. We conducted a quantitative evaluation of this bias in Transformers using a synthetic dataset and found that Transformers rarely generate expressions not present in the training data. Additional theoretical analysis reveals that this bias arises from the Transformer's inability to construct expressions compositionally while verifying their numerical validity. We finally examined if tailoring test-time strategies can lead to reduced memorization bias and better performance. We empirically demonstrate that providing additional information to the model at test time can significantly mitigate memorization bias. On the other hand, we also find that reducing memorization bias does not necessarily correlate with improved performance. These findings contribute to a deeper understanding of the limitations of NSR approaches and offer a foundation for designing more robust, generalizable symbolic regression methods. Code is available at https://github.com/Shun-0922/Mem-Bias-NSR .
Chinese: 本研究识别并量化了神经符号回归中Transformer的记忆偏差,发现其难以生成训练数据外的新表达式,且测试时提供额外信息虽可减轻偏差,但未必提升性能。
English: This study identifies and quantifies the memorization bias in Transformers used for neural symbolic regression, revealing their limited ability to generate novel expressions and showing that while test-time information can reduce bias, it doesn't always improve performance.
Authors:Zi-Hao Zhou, Jun-Jie Wang, Tong Wei, Min-Ling Zhang
Abstract:
Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of ``continuous semantic similarity'' to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at https://github.com/Speechless-10308/WSC.
中文: 本文提出了一种基于图的弱监督对比学习框架,通过连续语义相似性替代不可靠的类别标签来定义正负样本对,在噪声标签和部分标签场景中验证了有效性,并提供了理论保证。
English: This paper introduces a graph-based framework for weakly-supervised contrastive learning that uses continuous semantic similarity instead of unreliable class labels to define positive and negative pairs, demonstrating effectiveness in noisy and partial label settings while providing theoretical guarantees.
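To make the role of the graph weights concrete, here is a minimal, hypothetical PyTorch sketch of a contrastive objective in which continuous pairwise similarities w_ij replace hard positive/negative labels; how those weights are iteratively refined from weak supervision is left abstract, and here they are simply given.

```python
# Sketch of a weakly-supervised contrastive loss weighted by a similarity graph.
# w[i, j] in [0, 1] plays the role of "how positive" the pair (i, j) is.
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(z, w, tau=0.1):
    # z: (n, d) embeddings, w: (n, n) similarity weights; the diagonal is ignored
    z = F.normalize(z, dim=-1)
    logits = z @ z.t() / tau
    mask = ~torch.eye(len(z), dtype=torch.bool)
    log_prob = logits - torch.logsumexp(logits.masked_fill(~mask, -1e9), dim=1, keepdim=True)
    w = w * mask                                      # drop self-pairs
    return -(w * log_prob).sum(dim=1).div(w.sum(dim=1).clamp_min(1e-8)).mean()

z = torch.randn(8, 32)
w = torch.rand(8, 8)
print(weighted_contrastive_loss(z, w).item())
```

With w restricted to 0/1 values given by clean class labels, this reduces to the usual supervised contrastive loss, which matches the paper's claim that the framework approximates supervised contrastive learning under mild conditions.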
Authors:Yu-Heng Hung, Kai-Jie Lin, Yu-Heng Lin, Chien-Yi Wang, Cheng Sun, Ping-Chun Hsieh
Abstract:
Bayesian optimization (BO) offers an efficient pipeline for optimizing black-box functions with the help of a Gaussian process prior and an acquisition function (AF). Recently, in the context of single-objective BO, learning-based AFs have witnessed promising empirical results given their favorable non-myopic nature. Despite this, the direct extension of these approaches to multi-objective Bayesian optimization (MOBO) suffers from the \textit{hypervolume identifiability issue}, which results from the non-Markovian nature of MOBO problems. To tackle this, inspired by the non-Markovian RL literature and the success of Transformers in language modeling, we present a generalized deep Q-learning framework and propose \textit{BOFormer}, which substantiates this framework for MOBO via sequence modeling. Through extensive evaluation, we demonstrate that BOFormer consistently outperforms the benchmark rule-based and learning-based algorithms in various synthetic MOBO and real-world multi-objective hyperparameter optimization problems. We have made the source code publicly available to encourage further research in this direction.
Authors:Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
Abstract:
We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values, guided by the triangle sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
Authors:Ruijie Li, Xiang Zhao, Qiao Ning, Shikai Guo
Abstract:
In tennis tournaments, momentum, a critical yet elusive phenomenon, reflects the dynamic shifts in performance of athletes that can decisively influence match outcomes. Despite its significance, momentum in terms of effective modeling and multi-granularity analysis across points, games, sets, and matches in tennis tournaments remains underexplored. In this study, we define a novel Momentum Score (MS) metric to quantify a player's momentum level in multi-granularity tennis tournaments, and design HydraNet, a momentum-driven state-space duality-based framework, to model MS by integrating thirty-two heterogeneous dimensions of athletes' performance in serve, return, psychology and fatigue. HydraNet integrates a Hydra module, which builds upon a state-space duality (SSD) framework, capturing explicit momentum with a sliding-window mechanism and implicit momentum through cross-game state propagation. It also introduces a novel Versus Learning method to better enhance the adversarial nature of momentum between the two athletes at a macro level, along with a Collaborative-Adversarial Attention Mechanism (CAAM) for capturing and integrating intra-player and inter-player dynamic momentum at a micro level. Additionally, we construct a million-scale tennis cross-tournament dataset spanning the 2012-2023 Wimbledon and 2013-2023 US Open tournaments, and validate the multi-granularity modeling capability of HydraNet for the MS metric on this dataset. Extensive experimental evaluations demonstrate that the MS metric constructed by the HydraNet framework provides actionable insights into how momentum impacts outcomes at different granularities, establishing a new foundation for momentum modeling and sports analysis. To the best of our knowledge, this is the first work to explore and effectively model momentum across multiple granularities in professional tennis tournaments.
Chinese: 本研究首次在职业网球赛事中引入动量评分指标和HydraNet框架,通过多粒度分析和对抗学习机制,有效量化并建模了运动员在比赛不同层级的动量变化,为体育分析建立了新基础。
English: This study introduces a novel Momentum Score (MS) metric and the HydraNet framework to quantify and model momentum across multiple granularities in tennis tournaments, using a comprehensive dataset and advanced mechanisms to capture both explicit and implicit momentum dynamics.
Authors:Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Abstract:
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .
中文: 本文提出了FRAMES-VQA新基准,用于评估视觉问答任务中针对多模态分布变化的鲁棒微调方法,通过将数据集分类为分布内和分布外类型并分析模态交互作用,为开发更鲁棒的微调方法提供指导。
English: This paper introduces FRAMES-VQA, a new benchmark for evaluating robust fine-tuning in visual question answering across multi-modal distribution shifts, categorizing datasets into in-distribution and out-of-distribution types while analyzing modality interactions to guide future method development.
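One ingredient of the analysis above, quantifying distribution shift via the Mahalanobis distance between embedding sets, can be sketched in a few lines of NumPy. The random embeddings and the averaging over the evaluation set below are stand-in assumptions, not the benchmark's exact protocol.

```python
# Mahalanobis-distance probe: how far does an evaluation set drift from the
# in-distribution (ID) embedding distribution?
import numpy as np

def mahalanobis_shift(id_emb, eval_emb, eps=1e-6):
    mu = id_emb.mean(axis=0)
    cov = np.cov(id_emb, rowvar=False) + eps * np.eye(id_emb.shape[1])
    prec = np.linalg.inv(cov)
    diff = eval_emb - mu
    d2 = np.einsum("nd,dk,nk->n", diff, prec, diff)
    return np.sqrt(d2).mean()            # average distance of the evaluation set

rng = np.random.default_rng(0)
id_emb = rng.normal(size=(1000, 16))
near = rng.normal(loc=0.5, size=(200, 16))   # mild shift
far = rng.normal(loc=3.0, size=(200, 16))    # strong shift
print(mahalanobis_shift(id_emb, near), mahalanobis_shift(id_emb, far))
```

Larger average distances correspond to "far OOD" evaluation sets in the terminology used above; the same computation can be run on uni-modal or fused multi-modal embeddings to compare shift types.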
Authors:Owen Oertell, Shikun Sun, Yiding Chen, Jin Peng Zhou, Zhiyong Wang, Wen Sun
Abstract:
The controllable generation of diffusion models aims to steer the model to generate samples that optimize some given objective functions. It is desirable for a variety of applications including image generation, molecule generation, and DNA/sequence generation. Reinforcement Learning (RL) based fine-tuning of the base model is a popular approach but it can overfit the reward function while requiring significant resources. We frame controllable generation as a problem of finding a distribution that optimizes a KL-regularized objective function. We present SLCD -- Supervised Learning based Controllable Diffusion, which iteratively generates online data and trains a small classifier to guide the generation of the diffusion model. Similar to the standard classifier-guided diffusion, SLCD's key computation primitive is classification and does not involve any complex concepts from RL or control. Via a reduction to no-regret online learning analysis, we show that under KL divergence, the output from SLCD provably converges to the optimal solution of the KL-regularized objective. Further, we empirically demonstrate that SLCD can generate high quality samples with nearly the same inference time as the base model in both image generation with continuous diffusion and biological sequence generation with discrete diffusion. Our code is available at https://github.com/Owen-Oertell/slcd
中文摘要:SLCD是一种基于监督学习的可控扩散模型方法,通过迭代生成在线数据并训练小型分类器来引导生成过程,在理论上保证收敛的同时保持与基础模型相近的快速推理速度。
English Summary: SLCD is a supervised learning method for controllable diffusion models that uses iterative online data generation and classifier training to guide sampling, achieving provable convergence and maintaining fast inference times comparable to the base model.
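Since SLCD's key computational primitive is classification-based guidance, the following toy PyTorch sketch shows a generic classifier-guided denoising step: the classifier's gradient with respect to the noisy sample nudges the update toward the target class. The networks, schedule, and step sizes are illustrative placeholders, not the SLCD procedure itself.

```python
# Generic classifier-guided update on toy networks; illustrates the primitive
# (classification gradients steering generation) rather than SLCD itself.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    def forward(self, x, t):
        return self.net(x)                       # stands in for predicted noise

class ToyClassifier(nn.Module):
    def __init__(self, dim, n_classes=10):
        super().__init__()
        self.net = nn.Linear(dim, n_classes)
    def forward(self, x, t):
        return self.net(x)

def guided_step(x, t, denoiser, classifier, target, scale=1.0, step=0.1):
    x = x.detach().requires_grad_(True)
    log_p = classifier(x, t).log_softmax(dim=-1)[:, target].sum()
    grad = torch.autograd.grad(log_p, x)[0]      # d log p(y | x_t) / d x_t
    eps = denoiser(x, t)
    return (x - step * eps + scale * step * grad).detach()

dim = 8
x = torch.randn(4, dim)
den, clf = ToyDenoiser(dim), ToyClassifier(dim)
for t in reversed(range(10)):
    x = guided_step(x, t, den, clf, target=3)
print(x.shape)
```

In SLCD the classifier is retrained on freshly generated online data at each iteration, which is what the abstract's no-regret analysis covers; that outer loop is omitted here.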
Authors:Zhengyuan Jiang, Moyang Guo, Kecen Li, Yuepeng Hu, Yupu Wang, Zhicong Huang, Cheng Hong, Neil Zhenqiang Gong
Abstract:
The rapid development of video generative models has led to a surge in highly realistic synthetic videos, raising ethical concerns related to disinformation and copyright infringement. Recently, video watermarking has been proposed as a mitigation strategy by embedding invisible marks into AI-generated videos to enable subsequent detection. However, the robustness of existing video watermarking methods against both common and adversarial perturbations remains underexplored. In this work, we introduce VideoMarkBench, the first systematic benchmark designed to evaluate the robustness of video watermarks under watermark removal and watermark forgery attacks. Our study encompasses a unified dataset generated by three state-of-the-art video generative models, across three video styles, incorporating four watermarking methods and seven aggregation strategies used during detection. We comprehensively evaluate 12 types of perturbations under white-box, black-box, and no-box threat models. Our findings reveal significant vulnerabilities in current watermarking approaches and highlight the urgent need for more robust solutions. Our code is available at https://github.com/zhengyuan-jiang/VideoMarkBench.
Chinese: 本文提出了首个系统性评估视频水印在去除和伪造攻击下鲁棒性的基准VideoMarkBench,揭示了现有方法的显著脆弱性,并强调了开发更强健解决方案的迫切性。
English: This paper introduces VideoMarkBench, the first benchmark to systematically evaluate the vulnerability of video watermarks against removal and forgery attacks, revealing significant weaknesses in current methods and underscoring the need for more robust solutions.
Authors:Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Abstract:
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
中文: Roads to Rome (R2R)方法通过神经令牌路由技术,仅将关键推理令牌交由大模型处理,其余生成任务由小模型承担,在保持相当性能的同时实现2.8倍加速,显著提升了推理效率边界。
English: The Roads to Rome (R2R) method selectively routes only critical reasoning tokens to large language models while delegating most generation to small models, achieving comparable performance with 2.8x speedup and advancing efficiency frontiers.
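The routing loop described above can be caricatured in a few lines: a small model drafts every token, and a router escalates only the tokens it flags as path-divergent to the large model. Everything below (the toy vocabulary, the random "models", the 10% escalation rate) is a hypothetical stand-in for the trained components.

```python
# Schematic R2R-style routing loop: the SLM proposes each token; a lightweight
# router decides whether the LLM should regenerate it. All components are toys.
import random

random.seed(0)
VOCAB = list(range(100))

def slm_next(ctx):   return random.choice(VOCAB)          # cheap draft model
def llm_next(ctx):   return random.choice(VOCAB)          # expensive expert model
def router_is_divergent(ctx, tok):                        # stand-in for the trained router
    return random.random() < 0.1                          # assume ~10% of tokens escalate

ctx, llm_calls = [1, 2, 3], 0
for _ in range(50):
    tok = slm_next(ctx)
    if router_is_divergent(ctx, tok):
        tok = llm_next(ctx)                                # only critical tokens hit the LLM
        llm_calls += 1
    ctx.append(tok)

print(f"generated {len(ctx) - 3} tokens, LLM used for {llm_calls} of them")
```

The average activated parameter count reported above follows from this structure: most tokens cost only the small model, and the large model is charged only for the escalated fraction.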
Authors:Shreyas Gururaj, Lars Grüne, Wojciech Samek, Sebastian Lapuschkin, Leander Weber
Abstract:
Overfitting is a well-known issue extending even to state-of-the-art (SOTA) Machine Learning (ML) models, resulting in reduced generalization, and a significant train-test performance gap. Mitigation measures include a combination of dropout, data augmentation, weight decay, and other regularization techniques. Among the various data augmentation strategies, occlusion is a prominent technique that typically focuses on randomly masking regions of the input during training. Most of the existing literature emphasizes randomness in selecting and modifying the input features instead of regions that strongly influence model decisions. We propose Relevance-driven Input Dropout (RelDrop), a novel data augmentation method which selectively occludes the most relevant regions of the input, nudging the model to use other important features in the prediction process, thus improving model generalization through informed regularization. We further conduct qualitative and quantitative analyses to study how Relevance-driven Input Dropout (RelDrop) affects model decision-making. Through a series of experiments on benchmark datasets, we demonstrate that our approach improves robustness towards occlusion, results in models utilizing more features within the region of interest, and boosts inference time generalization performance. Our code is available at https://github.com/Shreyas-Gururaj/LRP_Relevance_Dropout.
Chinese: 针对机器学习模型中的过拟合问题,RelDrop作为一种新颖的数据增强方法,通过选择性遮挡输入中最相关的区域,促使模型利用其他重要特征,从而提升泛化能力和鲁棒性。
English: Overfitting in machine learning models can be mitigated by RelDrop, a novel data augmentation technique that selectively occludes the most relevant input regions to encourage the use of other important features, thereby enhancing generalization and robustness.
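A minimal sketch of relevance-driven occlusion, assuming a per-pixel relevance map is already available from some attribution method (the abstract builds on LRP): the most relevant fraction of each image is zeroed out so the model must rely on other features. The random relevance maps and the top-fraction masking rule below are illustrative assumptions.

```python
# Occlude the most relevant input regions so training relies on other features.
import torch

def relevance_dropout(images, relevance, drop_frac=0.1):
    # images: (B, C, H, W); relevance: (B, H, W) non-negative attribution scores
    b, _, h, w = images.shape
    flat = relevance.view(b, -1)
    k = int(drop_frac * h * w)
    thresh = flat.topk(k, dim=1).values[:, -1:]            # per-image cut-off score
    mask = (flat < thresh).float().view(b, 1, h, w)        # 0 where relevance is highest
    return images * mask

imgs = torch.randn(4, 3, 32, 32)
rel = torch.rand(4, 32, 32)
occluded = relevance_dropout(imgs, rel)
print((occluded == 0).float().mean().item())               # roughly drop_frac
```

Compared with random occlusion, the masked region here is chosen by the model's own attributions, which is the "informed regularization" the abstract refers to.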
Authors:Carina Newen, Luca Hinkamp, Maria Ntonti, Emmanuel Müller
Abstract:
From uncertainty quantification to real-world object detection, we recognize the importance of machine learning algorithms, particularly in safety-critical domains such as autonomous driving or medical diagnostics. Ambiguous data plays an important role across many of these domains. Optical illusions present a compelling area of study in this context, as they offer insight into the limitations of both human and machine perception. Despite this relevance, optical illusion datasets remain scarce. In this work, we introduce a novel dataset of optical illusions featuring intermingled animal pairs designed to evoke perceptual ambiguity. We identify generalizable visual concepts, particularly gaze direction and eye cues, as subtle yet impactful features that significantly influence model accuracy. By confronting models with perceptual ambiguity, our findings underscore the importance of concepts in visual learning and provide a foundation for studying bias and alignment between human and machine vision. To make this dataset useful for general purposes, we generate optical illusions systematically with the different concepts discussed in our bias mitigation section. The dataset is accessible on Kaggle at https://kaggle.com/datasets/693bf7c6dd2cb45c8a863f9177350c8f9849a9508e9d50526e2ffcc5559a8333. Our source code can be found at https://github.com/KDD-OpenSource/Ambivision.git.
中文: 本研究引入了一个新颖的视觉错觉数据集,旨在探究机器学习中的感知模糊性,揭示了如注视方向等细微视觉线索对模型准确性和偏差缓解的重要影响。
English: This study introduces a novel dataset of optical illusions to explore perceptual ambiguity in machine learning, highlighting how subtle visual cues like gaze direction impact model accuracy and bias mitigation.
Authors:Jiawei Tang, Yuheng Jia
Abstract:
Label distribution learning (LDL) is an effective method to predict the relative label description degree (a.k.a. label distribution) of a sample. However, the label distribution is not a complete representation of an instance because it overlooks the absolute intensity of each label. Specifically, it is impossible to obtain the total description degree of hidden labels that are not in the label space, which leads to information loss and confusion among instances. To solve this problem, we introduce a new concept, named background concentration, to serve as the absolute description degree term of the label distribution, and incorporate it into the LDL process, forming the improved paradigm of concentration distribution learning. Moreover, we propose a novel model based on probabilistic methods and neural networks to learn label distributions and background concentrations from existing LDL datasets. Extensive experiments prove that the proposed approach is able to extract background concentrations from label distributions while producing more accurate prediction results than state-of-the-art LDL methods. The code is available at https://github.com/seutjw/CDL-LD.
中文: 标签分布学习通过引入背景浓度的概念,解决了标签绝对强度缺失的问题,提出了一种结合概率方法和神经网络的新模型,有效提升了预测精度。
English: Label distribution learning is enhanced by introducing the concept of background concentration to address the limitation of missing absolute label intensity, leading to a new model that improves prediction accuracy through probabilistic and neural network methods.
Authors:Yao Lu, Tengfei Ma, Zeyu Wang, Zhuangzhi Chen, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui
Abstract:
With the rapid development of wireless communications and the growing complexity of digital modulation schemes, traditional manual modulation recognition methods struggle to extract reliable signal features and meet real-time requirements in modern scenarios. Recently, deep learning based Automatic Modulation Recognition (AMR) approaches have greatly improved classification accuracy. However, their large model sizes and high computational demands hinder deployment on resource-constrained devices. Model pruning provides a general approach to reduce model complexity, but existing weight, channel, and layer pruning techniques each present a trade-off between compression rate, hardware acceleration, and accuracy preservation. To this end, in this paper, we introduce FCOS, a novel Fine-to-COarse two-Stage pruning framework that combines channel-level pruning with layer-level collapse diagnosis to achieve extreme compression, high performance and efficient inference. In the first stage of FCOS, hierarchical clustering and parameter fusion are applied to channel weights to achieve channel-level pruning. Then a Layer Collapse Diagnosis (LaCD) module uses linear probing to identify layer collapse and removes the collapsed layers due to high channel compression ratio. Experiments on multiple AMR benchmarks demonstrate that FCOS outperforms existing channel and layer pruning methods. Specifically, FCOS achieves 95.51% FLOPs reduction and 95.31% parameter reduction while still maintaining performance close to the original ResNet56, with only a 0.46% drop in accuracy on Sig2019-12. Code is available at https://github.com/yaolu-zjut/FCOS.
中文:FCOS框架提出了一种两阶段剪枝方法,通过通道级剪枝与层坍塌诊断相结合,在保持高精度的同时实现极致模型压缩,在自动调制识别基准测试中显著优于现有方法。
English: The FCOS framework introduces a two-stage pruning method combining channel-level pruning with layer collapse diagnosis to achieve extreme model compression while maintaining high accuracy, significantly outperforming existing methods on automatic modulation recognition benchmarks.
Authors:Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Abstract:
Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: https://github.com/chikap421/catlvdm
中文: CAT-LVDM提出首个针对隐视频扩散模型的抗干扰训练框架,通过结构化的数据对齐噪声注入增强模型鲁棒性,在多个数据集上有效提升时间一致性并减少语义偏移。
English: CAT-LVDM introduces a corruption-aware training framework that enhances the robustness of Latent Video Diffusion Models against imperfect conditioning through structured noise injection, significantly improving temporal consistency and reducing semantic drift across multiple datasets.
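The batch-centered noise injection idea can be illustrated with a small, hypothetical PyTorch function that perturbs conditioning embeddings along intra-batch directions (here, toward the batch mean) instead of with isotropic Gaussian noise. The direction construction and scale are assumptions for illustration, not the paper's exact BCNI formulation.

```python
# Structured, data-aligned noise: perturb each embedding within the semantic
# span of its own batch rather than with isotropic noise.
import torch

def batch_centered_noise(emb, sigma=0.1):
    # emb: (B, D) conditioning embeddings for one batch
    center = emb.mean(dim=0, keepdim=True)
    direction = center - emb                     # intra-batch semantic direction
    noise = sigma * torch.rand(emb.shape[0], 1) * direction
    return emb + noise

emb = torch.randn(8, 512)
print(batch_centered_noise(emb).shape)           # torch.Size([8, 512])
```

The point of such structured corruption, per the abstract, is that the perturbed conditioning stays low-rank and data-aligned, which is what the ablations credit for the robustness gains.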
Authors:Thalles Silva, Helio Pedrini, Adín Ramírez Rivera
Abstract:
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
中文: 本文提出自组织视觉原型(SOP)方法,通过多个互补的支持嵌入替代单一原型来更好表征数据簇,在多项基准测试中实现了最先进的性能。
English: This paper introduces Self-Organizing Visual Prototypes (SOP), an unsupervised visual feature learning method that uses multiple complementary support embeddings instead of single prototypes to better characterize data clusters and achieve state-of-the-art performance on various benchmarks.
Authors:Mokai Pan, Kaizhen Zhu, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Abstract:
Diffusion Bridges enable transitions between arbitrary distributions, with the Unified Diffusion Bridge (UniDB) framework achieving high-fidelity image generation via a Stochastic Optimal Control (SOC) formulation. However, UniDB's reliance on iterative Euler sampling methods results in slow, computationally expensive inference, while existing acceleration techniques for diffusion or diffusion bridge models fail to address its unique challenges: missing terminal mean constraints and SOC-specific penalty coefficients in its SDEs. We present UniDB++, a training-free sampling algorithm that significantly improves upon these limitations. The method's key advancement comes from deriving exact closed-form solutions for UniDB's reverse-time SDEs, effectively reducing the error accumulation inherent in Euler approximations and enabling high-quality generation with up to 20$\times$ fewer sampling steps. This method is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes (5-10 steps). Additionally, we demonstrate that UniDB++ aligns with existing diffusion bridge acceleration methods by evaluating their update rules, and UniDB++ can recover DBIMs as special cases under some theoretical conditions. Experiments demonstrate UniDB++'s state-of-the-art performance in image restoration tasks, outperforming Euler-based methods in fidelity and speed while reducing inference time significantly. This work bridges the gap between theoretical generality and practical efficiency in SOC-driven diffusion bridge models. Our code is available at https://github.com/2769433owo/UniDB-plusplus.
中文摘要:UniDB++ 作为一种免训练采样算法,通过推导 UniDB 随机微分方程的精确闭式解,解决了原有模型推理缓慢的问题,在保持感知质量的同时实现了20倍加速的高质量图像生成,并在图像修复任务中达到领先性能。
English Summary: UniDB++ is a training-free sampling algorithm that overcomes UniDB's slow inference by deriving exact closed-form solutions for its SDEs, enabling 20× faster high-quality image generation while maintaining perceptual quality through data prediction and SDE-corrector mechanisms.
Authors:Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham
Abstract:
Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations and demonstrating high capabilities in retrieval tasks, generation, arithmetic, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at https://github.com/shaham-lab/SUE.
中文: 本研究证明,通过单模态表示的光谱嵌入,可以主要从未配对数据中有效学习共享的多模态表示,在各种跨模态任务中表现出色,并建立了一种可能通用的嵌入方法。
English: This research demonstrates that shared multimodal representations can be effectively learned primarily from unpaired data using spectral embeddings from unimodal representations, achieving strong performance across various cross-modal tasks and establishing a potentially universal embedding method.
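The spectral building block mentioned above, an embedding derived from the random-walk matrix of a single modality's affinity graph, can be sketched as follows; the Gaussian-kernel affinity, the number of eigenvectors kept, and the idea of computing one such embedding independently per modality are simplifying assumptions.

```python
# Spectral embedding of a row-stochastic (random-walk) affinity matrix for one
# modality; the paper's cross-modal alignment step is not shown here.
import numpy as np

def spectral_embedding(X, n_components=4, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))                      # Gaussian affinities
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)                    # random-walk matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    return vecs[:, order[1:n_components + 1]].real          # skip the trivial constant mode

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                               # unimodal features
print(spectral_embedding(X).shape)                          # (60, 4)
```

Computing one such embedding per modality from unpaired samples and then relating their spectral coordinates is the kind of construction the abstract's argument rests on.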
Authors:Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
Abstract:
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcomes is challenging, and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.
中文: 本文提出UI-Genie自改进框架,通过专用奖励模型和自动化数据生成解决GUI智能体的两大难题,在无需人工标注的情况下实现了最优性能。
English: This paper presents UI-Genie, a self-improving framework that tackles GUI agent challenges through a specialized reward model and automated data generation, achieving state-of-the-art results without manual annotation.
Authors:Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
Abstract:
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
Chinese: 提出的VeriFree方法通过直接最大化生成参考答案的概率,绕过了强化学习中的答案验证环节,在保持与验证器方法相当甚至更优性能的同时,显著提升了实用性和计算效率。
English: The proposed VeriFree method eliminates the need for answer verification in reinforcement learning by directly maximizing the probability of generating reference answers, demonstrating comparable or superior performance to verifier-based approaches while offering significant practical and computational advantages.
Authors:Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Abstract:
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
中文: 本文提出一种方法,通过在去噪过程中预测并优化与初始噪声对齐的空间布局,有效防止文本到图像扩散模型中的主体混淆,提升多主体生成的准确性和稳定性,同时保持模型原有的多样性。
English: This paper introduces a method that predicts and refines a spatial layout aligned with the initial noise during denoising to prevent subject leakage and enhance multi-subject generation in text-to-image diffusion models, improving alignment and stability while preserving diversity.
Authors:Jingyuan Huang, Xi Zhu, Minghao Guo, Yongfeng Zhang
Abstract:
Web 2.0 social platforms are inherently centralized, with user data and algorithmic decisions controlled by the platform. However, users can only passively receive social predictions without being able to choose the underlying algorithm, which limits personalization. Fortunately, with the emergence of blockchain, users are allowed to choose algorithms that are tailored to their local situation, improving prediction results in a personalized way. In a blockchain environment, each user possesses their own model to perform the social prediction, capturing different perspectives on social interactions. In our work, we propose DeSocial, a decentralized social network learning framework deployed on an Ethereum (ETH) local development chain that integrates distributed data storage, node-level consensus, and user-driven model selection through Ganache. In the first stage, each user leverages DeSocial to evaluate multiple backbone models on their local subgraph. DeSocial coordinates the execution and returns model-wise prediction results, enabling the user to select the most suitable backbone for personalized social prediction. Then, DeSocial uniformly selects several validation nodes that possess the algorithm specified by each user, and aggregates the prediction results by majority voting, to prevent errors caused by any single model's misjudgment. Extensive experiments show that DeSocial yields a clear improvement over five classical centralized social network learning models, promoting user empowerment in blockchain-based decentralized social networks and demonstrating the importance of multi-node validation and personalized algorithm selection on blockchain.
中文摘要:DeSocial是以太坊上的去中心化框架,通过本地模型评估和多节点验证让用户选择个性化社交预测算法,其性能优于中心化模型。
English Summary: DeSocial is a decentralized framework on Ethereum that enables users to select personalized algorithms for social predictions through local model evaluation and multi-node validation, outperforming centralized models.
Authors:Qi Yu, Zhichen Zeng, Yuchen Yan, Zhining Liu, Baoyu Jing, Ruizhong Qiu, Ariful Azad, Hanghang Tong
Abstract:
Network alignment (NA) aims to identify node correspondence across different networks and serves as a critical cornerstone behind various downstream multi-network learning tasks. Despite growing research in NA, there lacks a comprehensive library that facilitates the systematic development and benchmarking of NA methods. In this work, we introduce PLANETALIGN, a comprehensive Python library for network alignment that features a rich collection of built-in datasets, methods, and evaluation pipelines with easy-to-use APIs. Specifically, PLANETALIGN integrates 18 datasets and 14 NA methods with extensible APIs for easy use and development of NA methods. Our standardized evaluation pipeline encompasses a wide range of metrics, enabling a systematic assessment of the effectiveness, scalability, and robustness of NA methods. Through extensive comparative studies, we reveal practical insights into the strengths and limitations of existing NA methods. We hope that PLANETALIGN can foster a deeper understanding of the NA problem and facilitate the development and benchmarking of more effective, scalable, and robust methods in the future. The source code of PLANETALIGN is available at https://github.com/yq-leo/PlanetAlign.
中文: 网络对齐旨在识别不同网络间的节点对应关系,而PLANETALIGN库作为一个全面的Python工具包,集成了丰富的数据集、方法和评估流程,以促进该领域方法的开发与性能评估。
English: Network alignment is a fundamental task for identifying node correspondences across networks, and the PLANETALIGN library provides a comprehensive Python toolkit with extensive datasets, methods, and evaluation pipelines to advance its development and benchmarking.
Authors:James Oldfield, Shawn Im, Yixuan Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos
Abstract:
Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
中文: 多层感知机在大型语言模型中难以解释,但提出的解码器混合方法通过层级稀疏性克服了精度权衡,在保持表达能力的同时实现了更优的性能和可解释性。
English: Multilayer perceptrons in large language models are challenging to interpret, but the proposed Mixture of Decoders (MxDs) overcomes accuracy trade-offs through layer-level sparsity, achieving superior performance and interpretability while preserving expressive capacity.
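To make the layer-level-sparsity idea concrete, here is a toy PyTorch mixture of linear decoder sublayers with top-k gating. The tensor factorization that makes the real MxD layer parameter-efficient is deliberately omitted, so this is only an illustration of the sparse routing structure, not the paper's implementation.

```python
# Toy mixture of decoder sublayers: many full-rank linear maps, of which only
# top-k (per token, by a learned gate) are active.
import torch
import torch.nn as nn

class SparseDecoderMixture(nn.Module):
    def __init__(self, d_in, d_out, n_experts=64, k=4):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.Parameter(torch.randn(n_experts, d_in, d_out) * d_in ** -0.5)
        self.k = k

    def forward(self, h):                                  # h: (tokens, d_in)
        scores = self.gate(h)                              # (tokens, n_experts)
        top_val, top_idx = scores.topk(self.k, dim=-1)
        weights = top_val.softmax(dim=-1)                  # (tokens, k)
        chosen = self.experts[top_idx]                     # (tokens, k, d_in, d_out)
        out = torch.einsum("td,tkdo->tko", h, chosen)      # apply each active sublayer
        return (weights.unsqueeze(-1) * out).sum(dim=1)    # (tokens, d_out)

layer = SparseDecoderMixture(d_in=32, d_out=32)
print(layer(torch.randn(10, 32)).shape)                    # torch.Size([10, 32])
```

Each active sublayer is a full-rank linear map, which mirrors the abstract's point that expressivity is preserved even though only a handful of sublayers fire per token; the naive parameterization here would of course be far too large at the scales quoted above.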
Authors:Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi
Abstract:
Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM's superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at https://github.com/flyakon/AgriFM.
中文: 该摘要介绍了AgriFM,这是一个专为农作物制图设计的模型,通过改进的Video Swin Transformer架构整合多源卫星数据的多尺度时空特征,在各项任务中均优于现有先进方法。
English: The abstract introduces AgriFM, a specialized foundation model that enhances crop mapping by integrating multi-scale spatiotemporal features from satellite data using a modified Video Swin Transformer, achieving superior performance over existing methods.
Authors:Zijing Wang, Xingle Xu, Yongkang Liu, Yiqun Zhang, Peiqin Lin, Shi Feng, Xiaocui Yang, Daling Wang, Hinrich Schütze
Abstract:
Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Although recent model merging methods have shown promising results, they struggle to maintain performance gains as the number of merged models increases. In this paper, we investigate the key obstacles that limit the scalability of model merging when integrating a large number of expert models. First, we prove that there is an upper bound on model merging. Further theoretical analysis reveals that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged. Gaussian Width shows that the marginal benefit of merging additional models diminishes according to a strictly concave function. This implies that the effective parameter space becomes rapidly saturated as the number of merged models increases. Furthermore, using Approximate Kinematics Theory, we prove the existence of a unique optimal threshold beyond which adding more models does not yield significant performance improvements. At the same time, we introduce a straightforward Reparameterized Heavy-Tailed method (RHT) to extend the coverage of the merged model, thereby enhancing its performance. Empirical results on 12 benchmarks, including both knowledge-intensive and general-purpose tasks, validate our theoretical analysis. We believe that these results spark further research beyond the current scope of model merging. The source code is in the Github repository: https://github.com/wzj1718/ModelMergingAnalysis.
中文: 模型合并通过整合专家模型减少资源消耗,但受限于有效参数空间而难以扩展,我们引入新方法扩展覆盖范围并提升性能。
English: Model merging reduces resource usage by combining expert models but faces scalability issues due to limited effective parameter space, which we address with a new method that extends coverage and improves performance.
Authors:M. Mebratu, W. L. K. Wu
Abstract:
Extragalactic foregrounds in cosmic microwave background (CMB) observations are both a source of cosmological and astrophysical information and a nuisance to the CMB. Effective field-level modeling that captures their non-Gaussian statistical distributions is increasingly important for optimal information extraction, particularly given the precise and low-noise observations from current and upcoming experiments. We explore the use of Wavelet Flow (WF) models to tackle the novel task of modeling the field-level probability distributions of multi-component CMB secondaries and foregrounds. Specifically, we jointly train correlated CMB lensing convergence ($\kappa$) and cosmic infrared background (CIB) maps with a WF model and obtain a network that statistically recovers the input to high accuracy -- the trained network generates samples of $\kappa$ and CIB fields whose average power spectra are within a few percent of the inputs across all scales, and whose Minkowski functionals are similarly accurate compared to the inputs. Leveraging the multiscale architecture of these models, we fine-tune both the model parameters and the priors at each scale independently, optimizing performance across different resolutions. These results demonstrate that WF models can accurately simulate correlated components of CMB secondaries, supporting improved analysis of cosmological data. Our code and trained models can be found here (https://github.com/matiwosm/HybridPriorWavletFlow.git).
中文: 小波流模型能有效模拟宇宙微波背景次级相关成分,如透镜收敛和宇宙红外背景,实现精确的场级概率分布建模,从而提升宇宙学数据分析的准确性。
English: Wavelet Flow models effectively simulate correlated cosmic microwave background secondaries like lensing convergence and cosmic infrared background, enabling accurate field-level probability distribution modeling for enhanced cosmological data analysis.
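The power-spectrum check quoted above (generated maps within a few percent of the inputs across all scales) amounts to comparing azimuthally averaged 2D power spectra; a NumPy sketch of that diagnostic on stand-in Gaussian fields is shown below. The binning choices are illustrative, and the fields are not real $\kappa$ or CIB maps.

```python
# Azimuthally averaged power spectrum of a 2D field, used to compare generated
# samples against input maps.
import numpy as np

def radial_power_spectrum(field, n_bins=20):
    f = np.fft.fftshift(np.fft.fft2(field))
    power = np.abs(f) ** 2
    ny, nx = field.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(y - ny // 2, x - nx // 2)                 # radial wavenumber proxy
    bins = np.linspace(0, r.max(), n_bins + 1)
    which = np.digitize(r.ravel(), bins) - 1
    return np.array([power.ravel()[which == i].mean() for i in range(n_bins)])

rng = np.random.default_rng(0)
truth = rng.normal(size=(128, 128))
sample = rng.normal(size=(128, 128))
ratio = radial_power_spectrum(sample) / radial_power_spectrum(truth)
print(np.round(ratio, 2))        # close to 1 everywhere when the statistics match
```

A similar per-bin ratio computed between WF samples and the training maps is the kind of "within a few percent across all scales" statement made in the abstract.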
Authors:Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen
Abstract:
The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate the matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose utilizing the faster FP8 Matmul instruction that accumulates in FP16. This instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metric loss. The code will be available at https://github.com/thu-ml/SageAttention.
中文: SageAttention2++ 通过采用更快的FP8矩阵乘法指令,在保持多种模型精度的同时,将注意力机制速度提升至FlashAttention的3.9倍。
English: SageAttention2++ significantly accelerates attention mechanisms by using faster FP8 matrix multiplication instructions, achieving a 3.9x speedup over FlashAttention while maintaining accuracy across multiple model types.
Authors:Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson
Abstract:
Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, we show that CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
中文: 本文指出无分类器引导(CFG)并不对应一个标准的扩散模型,并提出了一种吉布斯式采样方法进行修正,在条件生成任务中同时提升了样本质量和多样性。
English: This paper identifies that Classifier-Free Guidance (CFG) does not correspond to a proper diffusion model and proposes a Gibbs-like sampling method to correct it, enhancing both quality and diversity in conditional generation tasks.
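For reference, the standard CFG combination that the paper analyses is a one-line extrapolation from the unconditional toward the conditional denoiser output. The sketch below shows only that baseline rule, without the Rényi-divergence correction or the Gibbs-like refinement proposed in the paper, and the toy denoiser is a placeholder.

```python
# Standard classifier-free guidance: extrapolate from the unconditional output
# toward the conditional one with strength w.
import torch

def cfg_eps(denoiser, x, t, cond, w=3.0):
    eps_cond = denoiser(x, t, cond)
    eps_uncond = denoiser(x, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)

# toy denoiser: conditioning simply shifts the prediction
denoiser = lambda x, t, c: x * 0.1 + (0.0 if c is None else 1.0)
x = torch.randn(2, 4)
print(cfg_eps(denoiser, x, t=10, cond="class_3"))
```

The paper's point is that sampling with this combined prediction does not, by itself, target the $w$-tilted distribution; the missing repulsive correction term is what their analysis supplies.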
Authors:Wenhu Li, Niki van Stein, Thomas Bäck, Elena Raponi
Abstract:
Bayesian optimization (BO) is a powerful class of algorithms for optimizing expensive black-box functions, but designing effective BO algorithms remains a manual, expertise-driven task. Recent advancements in Large Language Models (LLMs) have opened new avenues for automating scientific discovery, including the automatic design of optimization algorithms. While prior work has used LLMs within optimization loops or to generate non-BO algorithms, we tackle a new challenge: Using LLMs to automatically generate full BO algorithm code. Our framework uses an evolution strategy to guide an LLM in generating Python code that preserves the key components of BO algorithms: An initial design, a surrogate model, and an acquisition function. The LLM is prompted to produce multiple candidate algorithms, which are evaluated on the established Black-Box Optimization Benchmarking (BBOB) test suite from the COmparing Continuous Optimizers (COCO) platform. Based on their performance, top candidates are selected, combined, and mutated via controlled prompt variations, enabling iterative refinement. Despite no additional fine-tuning, the LLM-generated algorithms outperform state-of-the-art BO baselines in 19 (out of 24) BBOB functions in dimension 5 and generalize well to higher dimensions, and different tasks (from the Bayesmark framework). This work demonstrates that LLMs can serve as algorithmic co-designers, offering a new paradigm for automating BO development and accelerating the discovery of novel algorithmic combinations. The source code is provided at https://github.com/Ewendawi/LLaMEA-BO.
中文摘要:本研究提出一种利用大型语言模型通过进化策略自动生成并优化贝叶斯优化算法的框架,在多数基准测试中表现优于现有方法。
English Summary: This study introduces a framework using Large Language Models to automatically generate and refine Bayesian optimization algorithms through evolutionary strategies, achieving superior performance over existing methods in most benchmark tests.
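The generate-evaluate-vary loop described above can be sketched as follows. The `ask_llm` and `evaluate_on_bbob` functions are hypothetical placeholders standing in for the LLM API call and the COCO/BBOB scoring used in the paper.

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical stub for an LLM call returning Python source for a BO
    algorithm (initial design, surrogate model, acquisition function)."""
    return "def optimize(f, bounds, budget):\n    ...\n"

def evaluate_on_bbob(code: str) -> float:
    """Hypothetical fitness: the paper scores candidates on the BBOB suite;
    here a random score stands in for illustration."""
    return random.random()

def evolve_bo_algorithms(generations: int = 3, population_size: int = 4, top_k: int = 2):
    population = [ask_llm("Write a Bayesian optimization algorithm in Python.")
                  for _ in range(population_size)]
    best = None
    for gen in range(generations):
        scored = sorted(((c, evaluate_on_bbob(c)) for c in population),
                        key=lambda cs: cs[1], reverse=True)
        parents = scored[:top_k]                      # selection
        best = parents[0]
        # Variation: prompt the LLM to combine and mutate the best candidates.
        children = [ask_llm("Improve and combine these BO algorithms:\n"
                            + "\n".join(c for c, _ in parents))
                    for _ in range(population_size - top_k)]
        population = [c for c, _ in parents] + children
        print(f"generation {gen}: best score {best[1]:.3f}")
    return best

evolve_bo_algorithms()
```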
Authors:Nils Neukirch, Johanna Vielhaben, Nils Strodthoff
Abstract:
Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.
中文摘要:本文提出了一种条件扩散模型,以概率方式将特征空间映射到输入空间,有效提升了多种分类器中深度神经网络的可解释性,并展示了卓越的重建能力和应用前景。
English Summary: This paper introduces a conditional diffusion model to probabilistically map feature spaces to input spaces, enhancing the interpretability of deep neural networks across various classifiers with strong reconstruction and application potential.
Authors:Yuan Gao, Ruiqi Shu, Hao Wu, Fan Xu, Yanfei Xiang, Ruijian Gou, Qingsong Wen, Xian Wu, Kun Wang, Xiaomeng Huang
Abstract:
Long-term, high-fidelity simulation of slow-changing physical systems, such as the ocean and climate, presents a fundamental challenge in scientific computing. Traditional autoregressive machine learning models often fail in these tasks as minor errors accumulate and lead to rapid forecast degradation. To address this problem, we propose NeuralOM, a general neural operator framework designed for simulating complex, slow-changing dynamics. NeuralOM's core consists of two key innovations: (1) a Progressive Residual Correction Framework that decomposes the forecasting task into a series of fine-grained refinement steps, effectively suppressing long-term error accumulation; and (2) a Physics-Guided Graph Network whose built-in adaptive messaging mechanism explicitly models multi-scale physical interactions, such as gradient-driven flows and multiplicative couplings, thereby enhancing physical consistency while maintaining computational efficiency. We validate NeuralOM on the challenging task of global Subseasonal-to-Seasonal (S2S) ocean simulation. Extensive experiments demonstrate that NeuralOM not only surpasses state-of-the-art models in forecast accuracy and long-term stability, but also excels in simulating extreme events. For instance, at a 60-day lead time, NeuralOM achieves a 13.3% lower RMSE compared to the best-performing baseline, offering a stable, efficient, and physically-aware paradigm for data-driven scientific computing. Code link: https://github.com/YuanGao-YG/NeuralOM.
中文: NeuralOM提出了一种神经算子框架,通过渐进残差修正和物理引导图网络有效抑制慢变系统(如海洋)长期模拟中的误差累积,提升物理一致性,在预测精度和稳定性上均超越现有最佳模型。
English: NeuralOM introduces a neural operator framework with progressive residual correction and physics-guided graph networks to effectively suppress error accumulation and enhance physical consistency in long-term simulations of slow-changing systems like oceans, achieving superior accuracy and stability over existing models.
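A schematic of the progressive residual correction idea, reduced to a toy autoregressive forecaster: a coarse operator predicts the next state and a stack of small correctors refine it before the state is fed back in. Dimensions and layers are illustrative, not NeuralOM's architecture.

```python
import torch
import torch.nn as nn

class ResidualCorrectionForecaster(nn.Module):
    """Toy progressive residual correction: a base operator makes a coarse
    next-state prediction and several correctors refine it, limiting the
    per-step error that an autoregressive rollout can accumulate."""
    def __init__(self, dim: int, n_correctors: int = 3):
        super().__init__()
        self.base = nn.Linear(dim, dim)                     # coarse forecast
        self.correctors = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_correctors)
        )

    def step(self, state: torch.Tensor) -> torch.Tensor:
        pred = self.base(state)
        for corrector in self.correctors:
            pred = pred + corrector(pred)                   # fine-grained refinement
        return pred

    def rollout(self, state: torch.Tensor, horizon: int) -> torch.Tensor:
        preds = []
        for _ in range(horizon):
            state = self.step(state)
            preds.append(state)
        return torch.stack(preds, dim=1)                    # (batch, horizon, dim)

model = ResidualCorrectionForecaster(dim=32)
print(model.rollout(torch.randn(4, 32), horizon=60).shape)  # torch.Size([4, 60, 32])
```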
Authors:Devran Ugurlu, Shuang Qian, Elliot Fairweather, Charlene Mauger, Bram Ruijsink, Laura Dal Toso, Yu Deng, Marina Strocchi, Reza Razavi, Alistair Young, Pablo Lamata, Steven Niederer, Martin Bishop
Abstract:
A cardiac digital twin is a virtual replica of a patient's heart for screening, diagnosis, prognosis, risk assessment, and treatment planning of cardiovascular diseases. This requires an anatomically accurate patient-specific 3D structural representation of the heart, suitable for electro-mechanical simulations or study of disease mechanisms. However, generation of cardiac digital twins at scale is demanding and there are no public repositories of models across demographic groups. We describe an automatic open-source pipeline for creating patient-specific left and right ventricular meshes from cardiovascular magnetic resonance images, its application to a large cohort of ~55000 participants from UK Biobank, and the construction of the most comprehensive cohort of adult heart models to date, comprising 1423 representative meshes across sex (male, female), body mass index (range: 16 - 42 kg/m$^2$) and age (range: 49 - 80 years). Our code is available at https://github.com/cdttk/biv-volumetric-meshing/tree/plos2025 , and pre-trained networks, representative volumetric meshes with fibers and UVCs will be made available soon.
心脏数字孪生是一种用于疾病管理的虚拟心脏模型,本研究开发了一种自动化流程,能从心脏磁共振图像生成患者特异的心室网格,并应用于英国生物银行的大规模队列,构建了包含1423个涵盖不同人口统计学特征的成人心脏模型的最全面数据集。
A cardiac digital twin is a virtual heart model used for disease management, and this study presents an automated pipeline to generate patient-specific ventricular meshes from MRI data, applied to a large UK Biobank cohort to create a comprehensive set of 1423 representative adult heart models across demographics.
Authors:Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang
Abstract:
Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework's feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at https://github.com/strangeAlan/CoE.
Chinese: 专家协作(CoE)框架通过将多类型信息编码为统一的异构多重网络,解决了异构数据融合的挑战,并利用新颖的大间隔机制和定制优化策略,实现了领域特定编码器之间的稳健协作。
English: The Cooperation of Experts (CoE) framework addresses the challenge of fusing heterogeneous data by encoding multi-typed information into unified heterogeneous multiplex networks, enabling robust collaboration among domain-specific encoders through a novel large margin mechanism and tailored optimization strategy.
Authors:Dooho Lee, Myeong Kong, Sagad Hamid, Cheonwoo Lee, Jaemin Yoo
Abstract:
We revisit DropEdge, a data augmentation technique for GNNs which randomly removes edges to expose diverse graph structures during training. While being a promising approach to effectively reduce overfitting on specific connections in the graph, we observe that its potential performance gain in supervised learning tasks is significantly limited. To understand why, we provide a theoretical analysis showing that the limited performance of DropEdge comes from the fundamental limitation that exists in many GNN architectures. Based on this analysis, we propose Aggregation Buffer, a parameter block specifically designed to improve the robustness of GNNs by addressing the limitation of DropEdge. Our method is compatible with any GNN model, and shows consistent performance improvements on multiple datasets. Moreover, our method effectively addresses well-known problems such as degree bias or structural disparity as a unifying solution. Code and datasets are available at https://github.com/dooho00/agg-buffer.
Chinese: 本研究分析了DropEdge在图神经网络中因架构限制导致的性能局限,并提出通用参数模块Aggregation Buffer,该模块通过增强模型鲁棒性在多个数据集上实现持续性能提升,同时有效解决度偏差等已知问题。
English: This study analyzes DropEdge's limited performance in GNNs due to architectural constraints and introduces Aggregation Buffer, a universally compatible parameter block that enhances robustness and consistently improves performance across datasets while addressing issues like degree bias.
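For reference, DropEdge itself is a one-line augmentation over a COO edge list; the paper's Aggregation Buffer is a separate learned parameter block not reproduced here.

```python
import torch

def drop_edge(edge_index: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """DropEdge: independently remove each edge with probability p during
    training so the GNN sees a different sparsified graph each epoch."""
    keep_mask = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep_mask]

# Toy 5-node ring graph in (2, num_edges) COO format, as used by PyG-style GNNs.
edge_index = torch.tensor([[0, 1, 2, 3, 4],
                           [1, 2, 3, 4, 0]])
print(drop_edge(edge_index, p=0.4))
```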
Authors:Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
Abstract:
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. First, SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. To improve draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. Our code is available at https://github.com/jycha98/SpecExtend .
推测解码在长输入下性能下降,而SpecExtend通过高效注意力机制和新型KV缓存策略,无需额外训练即可将长序列处理速度提升最高2.22倍。
Speculative decoding performance declines with longer inputs, but SpecExtend enhances it for long sequences using efficient attention and a novel KV cache strategy, achieving up to 2.22x speedup on inputs up to 16K tokens without extra training.
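The idea of using the target model's attention to pick which cached context the draft model keeps can be sketched as below. The exact Cross-model Retrieval rule in SpecExtend is not reproduced; the averaging and top-k selection are simplifying assumptions.

```python
import torch

def select_context_for_draft(target_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """Score each cached context position by the target model's attention mass
    (averaged over heads and recent query steps) and keep the top positions
    for the draft model's KV cache.

    target_attn: (heads, recent_queries, context_len) attention weights.
    Returns sorted indices of the `keep` most relevant context positions.
    """
    scores = target_attn.mean(dim=(0, 1))                 # (context_len,)
    top = torch.topk(scores, k=min(keep, scores.numel())).indices
    return torch.sort(top).values

attn = torch.rand(8, 4, 16000).softmax(dim=-1)            # dummy attention over a 16K-token input
kept = select_context_for_draft(attn, keep=2048)
print(kept.shape)  # torch.Size([2048])
```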
Authors:Xiaowen Ma, Zhenliang Ni, Shuai Xiao, Xinghao Chen
Abstract:
In long-term time series forecasting, different variables often influence the target variable over distinct time intervals, a challenge known as the multi-delay issue. Traditional models typically process all variables or time points uniformly, which limits their ability to capture complex variable relationships and obtain non-trivial time representations. To address this issue, we propose TimePro, an innovative Mamba-based model that constructs variate- and time-aware hyper-states. Unlike conventional approaches that merely transfer plain states across variable or time dimensions, TimePro preserves the fine-grained temporal features of each variate token and adaptively selects the focused time points to tune the plain state. The reconstructed hyper-state can perceive both variable relationships and salient temporal information, which helps the model make accurate forecasting. In experiments, TimePro performs competitively on eight real-world long-term forecasting benchmarks with satisfactory linear complexity. Code is available at https://github.com/xwmaxwma/TimePro.
中文: TimePro是一种基于Mamba的创新模型,通过构建变量和时间感知的超状态来解决长期时间序列预测中的多延迟问题,在真实世界基准测试中以线性复杂度实现了优异性能。
English: TimePro is a novel Mamba-based model that addresses the multi-delay issue in long-term time series forecasting by constructing variate- and time-aware hyper-states, achieving competitive performance with linear complexity on real-world benchmarks.
Authors:Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
Abstract:
Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.
中文:提出的ConText-CIR框架通过文本概念一致性损失和合成数据生成技术,在多个基准测试中实现了组合图像检索的最新性能突破。
English: The proposed ConText-CIR framework improves composed image retrieval by using a Text Concept-Consistency loss and synthetic data generation, achieving state-of-the-art performance on multiple benchmarks.
Authors:Ryota Ushio, Takashi Ishida, Masashi Sugiyama
Abstract:
While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory.
中文: 本文通过理论分析硬标签偏差衰减特性,并针对含噪软标签提出基于保序校准的一致性估计方法,在多种数据集上验证了二元分类中贝叶斯误差估计的有效性。
English: This paper enhances binary classification by refining Bayes error estimation through theoretical analysis of bias decay with hard labels and introducing a consistent estimator using isotonic calibration for corrupted soft labels, validated across various datasets.
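The soft-label Bayes error estimator that this work builds on reduces to a simple sample mean; the paper's contributions (bias analysis for hard labels, isotonic calibration for corrupted soft labels) sit on top of this basic form.

```python
import numpy as np

def bayes_error_from_soft_labels(soft_labels) -> float:
    """Instance-free Bayes error estimate for binary classification: with clean
    soft labels p_i = P(y = 1 | x_i), the optimal error rate E[min(p, 1 - p)]
    is estimated by a sample mean."""
    p = np.asarray(soft_labels, dtype=float)
    return float(np.mean(np.minimum(p, 1.0 - p)))

# Toy example: well-separated classes yield a small estimated Bayes error.
print(bayes_error_from_soft_labels([0.02, 0.95, 0.10, 0.88, 0.99]))  # 0.06
```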
Authors:Sunwoo Kim, Soo Yong Lee, Jaemin Yoo, Kijung Shin
Abstract:
While graph neural networks (GNNs) have shown remarkable performance across diverse graph-related tasks, their high-dimensional hidden representations render them black boxes. In this work, we propose Graph Lingual Network (GLN), a GNN built on large language models (LLMs), with hidden representations in the form of human-readable text. Through careful prompt design, GLN incorporates not only the message passing module of GNNs but also advanced GNN techniques, including graph attention and initial residual connection. The comprehensibility of GLN's hidden representations enables an intuitive analysis of how node representations change (1) across layers and (2) under advanced GNN techniques, shedding light on the inner workings of GNNs. Furthermore, we demonstrate that GLN achieves strong zero-shot performance on node classification and link prediction, outperforming existing LLM-based baseline methods.
中文: 图语言网络(GLN)将大语言模型与图神经网络相结合,生成可解释的文本形式隐藏表示,不仅直观地揭示了GNN的内部工作机制,还在节点分类和链接预测任务中实现了卓越的零样本性能。
English: The Graph Lingual Network (GLN) integrates large language models with graph neural networks to create interpretable, text-based hidden representations, enabling intuitive analysis of GNN operations and achieving superior zero-shot performance in node classification and link prediction tasks.
Authors:Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li
Abstract:
Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5\% on average) and grounding accuracy (+1.9\% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at https://github.com/WangHanLinHenry/SPA-RL-Agent.
中文摘要:强化学习在训练LLM智能体时面临延迟奖励的挑战,而提出的逐步进展归因(SPA)框架通过将最终奖励分解为逐步贡献来解决这一问题,有效提升训练效果和性能表现。
English Summary: Reinforcement learning faces delayed reward challenges in training LLM agents, which the proposed Stepwise Progress Attribution (SPA) framework addresses by decomposing final rewards into stepwise contributions to improve training effectiveness and performance.
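A stripped-down sketch of the reward-redistribution step: a progress estimator assigns a scalar contribution to each step and is trained so the contributions summed over a trajectory match the delayed final reward. The feature encoding and the grounding signal used in SPA are simplified away here.

```python
import torch
import torch.nn as nn

class ProgressEstimator(nn.Module):
    """Maps per-step features to scalar progress contributions."""
    def __init__(self, step_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(step_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, steps: torch.Tensor) -> torch.Tensor:  # (batch, num_steps, step_dim)
        return self.net(steps).squeeze(-1)                    # (batch, num_steps)

estimator = ProgressEstimator(step_dim=16)
optimizer = torch.optim.Adam(estimator.parameters(), lr=1e-3)

steps = torch.randn(8, 10, 16)                    # 8 toy trajectories, 10 steps each
final_reward = torch.randint(0, 2, (8,)).float()  # delayed task-completion reward

for _ in range(200):
    contributions = estimator(steps)
    loss = ((contributions.sum(dim=1) - final_reward) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned per-step contributions then act as dense intermediate rewards
# during policy optimization.
print(estimator(steps)[0])
```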
Authors:Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Abstract:
Time series forecasting plays a crucial role in many real-world applications, and numerous complex forecasting models have been proposed in recent years. Despite their architectural innovations, most state-of-the-art models report only marginal improvements -- typically just a few thousandths in standard error metrics. These models often incorporate complex data embedding layers to transform raw inputs into higher-dimensional representations to enhance accuracy. But are data embedding techniques actually effective in time series forecasting? Through extensive ablation studies across fifteen state-of-the-art models and four benchmark datasets, we find that removing data embedding layers from many state-of-the-art models does not degrade forecasting performance. In many cases, it improves both accuracy and computational efficiency. The gains from removing embedding layers often exceed the performance differences typically reported between competing models. Code available at: https://github.com/neuripsdataembedidng/DataEmbedding
Chinese: 最新研究表明,去除先进时间序列预测模型中的复杂数据嵌入层通常能同时提升预测精度和计算效率,其性能改进甚至超过模型间常规比较的差异。
English: Recent research reveals that removing complex data embedding layers from advanced time series forecasting models often improves both accuracy and computational efficiency, with performance gains surpassing typical model comparison metrics.
Authors:Zechen Li, Lanqing Yang, Yiheng Bian, Hao Pan, Yongjian Fu, Yezhou Wang, Yi-Chao Chen, Guangtao Xue, Ju Ren
Abstract:
This paper presents an innovative frequency-embedded 3D Gaussian splatting (3DGS) algorithm for wideband radio-frequency (RF) radiance field modeling, offering an advancement over the existing works limited to single-frequency modeling. Grounded in fundamental physics, we uncover the complex relationship between EM wave propagation behaviors and RF frequencies. Inspired by this, we design an EM feature network with attenuation and radiance modules to learn the complex relationships between RF frequencies and the key properties of each 3D Gaussian, specifically the attenuation factor and RF signal intensity. By training the frequency-embedded 3DGS model, we can efficiently reconstruct RF radiance fields at arbitrary unknown frequencies within a given 3D environment. Finally, we propose a large-scale power angular spectrum (PAS) dataset containing 50000 samples ranging from 1 to 100 GHz in 6 indoor environments, and conduct extensive experiments to verify the effectiveness of our method. Our approach achieves an average Structural Similarity Index Measure (SSIM) up to 0.72, and a significant improvement up to 17.8% compared to the current state-of-the-art (SOTA) methods trained on individual test frequencies. Additionally, our method achieves an SSIM of 0.70 without prior training on these frequencies, which represents only a 2.8% performance drop compared to models trained with full PAS data. This demonstrates our model's capability to estimate PAS at unknown frequencies. For related code and datasets, please refer to https://github.com/sim-2-real/Wideband3DGS.
本文提出了一种频率嵌入的3D高斯泼溅算法,通过建立电磁特征网络实现了宽带射频辐射场的任意频率重建,相比现有方法性能提升最高达17.8%,并在未训练频率上仍保持优异性能。
This paper introduces a frequency-embedded 3D Gaussian splatting algorithm that advances wideband RF radiance field modeling by enabling reconstruction at arbitrary frequencies, achieving up to 17.8% improvement over existing methods and demonstrating robust performance even on untrained frequencies.
Authors:Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin
Abstract:
State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
中文: 本文提出Simba方法,通过对状态空间模型进行分层稀疏化处理,在高层实施更激进的令牌剪枝以构建信息高速通道,在相同计算量下显著提升了Mamba基线模型在多项自然语言任务中的性能表现。
English: This paper introduces Simba, a hierarchical sparsification method for state-space models that prunes tokens more aggressively in upper layers to create information highways, improving both efficiency and performance over the baseline Mamba model across various language tasks.
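A schematic of the hierarchical pruning schedule: upper layers keep fewer tokens than lower ones. The `importance` score stands in for Simba's accumulated-recurrence criterion, whose exact form is not reproduced here.

```python
import torch

def hierarchical_prune(tokens: torch.Tensor, importance: torch.Tensor,
                       layer: int, num_layers: int, min_keep: float = 0.5) -> torch.Tensor:
    """Keep a layer-dependent fraction of tokens, ranked by a global-impact
    score, so upper layers behave more like highways.

    tokens:     (batch, seq_len, dim)
    importance: (batch, seq_len)
    """
    # Keep ratio decays linearly from 1.0 at the bottom layer to min_keep at the top.
    keep_ratio = 1.0 - (1.0 - min_keep) * layer / max(num_layers - 1, 1)
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = torch.topk(importance, k=k, dim=1).indices.sort(dim=1).values
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(2)))

x, score = torch.randn(2, 128, 64), torch.rand(2, 128)
print(hierarchical_prune(x, score, layer=11, num_layers=12).shape)  # torch.Size([2, 64, 64])
```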
Authors:Kianté Brantley, Mingyu Chen, Zhaolin Gao, Jason D. Lee, Wen Sun, Wenhao Zhan, Xuezhou Zhang
Abstract:
Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose $A^*$-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function $V^*$, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a single generation per prompt. Theoretically, we establish performance guarantees and prove that the KL-regularized RL objective can be optimized without requiring complex exploration strategies. Empirically, $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks, while reducing training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL. Implementation of $A^*$-PO can be found at https://github.com/ZhaolinGao/A-PO.
中文: 提出的A*-PO框架通过离线采样估计最优优势函数并进行在线策略更新,有效优化大语言模型的推理能力,在保持竞争力的同时显著降低了计算和内存开销。
English: The proposed A*-PO framework efficiently optimizes large language models for reasoning tasks by estimating the optimal advantage function through offline sampling and on-policy updates, achieving competitive performance while significantly reducing computational and memory costs compared to existing methods.
Authors:Danush Khanna, Pratinav Seth, Sidhaarth Sredharan Murali, Aditya Kumar Guru, Siddharth Shukla, Tanuj Tyagi, Sandeep Chaurasia, Kripabandhu Ghosh
Abstract:
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept .
中文: 本研究提出了MultiManip数据集和SELF-PERCEPT框架,以解决大型语言模型在多轮对话中检测微妙心理操纵的难题,相比现有模型展现出显著性能提升。
English: The study introduces the MultiManip dataset and SELF-PERCEPT framework to address LLMs' challenges in detecting nuanced mental manipulation in multi-turn dialogues, showing significant improvement over existing models.
Authors:Mengmeng Chen, Xiaohu Wu, Qiqi Liu, Tiantian He, Yew-Soon Ong, Yaochu Jin, Qicheng Lao, Han Yu
Abstract:
Multi-objective optimization (MOO) arises extensively in machine learning and aims to find a set of Pareto-optimal solutions, called the Pareto front; e.g., it is fundamental for multiple avenues of research in federated learning (FL). Pareto-Front Learning (PFL) is a powerful method implemented using Hypernetworks (PHNs) to approximate the Pareto front. This method enables the acquisition of a mapping function from a given preference vector to the solutions on the Pareto front. However, most existing PFL approaches still face two challenges: (a) sampling rays in high-dimensional spaces; (b) failing to cover the entire Pareto front, which has a convex shape. Here, we introduce a novel PFL framework, called PHN-HVVS, which decomposes the design space into Voronoi grids and deploys a genetic algorithm (GA) for Voronoi grid partitioning within high-dimensional space. We put forward a new loss function, which effectively contributes to more extensive coverage of the resultant Pareto front and maximizes the HV Indicator. Experimental results on multiple MOO machine learning tasks demonstrate that PHN-HVVS outperforms the baselines significantly in generating the Pareto front. Also, we illustrate that PHN-HVVS advances the methodologies of several recent problems in the FL field. The code is available at https://github.com/buptcmm/phnhvvs.
中文:PHN-HVVS框架通过引入Voronoi网格分解和遗传算法,有效解决了帕累托前沿学习中的覆盖难题,在多项多目标优化任务中展现出优于基准方法的性能。
English: The PHN-HVVS framework addresses challenges in Pareto-Front Learning by employing Voronoi grid decomposition and a genetic algorithm to enhance coverage of the Pareto front, demonstrating superior performance in multi-objective optimization tasks.
Authors:Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien
Abstract:
Recent efforts to develop trustworthy AI systems with accountability guarantees have led to widespread use of machine learning formulations incorporating external requirements, or constraints. These requirements are often enforced via penalization--adding fixed-weight terms to the task loss. We argue this approach is fundamentally ill-suited since there may be no penalty coefficient that simultaneously ensures constraint satisfaction and optimal constrained performance, i.e., that truly solves the constrained problem. Moreover, tuning these coefficients requires costly trial-and-error, incurring significant time and computational overhead. We, therefore, advocate for broader adoption of tailored constrained optimization methods--such as the Lagrangian approach, which jointly optimizes the penalization "coefficients" (the Lagrange multipliers) and the model parameters. Such methods (i) truly solve the constrained problem and do so accountably, by clearly defining feasibility and verifying when it is achieved, (ii) eliminate the need for extensive penalty tuning, and (iii) integrate seamlessly with modern deep learning pipelines.
中文摘要:当前AI系统常采用固定惩罚项来施加约束,但这种方法既无法确保约束满足与最优性能,又需耗费大量调参成本,因此采用拉格朗日法等定制化约束优化方法更能有效实现可信AI的可问责发展。
English summary: Current AI systems often use fixed penalty terms to enforce constraints, but this approach fails to guarantee both constraint satisfaction and optimal performance while requiring costly tuning, making tailored constrained optimization methods like the Lagrangian approach more effective for accountable AI development.
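A minimal sketch of the advocated Lagrangian approach on a toy problem: the model parameters are updated by gradient descent on the Lagrangian while the multiplier is updated by projected gradient ascent, so the penalty "coefficient" is learned rather than hand-tuned. The objective and constraint below are purely illustrative.

```python
import torch

theta = torch.tensor([2.0, -1.0], requires_grad=True)
lmbda = torch.tensor(0.0, requires_grad=True)
opt_theta = torch.optim.SGD([theta], lr=0.05)
opt_lmbda = torch.optim.SGD([lmbda], lr=0.05, maximize=True)  # gradient ascent on the multiplier

def task_loss(t):    # toy task objective
    return ((t - torch.tensor([3.0, 3.0])) ** 2).sum()

def constraint(t):   # toy requirement ||theta||^2 <= 4, written as g(theta) <= 0
    return (t ** 2).sum() - 4.0

for _ in range(500):
    lagrangian = task_loss(theta) + lmbda * constraint(theta)
    opt_theta.zero_grad()
    opt_lmbda.zero_grad()
    lagrangian.backward()
    opt_theta.step()
    opt_lmbda.step()
    with torch.no_grad():
        lmbda.clamp_(min=0.0)  # multipliers stay non-negative

print(theta.detach(), lmbda.item(), constraint(theta).item())
```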
Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Abstract:
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 16.8 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
中文: 视觉语言模型在常见物体检测上表现出色,但难以泛化到分布外概念,为此我们提出Roboflow100-VL基准,通过多模态指令实现少样本概念对齐以解决这一局限。
English: Vision-language models excel at detecting common objects but struggle with out-of-distribution concepts, prompting the introduction of Roboflow100-VL for few-shot alignment using multimodal instructions to address this limitation.
Authors:Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu
Abstract:
The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., near 1000x over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
中文: Prot2Token提出了一种统一框架,将多种蛋白质预测任务转化为标准化的下一标记预测格式,通过多任务学习实现了与专业模型相当或更优的性能,并显著提升了计算效率。
English: Prot2Token introduces a unified framework that transforms various protein prediction tasks into a standardized next-token prediction format, enabling efficient multi-task learning with performance matching or exceeding specialized models while achieving significant speedups.
Authors:Jihoon Lee, Min Song
Abstract:
Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
中文: RVCD是一种先进的对比解码方法,通过在逻辑层面同时利用负面和正面的AI生成图像,有效抑制视觉语言模型中的物体幻觉问题,相比现有方法展现出显著改进。
English: RVCD is an advanced contrastive decoding method that suppresses object hallucination in vision-language models by leveraging both negative and positive AI-generated images at the logit level, showing significant improvements over existing approaches.
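The abstract only states that positive and negative AI-generated reference images are combined at the logit level, so the snippet below shows the generic logit-level contrastive decoding pattern RVCD builds on; the specific weighting is an assumption, not RVCD's published formula.

```python
import torch

def contrastive_decode_logits(logits_orig: torch.Tensor,
                              logits_pos: torch.Tensor,
                              logits_neg: torch.Tensor,
                              alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Boost token directions supported by a positive reference image and
    penalize those supported by a negative one. All inputs are next-token
    logits of shape (batch, vocab_size); alpha and beta are illustrative knobs."""
    return logits_orig + alpha * logits_pos - beta * logits_neg

vocab = 32000
orig, pos, neg = (torch.randn(1, vocab) for _ in range(3))
next_token = contrastive_decode_logits(orig, pos, neg).argmax(dim=-1)
print(next_token)
```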
Authors:Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, Yunxuan Li
Abstract:
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history contexts only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why they are beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at https://github.com/shenao-zhang/BARL.
Chinese Summary: 本研究提出BARL这一贝叶斯强化学习框架,通过增强大语言模型的反思性探索与策略适应能力,在推理任务中实现了优于传统方法的性能与更高的效率。
English Summary: The study introduces BARL, a Bayesian reinforcement learning framework that enhances large language models' reflective exploration and strategy adaptation, outperforming standard methods in reasoning tasks with greater efficiency.
Authors:Dongyu Luo, Kelin Yu, Amir-Hossein Shahidzadeh, Cornelia Fermüller, Yiannis Aloimonos, Ruohan Gao
Abstract:
Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic output and poor transferability to downstream tasks. To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With those physical priors as control input, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach. Project page: https://dongyuluo.github.io/controltac.
Authors:Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula
Abstract:
In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physical-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: https://jumponthemoon.github.io/w-edit
Authors:Tal Gonen, Itai Pemper, Ilan Naiman, Nimrod Berman, Omri Azencot
Abstract:
Generative modeling of time series is a central challenge in time series analysis, particularly under data-scarce conditions. Despite recent advances in generative modeling, a comprehensive understanding of how state-of-the-art generative models perform under limited supervision remains lacking. In this work, we conduct the first large-scale study evaluating leading generative models in data-scarce settings, revealing a substantial performance gap between full-data and data-scarce regimes. To close this gap, we propose a unified diffusion-based generative framework that can synthesize high-fidelity time series across diverse domains using just a few examples. Our model is pre-trained on a large, heterogeneous collection of time series datasets, enabling it to learn generalizable temporal representations. It further incorporates architectural innovations such as dynamic convolutional layers for flexible channel adaptation and dataset token conditioning for domain-aware generation. Without requiring abundant supervision, our unified model achieves state-of-the-art performance in few-shot settings, outperforming domain-specific baselines across a wide range of subset sizes. Remarkably, it also surpasses all baselines even when tested on full-dataset benchmarks, highlighting the strength of pre-training and cross-domain generalization. We hope this work encourages the community to revisit few-shot generative modeling as a key problem in time series research and pursue unified solutions that scale efficiently across domains. Code is available at https://github.com/azencot-group/ImagenFew.
中文摘要:本研究提出了一种基于扩散的统一框架,通过跨领域预训练和架构创新,在极少数据条件下实现了生成高质量时间序列的最先进性能。
English Summary: This study introduces a unified diffusion-based framework that achieves state-of-the-art performance in generating high-fidelity time series with minimal data by leveraging cross-domain pre-training and architectural innovations.
Authors:Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
Abstract:
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
中文: 提出的HoPE方法通过混合频率分配和动态时间缩放机制增强视觉语言模型的长上下文处理能力,在长视频理解和检索任务中相比现有方法展现出更优性能。
English: The proposed HoPE method enhances Vision-Language Models' long-context capabilities through hybrid frequency allocation and dynamic temporal scaling, demonstrating superior performance in long video understanding and retrieval tasks compared to existing approaches.
Authors:Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei
Abstract:
When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods typically introduce KL divergence to constrain deviations between the trained model and the reference model; however, this may not be sufficient in certain applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between a trained model and the reference model using a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results across three open-source datasets: IMDb Dataset, Anthropic HH Dataset, and AlpacaEval, demonstrate the proposed method's superior performance in balancing alignment performance and model drift. Our code is opensourced at https://github.com/zlj123-max/Ra-DPO.
中文:Ra-DPO通过引入风险感知度量来优化大型语言模型的性能并控制模型偏移,在多个基准数据集上实现了更优的对齐效果。
English: Ra-DPO enhances LLM alignment by integrating risk-aware measures to optimize performance while controlling model drift, achieving superior results on benchmark datasets.
Authors:Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park
Abstract:
Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git
Chinese: 低秩适应(LoRA)因梯度纠缠问题在较高秩时存在过拟合,而粒度低秩适应(GraLoRA)通过将权重矩阵划分为子块有效解决了这一局限,在代码生成和常识推理基准测试中表现更优。
English: Low-Rank Adaptation (LoRA) faces overfitting issues at higher ranks due to gradient entanglement, but Granular Low-Rank Adaptation (GraLoRA) overcomes this by partitioning weight matrices into sub-blocks, achieving superior performance in benchmarks like code generation and commonsense reasoning.
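A compact sketch of the sub-block idea: the frozen weight is viewed as a g x g grid and every sub-block receives its own rank-r adapter, so gradients for unrelated input-channel groups no longer share one bottleneck. Initialization and hyperparameters are simplified assumptions, not GraLoRA's exact recipe.

```python
import torch
import torch.nn as nn

class BlockwiseLoRALinear(nn.Module):
    """Frozen linear layer with an independent low-rank adapter per sub-block."""
    def __init__(self, in_features: int, out_features: int, g: int = 2, r: int = 4):
        super().__init__()
        assert in_features % g == 0 and out_features % g == 0
        self.g, self.in_blk, self.out_blk = g, in_features // g, out_features // g
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(g, g, r, self.in_blk) * 0.01)
        self.B = nn.Parameter(torch.zeros(g, g, self.out_blk, r))   # zero-init: adapters start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (batch, in_features)
        xs = x.view(x.size(0), self.g, self.in_blk)                  # split input channels into groups
        out_blocks = []
        for i in range(self.g):                                      # output-channel group
            delta = sum(xs[:, j] @ self.A[i, j].t() @ self.B[i, j].t()
                        for j in range(self.g))                      # mix every input group
            out_blocks.append(delta)
        return self.base(x) + torch.cat(out_blocks, dim=1)

layer = BlockwiseLoRALinear(64, 64, g=2, r=4)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```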
Authors:Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, Ying Nian Wu
Abstract:
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with best generation output quality compared to other cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at https://github.com/NoakLiu/FastCache-xDiT.
中文摘要:FastCache是一种创新的缓存压缩框架,通过自适应过滤冗余令牌和复用潜在激活来加速扩散Transformer推理,在保持生成质量的同时显著降低计算成本。
English Summary: FastCache is a novel caching and compression framework that accelerates Diffusion Transformer inference by adaptively filtering redundant tokens and reusing latent activations, significantly reducing computational costs while maintaining generation quality.
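The timestep-level reuse can be illustrated with a small wrapper: if a block's input changed only negligibly since the previous timestep, return the cached output instead of recomputing. FastCache's actual decision rule is a hypothesis test with a learnable linear approximation; the relative-change threshold here is a simplified stand-in.

```python
import torch

class HiddenStateCache:
    """Reuse a transformer block's output across diffusion timesteps when its
    input barely changes."""
    def __init__(self, threshold: float = 1e-2):
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None

    def __call__(self, block, hidden: torch.Tensor) -> torch.Tensor:
        if self.prev_input is not None:
            rel_change = (hidden - self.prev_input).norm() / (self.prev_input.norm() + 1e-8)
            if rel_change < self.threshold:
                return self.prev_output           # cache hit: skip the block
        out = block(hidden)                        # recompute only when needed
        self.prev_input, self.prev_output = hidden.detach(), out.detach()
        return out

block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())
cache = HiddenStateCache(threshold=1e-2)
h = torch.randn(1, 16, 64)
out1 = cache(block, h)
out2 = cache(block, h + 1e-5 * torch.randn_like(h))  # nearly identical input -> cache hit
print(torch.allclose(out1, out2))  # True
```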
Authors:Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Abstract:
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
中文: BiomedSQL是首个专门评估生物医学知识库中文本转SQL系统科学推理能力的基准,结果显示现有模型性能与专家水平存在显著差距,为提升结构化数据推理支持科学发现奠定了基础。
English: BiomedSQL is a new benchmark designed to assess scientific reasoning in text-to-SQL systems for biomedical databases, revealing a significant performance gap where even the best models fall well below expert accuracy.
Authors:Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
Abstract:
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. To support the development of this universal form of LLM uncertainties, we publish our metric at https://github.com/apple/ml-selfreflect
English Summary: This research introduces SelfReflect, a novel metric for evaluating how accurately a string summarizes a large language model's internal uncertainty distribution over possible outputs, demonstrating its superiority over existing methods and its alignment with human judgment.
Authors:Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
Abstract:
LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
中文: 本文提出了一种终身安全对齐框架,通过元攻击器与防御器的竞争机制,使大语言模型持续适应新型越狱攻击,初始攻击成功率高达73%,经迭代训练后降至7%,从而提升开放环境下的部署安全性。
English: This paper introduces a lifelong safety alignment framework that uses a competitive Meta-Attacker and Defender to continuously adapt LLMs to evolving jailbreaking attacks, achieving a 73% attack success rate initially and reducing it to 7% after iterative training for safer deployment.
Authors:Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
Abstract:
Reasoning has improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from quality imbalance, which degrades PRM performance and highlights the need for data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses data selection methods and yields higher accuracy gains than existing test-time scaling approaches. Codes are available at https://github.com/coder-qicao/DreamPRM.
中文:DreamPRM是一种领域重加权训练框架,通过双层优化优先处理高质量数据,提升了多模态推理的泛化能力和性能。
English: DreamPRM is a domain-reweighted training framework that enhances multimodal reasoning by prioritizing high-quality data through bi-level optimization, improving generalization and performance across various benchmarks.
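As a rough illustration of the bi-level idea only: the sketch below fine-tunes on a weighted sum of per-domain losses (lower level) and nudges domain weights toward domains whose gradients align with a held-out meta set (upper level). The alignment-based weight update, `loss_fn`, and `log_w` are assumptions for illustration, not DreamPRM's actual aggregation loss.

```python
import torch

def domain_reweight_step(prm, domain_batches, meta_batch, log_w, opt, loss_fn, lr_w=0.1):
    """One hypothetical bi-level step: reweight per-domain losses, then move domain
    weights toward domains whose gradients align with the meta-set gradient (a simple
    stand-in for the paper's aggregation-loss-based update)."""
    params = [p for p in prm.parameters() if p.requires_grad]

    # Per-domain gradients, flattened, for the alignment-based weight update.
    domain_grads = []
    for batch in domain_batches:
        g = torch.autograd.grad(loss_fn(prm, batch), params)
        domain_grads.append(torch.cat([gi.detach().flatten() for gi in g]))

    # Meta-set gradient (the upper-level feedback signal).
    g_meta = torch.autograd.grad(loss_fn(prm, meta_batch), params)
    g_meta = torch.cat([gi.detach().flatten() for gi in g_meta])

    with torch.no_grad():  # upper level: reward domains aligned with the meta gradient
        align = torch.stack([torch.dot(g_meta, g_d) for g_d in domain_grads])
        log_w += lr_w * align / (align.abs().max() + 1e-8)

    # Lower level: one optimizer step on the domain-reweighted training loss.
    w = torch.softmax(log_w, dim=0)
    loss = sum(w[d] * loss_fn(prm, b) for d, b in enumerate(domain_batches))
    opt.zero_grad(); loss.backward(); opt.step()
    return w
```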
Authors:Pranav Poudel, Aavash Chhetri, Prashnna Gyawali, Georgios Leontidis, Binod Bhattarai
Abstract:
Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings, show consistent improvements over competitive baselines. The code and implementation details are available at: https://github.com/bhattarailab/FedFeatGen
中文摘要:本文提出了一种轻量级特征转换器,用于在医疗多模态联邦学习中重构缺失模态,在三个医疗数据集上均持续优于基线方法。
English Summary: The paper introduces a lightweight feature translator for reconstructing missing modalities in multimodal federated learning for healthcare, which consistently outperforms baseline methods across three medical datasets.
Authors:Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Abstract:
Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at https://github.com/maxdreyer/attributing-clip.
中文摘要:本文提出了一种可扩展的框架,通过结合归因修补和语义对齐来解读CLIP模型的内部机制,揭示了潜在组件如何驱动预测,并在多个模型变体中发现了令人惊讶的语义关联和隐藏的捷径。
English Summary: This paper introduces a scalable framework for interpreting CLIP models by combining attribution patching with semantic alignment to reveal how latent components drive predictions, uncovering surprising semantic associations and hidden shortcuts across multiple model variants.
Authors:Hao Kang, Zichun Yu, Chenyan Xiong
Abstract:
Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
中文:FLAME-MoE是一个完全开源的研究平台,提供采用专家混合架构的仅解码器模型,在比密集基线准确率提升最高3.4个百分点的同时,实现了对专家专业化和路由行为的透明化分析。
English: FLAME-MoE is a fully open-source research suite that provides decoder-only models with Mixture-of-Experts architecture, improving accuracy by up to 3.4 points over dense baselines while enabling transparent analysis of expert specialization and routing behavior.
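For readers unfamiliar with the routing pattern described (64 routed experts, top-8 gating, 2 shared experts), here is a generic sketch of such a layer. Dimensions and the per-expert loop are illustrative and are not taken from the FLAME-MoE codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k MoE feed-forward layer: 64 routed experts with top-8 gating plus
    2 always-active shared experts. Sizes are placeholders; the loop over experts is
    written for clarity, not efficiency."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=8, n_shared=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (n_tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # top-8 experts per token
        weights = F.softmax(weights, dim=-1)
        routed = torch.zeros_like(x)
        for e_id, expert in enumerate(self.experts):
            token_ids, slot = (idx == e_id).nonzero(as_tuple=True)  # tokens routed here
            if token_ids.numel() == 0:
                continue
            contrib = weights[token_ids, slot, None] * expert(x[token_ids])
            routed.index_add_(0, token_ids, contrib)
        return routed + sum(e(x) for e in self.shared)        # shared experts always on

# y = TopKMoELayer()(torch.randn(16, 512))
```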
Authors:Yixin Cui, Haotian Lin, Shuo Yang, Yixiao Wang, Yanjun Huang, Hong Chen
Abstract:
The rapid evolution of large language models in natural language processing has substantially elevated their semantic understanding and logical reasoning capabilities. Such proficiencies have been leveraged in autonomous driving systems, contributing to significant improvements in system performance. Models such as OpenAI o1 and DeepSeek-R1 leverage Chain-of-Thought (CoT) reasoning, an advanced cognitive method that simulates human thinking processes, demonstrating remarkable reasoning capabilities in complex tasks. By structuring complex driving scenarios within a systematic reasoning framework, this approach has emerged as a prominent research focus in autonomous driving, substantially improving the system's ability to handle challenging cases. This paper investigates how CoT methods improve the reasoning abilities of autonomous driving models. Based on a comprehensive literature review, we present a systematic analysis of the motivations, methodologies, challenges, and future research directions of CoT in autonomous driving. Furthermore, we propose combining CoT with self-learning to facilitate self-evolution in driving systems. To ensure the relevance and timeliness of this study, we have compiled a dynamic repository of literature and open-source projects, diligently updated to incorporate forefront developments. The repository is publicly available at https://github.com/cuiyx1720/Awesome-CoT4AD.
中文: 本文探讨了思维链推理如何通过提升语义理解和逻辑推理能力来增强自动驾驶系统,同时提出结合自学习以实现自我进化,并提供了持续更新的相关研究资源库。
English: This paper explores how Chain-of-Thought reasoning enhances autonomous driving systems by improving their semantic understanding and logical reasoning, while also proposing a self-learning integration for self-evolution and providing an updated repository of related research.
Authors:Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng
Abstract:
Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.
中文摘要:FLAME提出了一种细粒度列表对齐框架,通过顺序决策建模和基于潜在奖励的逐步优化,在大语言模型中实现了精准可控的药物推荐,在临床场景中展现出卓越的准确性和安全性。
English Summary: FLAME introduces a fine-grained list-wise alignment framework for large language models that optimizes medication recommendations through sequential decision-making and step-wise reward shaping, achieving state-of-the-art accuracy and safety in clinical applications.
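To make the "potential-based reward shaping" step concrete, here is a minimal sketch in which the potential penalizes known drug-drug interaction (DDI) pairs in the current prescription. The specific potential, penalty value, and names are illustrative assumptions, not FLAME's implementation; only the shaping formula r' = r + gamma*Phi(s') - Phi(s) is standard.

```python
# Hypothetical potential-based reward shaping for drug-by-drug list building.
from itertools import combinations

def phi(drug_set, ddi_pairs, penalty=1.0):
    """Potential of a partial prescription: negative count of known DDI pairs."""
    return -penalty * sum((a, b) in ddi_pairs or (b, a) in ddi_pairs
                          for a, b in combinations(sorted(drug_set), 2))

def shaped_reward(base_reward, state, next_state, ddi_pairs, gamma=1.0):
    """r' = r + gamma * Phi(s') - Phi(s); potential-based shaping preserves the optimal policy."""
    return base_reward + gamma * phi(next_state, ddi_pairs) - phi(state, ddi_pairs)

# Example: adding a drug that interacts with one already in the list is penalized.
ddi = {("warfarin", "aspirin")}
current = {"warfarin"}
print(shaped_reward(0.0, current, current | {"aspirin"}, ddi))    # -> -1.0
print(shaped_reward(0.0, current, current | {"metformin"}, ddi))  # -> 0.0
```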
Authors:Bingguang Hao, Maolin Wang, Zengzhuang Xu, Cunyin Peng, Yicheng Chen, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang
Abstract:
The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs' natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs' function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our repository at GitHub https://github.com/BingguangHao/FunReason
中文: FunReason是一个通过自动化数据优化和自优化多尺度损失方法增强大语言模型函数调用能力的新框架,在保持推理过程与函数调用准确性平衡的同时,实现了与GPT-4o相当的性能。
English: FunReason is a novel framework that enhances large language models' function calling capabilities through automated data refinement and a Self-Refinement Multiscale Loss approach, achieving performance comparable to GPT-4o while balancing reasoning processes with function call accuracy.
Authors:Chun-Yi Kuan, Hung-yi Lee
Abstract:
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds. We further extend our approach to multi-audio scenarios, enabling the model to either explain differences between audio inputs or produce unified captions that describe all inputs, thereby enhancing audio-language alignment. We refer to the entire ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.
Authors:Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester
Abstract:
Predictive Coding (PC) offers a biologically plausible alternative to backpropagation for neural network training, yet struggles with deeper architectures. This paper identifies the root cause: an inherent signal decay problem where gradients attenuate exponentially with depth, becoming computationally negligible due to numerical precision constraints. To address this fundamental limitation, we introduce Error Optimization (EO), a novel reparameterization that preserves PC's theoretical properties while eliminating signal decay. By optimizing over prediction errors rather than states, EO enables signals to reach all layers simultaneously and without attenuation, converging orders of magnitude faster than standard PC. Experiments across multiple architectures and datasets demonstrate that EO matches backpropagation's performance even for deeper models where conventional PC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling biologically-inspired learning to deeper architectures on digital hardware and beyond.
Chinese: 本文提出误差优化(EO)方法,通过重新参数化解决了预测编码中的信号衰减问题,使其在深层架构中收敛更快且性能与反向传播相当。
English: This paper introduces Error Optimization (EO), a reparameterization that eliminates the signal decay problem in Predictive Coding, enabling orders-of-magnitude faster convergence and matching backpropagation's performance in deep architectures.
Authors:Jin Zhu, Jingyi Li, Hongyi Zhou, Yinan Lin, Zhenhua Lin, Chengchun Shi
Abstract:
This paper focuses on the design of spatial experiments to optimize the amount of information derived from the experimental data and enhance the accuracy of the resulting causal effect estimator. We propose a surrogate function for the mean squared error (MSE) of the estimator, which facilitates the use of classical graph cut algorithms to learn the optimal design. Our proposal offers three key advances: (1) it accommodates moderate to large spatial interference effects; (2) it adapts to different spatial covariance functions; (3) it is computationally efficient. Theoretical results and numerical experiments based on synthetic environments and a dispatch simulator that models a city-scale ridesharing market, further validate the effectiveness of our design. A python implementation of our method is available at https://github.com/Mamba413/CausalGraphCut.
中文: 本文提出了一种新的均方误差替代函数,通过图割算法优化空间实验设计,有效处理空间干扰并适应不同协方差结构,从而显著提高因果效应估计的准确性。
English: This paper introduces a novel surrogate function for the mean squared error to optimize spatial experimental designs, enabling efficient graph cut algorithms to enhance estimator accuracy while accommodating spatial interference and various covariance structures.
Authors:Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, Michael A. Hedderich
Abstract:
Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.
Chinese: 本文提出ExPLAIND统一框架,整合模型组件、数据和训练轨迹视角,为模型行为和训练动态提供理论支持的解读,并通过分析Transformer中的"顿悟"现象等应用验证其有效性。
English: The paper introduces ExPLAIND, a unified framework that integrates model components, data, and training trajectory perspectives to provide theoretically grounded interpretations of model behavior and training dynamics, validated through applications like analyzing Grokking in Transformers.
Authors:Qiong Zhang, Yan Shuo Tan, Qinglong Tian, Pengfei Li
Abstract:
Hollmann et al. (Nature 637 (2025) 319-326) recently introduced TabPFN, a transformer-based deep learning model for regression and classification on tabular data, which they claim "outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time." Furthermore, they have called TabPFN a "foundation model" for tabular data, as it can support "data generation, density estimation, learning reusable embeddings and fine-tuning". If these statements are well-supported, TabPFN may have the potential to supersede existing modeling approaches on a wide range of statistical tasks, mirroring a similar revolution in other areas of artificial intelligence that began with the advent of large language models. In this paper, we provide a tailored explanation of how TabPFN works for a statistics audience, by emphasizing its interpretation as approximate Bayesian inference. We also provide more evidence of TabPFN's "foundation model" capabilities: We show that an out-of-the-box application of TabPFN vastly outperforms specialized state-of-the-art methods for semi-supervised parameter estimation, prediction under covariate shift, and heterogeneous treatment effect estimation. We further show that TabPFN can outperform LASSO at sparse regression and can break a robustness-efficiency trade-off in classification. All experiments can be reproduced using the code provided at https://github.com/qinglong-tian/tabpfn_study.
中文: TabPFN是一种基于Transformer的深度学习模型,专为表格数据设计,在回归和分类任务中表现卓越,以更少训练时间大幅超越现有方法,并具备数据生成和微调等基础模型能力。
English: TabPFN is a transformer-based deep learning model for tabular data that excels in regression and classification tasks, outperforming existing methods with less training time and demonstrating foundation model capabilities like data generation and fine-tuning.
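For context, the out-of-the-box usage discussed above follows the sklearn-style API of the public `tabpfn` package; a minimal sketch is below. Package version and defaults may differ from the paper's experimental setup.

```python
# Minimal TabPFN usage sketch (requires `pip install tabpfn`); dataset is just an example.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()   # prior-fitted transformer; no task-specific gradient training
clf.fit(X_tr, y_tr)        # "fit" essentially stores the training set as in-context data
print(accuracy_score(y_te, clf.predict(X_te)))
```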
Authors:Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen
Abstract:
Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
Chinese: MESS+ 是一种随机优化算法,通过实时学习模型性能并解决单请求优化问题,在保证严格服务等级协议合规的同时,实现了大型语言模型请求的成本最优路由。
English: MESS+ is a stochastic optimization algorithm that enables cost-effective routing of large language model requests while ensuring strict compliance with service level agreements by learning model performance in real-time and solving per-request optimization problems.
Authors:Huan Zhang, Fan Lyu, Shuyu Dong, Shenghua Fan, Yujin Zheng, Dingwen Wang
Abstract:
Continual Learning with Pre-trained Models (PTMs) holds great promise for efficient adaptation across sequential tasks. However, most existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters, limiting model plasticity and leading to suboptimal generalization when facing significant distribution shifts. While full fine-tuning can improve adaptability, it risks disrupting crucial pre-trained knowledge. In this paper, we propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters, less than 5%, based on sensitivity to mutual information objectives. MIST enables effective task-specific adaptation while preserving generalization. To further reduce interference, we introduce strong sparsity regularization by randomly dropping gradients during tuning, resulting in fewer than 0.5% of parameters being updated per step. Applied before standard freeze-based methods, MIST consistently boosts performance across diverse continual learning benchmarks. Experiments show that integrating our method into multiple baselines yields significant performance gains. Our code is available at https://github.com/zhwhu/MIST.
中文: 本文提出互信息指导的稀疏调优方法MIST,通过选择性更新预训练模型中不足5%的参数,在持续学习场景下既能有效适应任务又能保持模型的泛化能力。
English: This paper introduces Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates under 5% of pre-trained model parameters to enhance task adaptation while preserving generalization in continual learning scenarios.
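A minimal sketch of the two mechanisms described, selecting a small tunable subset of parameters and randomly dropping gradients at each step, is given below. The sensitivity score used here (a first-order saliency proxy, |grad * weight|) is an assumption standing in for the paper's mutual-information-based sensitivity.

```python
import torch

def build_sparse_masks(model, loss, keep_frac=0.05):
    """Hypothetical selection step: score each parameter by |grad * weight| (a stand-in
    for the mutual-information sensitivity) and keep the top `keep_frac` as tunable."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    masks = []
    for p, g in zip(params, grads):
        score = (g * p).abs().flatten()
        k = max(1, int(keep_frac * score.numel()))
        thresh = score.topk(k).values.min()
        masks.append((score >= thresh).view_as(p).float())
    return masks

def masked_sgd_step(model, loss, masks, lr=1e-3, grad_drop=0.9):
    """Update only selected parameters; additionally drop 90% of the surviving gradients
    at random, so well under 1% of parameters move on any single step."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g, m in zip(params, grads, masks):
            keep = (torch.rand_like(m) > grad_drop).float()
            p -= lr * g * m * keep
```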
Authors:Lukas Meyer, Andrei-Timotei Ardelean, Tim Weyrich, Marc Stamminger
Abstract:
We introduce FruitNeRF++, a novel fruit-counting approach that combines contrastive learning with neural radiance fields to count fruits from unstructured input photographs of orchards. Our work is based on FruitNeRF, which employs a neural semantic field combined with a fruit-specific clustering approach. The requirement for adaptation for each fruit type limits the applicability of the method, and makes it difficult to use in practice. To lift this limitation, we design a shape-agnostic multi-fruit counting framework, that complements the RGB and semantic data with instance masks predicted by a vision foundation model. The masks are used to encode the identity of each fruit as instance embeddings into a neural instance field. By volumetrically sampling the neural fields, we extract a point cloud embedded with the instance features, which can be clustered in a fruit-agnostic manner to obtain the fruit count. We evaluate our approach using a synthetic dataset containing apples, plums, lemons, pears, peaches, and mangoes, as well as a real-world benchmark apple dataset. Our results demonstrate that FruitNeRF++ is easier to control and compares favorably to other state-of-the-art methods.
Authors:Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang
Abstract:
Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but tends to lose the reflection ability and harm the performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Codes are available at https://github.com/hexuandeng/REA-RL.
中文摘要:提出的REA-RL方法通过引入反思模型进行高效在线训练并设计反思奖励,解决了大型推理模型的过度思考问题,在保持性能的同时显著提升35%的推理效率。
English Summary: The proposed REA-RL method addresses overthinking in Large Reasoning Models by combining a reflection model for efficient online training with a reflection reward, achieving 35% inference cost reduction while maintaining performance.
Authors:Sirui Chen, Shuqin Ma, Shu Yu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Abstract:
Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at https://github.com/OpenCausaLab/Awesome-LLM-Consciousness.
中文: 本文系统探讨了大型语言模型意识这一新兴领域,厘清了相关术语,整合了现有研究,并指出了潜在风险及未来研究方向。
English: This paper systematically explores the largely uncharted territory of LLM consciousness, clarifying terminology, synthesizing research, and addressing potential risks and future directions in the field.
Authors:Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Abstract:
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
中文: 本文提出ViT-P,一种两阶段分割框架,将掩码生成与分类解耦,利用基于视觉变换器的模型提高精度和适应性,无需预训练,在多个数据集上实现了最先进的性能。
English: This paper introduces ViT-P, a two-stage segmentation framework that decouples mask generation from classification, using a Vision Transformer-based model to enhance accuracy and adaptability without pre-training, achieving state-of-the-art results across multiple datasets.
Authors:Mobina Mansoori, Sajjad Shahabodini, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Abstract:
Foundation models are large-scale models, pre-trained on massive datasets, that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-of-the-art foundation models, DINOv2, MAE, VMamba, CoCa, SAM2, and AIMv2, for medical image classification. We explore their effectiveness on datasets including CBIS-DDSM for mammography, ISIC2019 for skin lesions, APTOS2019 for diabetic retinopathy, and CHEXPERT for chest radiographs. By fine-tuning these models and evaluating their configurations, we aim to understand the potential of these advancements in medical image classification. The results indicate that these advanced models significantly enhance classification outcomes, demonstrating robust performance despite limited labeled data. Based on our results, AIMv2, DINOv2, and SAM2 models outperformed others, demonstrating that progress in natural domain training has positively impacted the medical domain and improved classification outcomes. Our code is publicly available at: https://github.com/sajjad-sh33/Medical-Transfer-Learning.
中文: 本研究评估了DINOv2和AIMv2等先进基础模型在医学图像数据集上的表现,结果表明这些模型显著提升了分类准确性和鲁棒性,即使在标注数据有限的情况下也表现优异。
English: This study evaluates advanced foundation models like DINOv2 and AIMv2 on medical image datasets, showing they significantly improve classification accuracy and robustness, even with limited labeled data.
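The general transfer-learning recipe evaluated here (pre-trained natural-image backbone plus a small classification head) can be sketched as follows, using DINOv2 via torch.hub as one example backbone. Dataset handling is omitted and the hyperparameters, head size, and linear-probing choice are illustrative assumptions, not the study's exact configuration.

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")  # ViT-B/14, 768-d features
head = nn.Linear(768, 5)              # e.g., 5 diabetic-retinopathy grades (APTOS2019)

# Linear probing variant: freeze the backbone, train only the head
# (full fine-tuning would instead leave backbone parameters trainable).
for p in backbone.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

def step(images, labels):             # images: (B, 3, 224, 224), normalized
    with torch.no_grad():
        feats = backbone(images)      # (B, 768) global features
    loss = nn.functional.cross_entropy(head(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```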
Authors:Patara Trirat, Wonyong Jeong, Sung Ju Hwang
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.
Chinese: 本文提出Agentic Predictor,一种轻量级预测器,通过多视图编码和跨域预训练技术,能高效评估并选择最优的大语言模型智能体工作流配置,显著减少昂贵试错评估需求,同时提升预测准确性和工作流效用。
English: This paper introduces Agentic Predictor, a lightweight tool that uses multi-view encoding and cross-domain pretraining to efficiently evaluate and select optimal LLM-based agentic workflow configurations, reducing the need for costly trial-and-error evaluations while improving accuracy and utility.
Authors:Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Abstract:
With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
中文: 提出的微令牌级接受-拒绝对齐(MARA)方法通过将句子级偏好学习转化为令牌级二元分类,在多个模型和数据集上显著提升了对齐性能并降低了计算成本。
English: The proposed Micro token-level Accept-Reject Aligning (MARA) method efficiently aligns large language models with human preferences by converting sentence-level learning into token-level classification, significantly improving performance while reducing computational costs across multiple models and datasets.
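To visualize what a "compact three-layer fully-connected network" for token-level accept/reject decisions might look like, here is a sketch. The input features (context hidden state concatenated with the candidate token embedding) and all sizes are assumptions for illustration, not the released MARA architecture.

```python
import torch
import torch.nn as nn

class AcceptRejectHead(nn.Module):
    """Illustrative three-layer MLP that scores a candidate token given the decoding
    context; hypothetical feature choice and dimensions."""
    def __init__(self, d_hidden=4096, d_embed=4096, d_mid=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_hidden + d_embed, d_mid), nn.ReLU(),
            nn.Linear(d_mid, d_mid), nn.ReLU(),
            nn.Linear(d_mid, 1),
        )

    def accept(self, context_state, candidate_embed):
        logit = self.net(torch.cat([context_state, candidate_embed], dim=-1))
        return torch.sigmoid(logit) > 0.5   # True = "Accepted", False = "Rejected"
```

In such a setup, a rejected candidate would simply be resampled from the base LLM at decoding time; since only this small head is trained, the billion-parameter LLM itself stays untouched, which is the efficiency argument the abstract makes.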
Authors:Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
Abstract:
Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at https://github.com/Wings-Of-Disaster/Mosaic.
中文: 本文提出了Mosaic框架,通过训练本地生成模型合成数据并将基于客户端模型构建的专家混合模型蒸馏为全局模型,有效解决了联邦学习中的模型与数据异构性问题,实验证明其性能优于现有先进方法。
English: This paper introduces Mosaic, a data-free knowledge distillation framework that addresses model and data heterogeneity in Federated Learning by training local generative models for privacy-preserving synthetic data generation and distilling a Mixture-of-Experts built from client models into a global model, outperforming state-of-the-art methods in experiments.
Authors:Xinrui Wang, Shao-yuan Li, Jiaqiang Zhang, Songcan Chen
Abstract:
Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, they all overlook label-specific region identification and feature learning - a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by further identifying, strengthening and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address catastrophic forgetting, missing labels, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method. The code is available at https://github.com/wxr99/Cut-Replay
中文: 本文提出CUTER策略,通过识别和回放标签特定区域,有效解决了多标签在线持续学习中的灾难性遗忘、标签缺失和类别不平衡问题,在多个基准测试中表现出优越性能。
English: This paper introduces CUTER, a novel strategy for Multi-Label Online Continual Learning that addresses catastrophic forgetting, missing labels, and class imbalance by identifying and replaying label-specific regions, demonstrating superior performance across multiple benchmarks.
Authors:Jiawen Chen, Qi Shao, Duxin Chen, Wenwu Yu
Abstract:
Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a novel framework that decouples temporal and spatial modeling to enhance both efficiency and precision. Therein, the temporal dimension is modeled using lightweight large language models, which effectively capture low-rank temporal dynamics. Concurrently, the spatial dimension is addressed through an adaptive hypergraph neural network, which dynamically constructs hyperedges to model intricate, higher-order interactions. A carefully designed gating mechanism is integrated to seamlessly fuse temporal and spatial representations. By leveraging the fundamental principles of low-rank temporal dynamics and spatial interactions, STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. Extensive experiments on large-scale real-world datasets across multiple benchmarks demonstrate the effectiveness of STH-SepNet in boosting predictive performance while maintaining computational efficiency. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands while enhancing predictive performance. Our code is available at https://github.com/SEU-WENJIA/ST-SepNet-Lightweight-LLMs-Meet-Adaptive-Hypergraphs.
中文: STH-SepNet提出了一种创新框架,通过轻量级大语言模型处理时间动态和自适应超图神经网络捕捉空间交互,实现了时空解耦建模,在多个基准测试中显著提升了预测性能并保持了计算效率。
English: STH-SepNet introduces a novel framework that decouples temporal and spatial modeling using lightweight large language models for temporal dynamics and adaptive hypergraph neural networks for spatial interactions, achieving enhanced predictive performance and computational efficiency across multiple benchmarks.
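The fusion step between the temporal and spatial branches can be pictured with a simple learned gate, as sketched below. The sigmoid-gated convex combination is an assumption about the design, not the paper's exact gating mechanism.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gate that fuses a temporal and a spatial representation of the
    same node/time step; hypothetical form of the paper's gating mechanism."""
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_temporal, h_spatial):        # both: (..., d)
        g = torch.sigmoid(self.gate(torch.cat([h_temporal, h_spatial], dim=-1)))
        return g * h_temporal + (1 - g) * h_spatial  # per-feature convex combination
```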
Authors:Ho Hin Lee, Quan Liu, Shunxing Bao, Yuankai Huo, Bennett A. Landman
Abstract:
In contrast to vision transformers, which model long-range dependencies through global self-attention, large kernel convolutions provide a more efficient and scalable alternative, particularly in high-resolution 3D volumetric settings. However, naively increasing kernel size often leads to optimization instability and degradation in performance. Motivated by the spatial bias observed in effective receptive fields (ERFs), we hypothesize that different kernel elements converge at variable rates during training. To support this, we derive a theoretical connection between element-wise gradients and first-order optimization, showing that structurally re-parameterized convolution blocks inherently induce spatially varying learning rates. Building on this insight, we introduce Rep3D, a 3D convolutional framework that incorporates a learnable spatial prior into large kernel training. A lightweight two-stage modulation network generates a receptive-biased scaling mask, adaptively re-weighting kernel updates and enabling local-to-global convergence behavior. Rep3D adopts a plain encoder design with large depthwise convolutions, avoiding the architectural complexity of multi-branch compositions. We evaluate Rep3D on five challenging 3D segmentation benchmarks and demonstrate consistent improvements over state-of-the-art baselines, including transformer-based and fixed-prior re-parameterization methods. By unifying spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable, and scalable solution for 3D medical image analysis. The source code is publicly available at https://github.com/leeh43/Rep3D.
Chinese: Rep3D提出了一种带有可学习空间先验的3D卷积框架,通过自适应重加权大核训练中的内核更新,在3D医学图像分割任务中实现了优于现有方法的性能与可扩展性。
English: Rep3D introduces a 3D convolutional framework with a learnable spatial prior that adaptively re-weights kernel updates during large kernel training, achieving improved performance and scalability over existing methods in 3D medical image segmentation.
Authors:Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Abstract:
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
中文摘要:提出的Intuitor框架通过将模型自身置信度作为内在奖励信号,使大语言模型能够在无需外部监督的情况下学习复杂推理,在保持与监督方法相当性能的同时,实现了跨领域任务的更优泛化能力。
English Summary: The proposed Intuitor framework enables large language models to learn complex reasoning through self-certainty as an intrinsic reward signal, achieving comparable performance to supervised methods while demonstrating superior generalization across domains without requiring external supervision.
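To ground the idea of "confidence as the sole reward", the sketch below scores a generated response by how far its next-token distributions are from uniform and standardizes rewards within a group of samples (the group-relative baseline used in GRPO). The exact definition of self-certainty in the paper may differ; this is one plausible instantiation.

```python
import torch
import torch.nn.functional as F

def self_certainty(logits):
    """One plausible confidence-only reward: average KL(p || Uniform) over positions,
    i.e., log(V) minus the entropy of each next-token distribution."""
    logp = F.log_softmax(logits, dim=-1)               # (seq_len, vocab)
    entropy = -(logp.exp() * logp).sum(-1)
    log_vocab = torch.log(torch.tensor(float(logits.size(-1))))
    return (log_vocab - entropy).mean()

def group_relative_advantages(rewards):
    """Standardize intrinsic rewards within a group of responses to the same prompt."""
    r = torch.stack(rewards)
    return (r - r.mean()) / (r.std() + 1e-6)
```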
Authors:Yuan Feng, Yukun Cao, Hairu Wang, Xike Xie, S Kevin Zhou
Abstract:
Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy trade-offs.However, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the {\it Lego sketch}, a novel sketch with superior scalability and accuracy. Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains. Our theoretical analysis guarantees its high scalability and provides the first error bound for neural sketch. Furthermore, extensive experimental evaluations demonstrate that the Lego sketch exhibits superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches. Our code is available at https://github.com/FFY0/LegoSketch_ICML.
中文: Lego sketch 提出了一种可扩展的记忆增强神经网络架构,通过动态协调模块化记忆块,实现了跨数据域和空间预算的卓越适应性,在空间-精度权衡上优于现有草图方法。
English: The Lego sketch introduces a scalable memory-augmented neural network architecture that dynamically coordinates modular memory bricks to achieve superior adaptability across data domains and space budgets, outperforming existing sketches in space-accuracy trade-offs.
Authors:Zhaowei Zhang, Minghua Yi, Mengmeng Wang, Fengshuo Bai, Zilong Zheng, Yipeng Kang, Yaodong Yang
Abstract:
Achieving political consensus is crucial yet challenging for the effective functioning of social governance. However, although frontier AI systems represented by large language models (LLMs) have developed rapidly in recent years, their capabilities in this area are still understudied. In this paper, we introduce EuroCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament over 13 years, ranging from 2009 to 2022, to evaluate the ability of LLMs to reach political consensus among divergent party positions across diverse parliament settings. Specifically, EuroCon incorporates four factors to build each simulated parliament setting: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also develop an evaluation framework for EuroCon to simulate real voting outcomes in different parliament settings, assessing whether LLM-generated resolutions meet predefined political goals. Our experimental results demonstrate that even state-of-the-art models still perform unsatisfactorily on complex tasks such as passing resolutions by a two-thirds majority and addressing security issues, while revealing some common strategies LLMs use to find consensus under different power structures, such as prioritizing the stance of the dominant party, highlighting EuroCon's promise as an effective platform for studying LLMs' ability to find political consensus.
Authors:Yiyun Zhou, Zheqi Lv, Shengyu Zhang, Jingyuan Chen
Abstract:
Knowledge Tracing (KT) is a core component of Intelligent Tutoring Systems, modeling learners' knowledge state to predict future performance and provide personalized learning support. Traditional KT models assume that learners' learning abilities remain relatively stable over short periods or change in predictable ways based on prior performance. However, in reality, learners' abilities change irregularly due to factors like cognitive fatigue, motivation, and external stress -- a challenge we introduce and refer to as Real-time Learning Pattern Adjustment (RLPA). Existing KT models, when faced with RLPA, lack sufficient adaptability, because they fail to account in a timely manner for the dynamic nature of different learners' evolving learning patterns. Current strategies for enhancing adaptability rely on retraining, which leads to significant overfitting and high time overhead issues. To address this, we propose Cuff-KT, comprising a controller and a generator. The controller assigns value scores to learners, while the generator generates personalized parameters for selected learners. Cuff-KT adapts to data changes quickly, flexibly, and controllably without fine-tuning. Experiments on five datasets from different subjects demonstrate that Cuff-KT significantly improves the performance of five KT models with different structures under intra- and inter-learner shifts, with an average relative increase in AUC of 10% and 4%, respectively, at a negligible time cost, effectively tackling the RLPA task. Our code and datasets are fully available at https://github.com/zyy-2001/Cuff-KT.
Chinese: 知识追踪模型通常假设学习能力稳定,但现实因素导致其不规则变化,提出的Cuff-KT通过动态适应学习者变化模式而无需重新训练,显著提升了多数据集上的性能表现。
English: Knowledge Tracing models traditionally assume stable learning abilities, but real-world factors cause irregular changes, which the proposed Cuff-KT addresses by dynamically adapting to learners' evolving patterns without retraining, significantly improving performance across datasets.
Authors:Yifan Jia, Kailin Jiang, Yuyang Liang, Qihan Ren, Yi Xin, Rui Yang, Fenze Feng, Mingcai Chen, Hengyang Lu, Haozhe Wang, Xiaoye Qu, Dongrui Liu, Lizhen Cui, Yuntao Du
Abstract:
Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely uninvestigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at https://github.com/MLLMKCBENCH/MLLMKC.
中文: 大型多模态模型在检索增强生成中面临知识冲突问题,为此我们提出MMKC-Bench基准进行评估,发现现有模型倾向于依赖内部参数知识而非外部证据。
English: Large Multimodal Models struggle with knowledge conflicts in retrieval-augmented generation, so we introduce MMKC-Bench to evaluate these scenarios and find that models often prioritize internal knowledge over external evidence.
Authors:Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen
Abstract:
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.
中文: 本文提出防御性输出生成(DOGe)策略,通过微调教师大语言模型的最后一层,使其输出在保持对合法用户有效的同时,严重破坏基于知识蒸馏的模型模仿效果。
English: This paper introduces Defensive Output Generation (DOGe), an efficient strategy that fine-tunes the final layer of a teacher LLM to subtly alter its outputs, preserving utility for legitimate users while severely degrading performance in distillation-based imitation attempts.
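The training setup described (only the final linear layer is trainable, with a combined utility/adversarial objective) can be sketched as follows, assuming a Hugging Face-style causal LM. The `proxy_student_loss_fn` term and the weighting are placeholders; the actual adversarial loss design is specific to the paper.

```python
import torch

def prepare_defended_teacher(model):
    """Freeze everything except the final linear (output) layer of an HF-style LM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.get_output_embeddings().parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)

def training_step(model, batch, proxy_student_loss_fn, opt, lam=0.5):
    out = model(**batch)                                 # standard LM forward with labels
    utility_loss = out.loss                              # keep answers useful for real users
    adv_loss = proxy_student_loss_fn(out.logits, batch)  # placeholder anti-distillation term
    loss = utility_loss - lam * adv_loss                 # make outputs unhelpful to imitators
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```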
Authors:Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
Abstract:
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in https://github.com/pprp/ACBench.
中文: 训练后压缩可降低大语言模型的计算与内存成本,但现有基准仅关注语言建模而忽略智能体能力,因此ACBench作为首个全面评估压缩对智能体能力影响的基准,涵盖12项任务与多种压缩技术,实验表明4位量化能保持工作流生成与工具使用功能,但会降低实际应用准确性10%-15%。
English: Post-training compression enables efficient deployment of large language models but existing benchmarks overlook agentic capabilities, prompting the introduction of ACBench to comprehensively evaluate compression impacts on tasks like workflow generation and tool use, revealing tradeoffs such as preserved functionality in quantization but degraded real-world accuracy.
Authors:Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen
Abstract:
The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
中文: WINA是一种无需训练的新型稀疏激活框架,通过联合考虑隐藏状态幅度和权重矩阵范数,在相同稀疏度下比现有方法性能提升最高达2.94%,为LLM推理设立了新的性能基准。
English: WINA is a novel training-free sparse activation framework that jointly considers hidden state magnitudes and weight matrix norms to achieve superior approximation accuracy and outperform existing methods by up to 2.94% across various LLMs and datasets.
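As a reading aid for the WINA entry above, here is a minimal sketch of the activation criterion the abstract describes: score each hidden-state entry by its magnitude times the column-wise L2 norm of the downstream weight matrix, keep the top-scoring entries, and zero the rest before the matrix multiply. The keep ratio, layer sizes, and use of NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np

def wina_sparse_activation(x, W, keep_ratio=0.3):
    """Zero out low-scoring entries of the hidden state x before computing W @ x.

    Scores combine hidden-state magnitude with the column-wise L2 norm of W,
    following the criterion described in the abstract; keep_ratio and the
    layer shapes below are illustrative only.
    """
    col_norms = np.linalg.norm(W, axis=0)      # ||W[:, j]||_2 for each input dim j
    scores = np.abs(x) * col_norms             # joint magnitude/weight score
    k = max(1, int(keep_ratio * x.shape[0]))
    keep = np.argsort(scores)[-k:]             # indices of the top-k scores
    x_sparse = np.zeros_like(x)
    x_sparse[keep] = x[keep]
    return W @ x_sparse                        # sparse approximation of W @ x

# toy usage: compare the sparse output against the dense one
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))
x = rng.normal(size=256)
y_dense, y_sparse = W @ x, wina_sparse_activation(x, W)
print(np.linalg.norm(y_dense - y_sparse) / np.linalg.norm(y_dense))
```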
Authors:Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Abstract:
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
Chinese: 本研究针对阿姆哈拉语开发了专门的密集检索模型,相比现有多语言基线实现了高达17.6%的性能提升,同时模型体积大幅缩小,并公开了全部数据集与代码以推动低资源信息检索研究。
English: This research introduces Amharic-specific dense retrieval models that significantly outperform existing multilingual baselines, achieving up to 17.6% improvement in retrieval metrics while being substantially more compact, with all resources made publicly available to advance low-resource information retrieval.
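The entry above reports MRR@10 and Recall@10. For readers unfamiliar with these retrieval metrics, the snippet below computes both for a single query given a ranked list of document ids and the set of relevant ids; the toy runs at the bottom are made up for illustration and have nothing to do with the Amharic benchmark itself.

```python
import numpy as np

def mrr_and_recall_at_k(ranked_ids, relevant_ids, k=10):
    """ranked_ids: doc ids ordered by score for one query;
    relevant_ids: set of relevant doc ids for that query."""
    top_k = ranked_ids[:k]
    rr = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            rr = 1.0 / rank          # reciprocal rank of the first relevant hit
            break
    recall = len(set(top_k) & relevant_ids) / max(1, len(relevant_ids))
    return rr, recall

# toy usage over two queries
runs = [(["d3", "d7", "d1"], {"d1"}), (["d2", "d9"], {"d4"})]
scores = [mrr_and_recall_at_k(r, rel, k=10) for r, rel in runs]
print("MRR@10:", np.mean([s[0] for s in scores]),
      "Recall@10:", np.mean([s[1] for s in scores]))
```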
Authors:Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan
Abstract:
Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at https://github.com/xl1990/Astra.
中文:ASTRA是一种通信高效的框架,通过整合序列并行和混合精度注意力机制来加速Transformer推理,在低带宽下实现显著加速,同时通过量化优化保持任务准确性。
English: ASTRA is a communication-efficient framework that accelerates Transformer inference by integrating sequence parallelism and a Mixed-Precision Attention mechanism, achieving significant speedups under low bandwidth while maintaining accuracy through quantization optimizations.
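ASTRA's communication saving comes from compressing non-local token embeddings with vector quantization so that only small integer indices cross the device boundary. The sketch below shows plain nearest-neighbor codebook quantization only; the codebook size is an arbitrary assumption, and ASTRA's Noise-Augmented Quantization and Distributed Class Tokens are not reproduced here.

```python
import numpy as np

def quantize_embeddings(embeddings, codebook):
    """Map each token embedding to the index of its nearest codebook entry.

    Only the integer indices would need to be transmitted; the receiver
    reconstructs approximate embeddings by codebook lookup."""
    # pairwise squared distances between embeddings (N, d) and codebook (K, d)
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def dequantize(indices, codebook):
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 entries -> one byte per token
tokens = rng.normal(size=(32, 64))      # 32 "non-local" token embeddings
idx = quantize_embeddings(tokens, codebook)
approx = dequantize(idx, codebook)
print("bytes sent:", idx.size, "vs float32 bytes:", tokens.size * 4)
```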
Authors:Jimeng Shi, Sizhe Zhou, Bowen Jin, Wei Hu, Runchu Tian, Shaowen Wang, Giri Narasimhan, Jiawei Han
Abstract:
Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG) has shown its high promise, empowering LLMs to generate more qualified responses with retrieved external data and knowledge. However, most RAG methods retrieve relevant documents based on either sparse or dense retrieval methods or their combinations, which overlooks the essential, multi-dimensional, and structured semantic information present in documents. This structured information plays a critical role in finding concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). In this work, we introduce a multi-dimensional (cube) structure, Hypercube, which can index and allocate documents in a pre-defined multi-dimensional space. Built on the hypercube, we further propose Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities, phrases, and topics along with pre-defined hypercube dimensions, and then retrieves relevant documents from cubes by aligning these decomposed components with corresponding dimensions. Experiments on three datasets across different domains demonstrate that our method improves response accuracy by 3.7% and retrieval accuracy by 5.3% over the strongest RAG baseline. It also boosts retrieval efficiency (speed) by one or two magnitudes faster than graph-based RAG. Notably, our Hypercube-RAG inherently offers explainability by revealing those underlying dimensions used for retrieval. The code and data are available at https://github.com/JimengShi/Hypercube-RAG.
中文: Hypercube-RAG通过构建多维索引结构,将查询分解并与预设语义维度对齐进行文档检索,在多个领域数据集上实现了响应准确率3.7%的提升以及检索效率数量级的加速。
English: Hypercube-RAG introduces a multi-dimensional indexing structure to enhance retrieval-augmented generation by decomposing queries and aligning them with structured semantic dimensions, achieving significant improvements in accuracy and efficiency over existing methods.
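To make the hypercube idea concrete, here is a toy index that stores document ids in cells keyed by (dimension, value) and retrieves by intersecting the cells matched by a decomposed query. The dimension names, the example documents, and the assumption that query decomposition happens upstream (e.g., via an LLM or an NER step) are all illustrative; the paper's actual indexing and alignment procedure may differ.

```python
from collections import defaultdict

class ToyHypercubeIndex:
    """Index documents along pre-defined dimensions (e.g. entity, topic).

    Retrieval intersects the candidate sets of each decomposed query
    component, mirroring the 'decompose, then align with dimensions' idea
    at a toy scale."""
    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.cells = defaultdict(set)          # (dimension, value) -> doc ids

    def add(self, doc_id, annotations):
        for dim, values in annotations.items():
            for v in values:
                self.cells[(dim, v)].add(doc_id)

    def retrieve(self, query_components):
        candidates = None
        for dim, v in query_components:
            docs = self.cells.get((dim, v), set())
            candidates = docs if candidates is None else candidates & docs
        return candidates or set()

index = ToyHypercubeIndex(["entity", "topic"])
index.add("doc1", {"entity": {"Mississippi River"}, "topic": {"flooding"}})
index.add("doc2", {"entity": {"Mississippi River"}, "topic": {"navigation"}})
print(index.retrieve([("entity", "Mississippi River"), ("topic", "flooding")]))
```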
Authors:Hossein Zaremehrjerdi, Shreyan Ganguly, Ashlyn Rairdin, Elizabeth Tranel, Benjamin Feuer, Juan Ignacio Di Salvo, Srikanth Panthulugiri, Hernan Torres Pacin, Victoria Moser, Sarah Jones, Joscif G Raigne, Yanben Shen, Heidi M. Dornath, Aditya Balu, Adarsh Krishnamurthy, Asheesh K Singh, Arti Singh, Baskar Ganapathysubramanian, Chinmay Hegde, Soumik Sarkar
Abstract:
Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: https://baskargroup.github.io/Ag_reasoning/
中文: 农业决策需要细致推理,传统语言模型难以胜任,但大型推理模型在新基准AgReason上表现更优,而AgThoughts数据集则支持开发适用于消费级硬件的紧凑型AgThinker模型。
English: Agricultural decision-making requires nuanced reasoning that traditional language models struggle with, but large reasoning models show improved performance on the new AgReason benchmark, while the AgThoughts dataset enables the development of compact AgThinker models for consumer hardware.
Authors:Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Abstract:
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms.
We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools.
Chinese Summary: VTool-R1是首个通过整合文本与基于Python工具的视觉推理步骤来训练视觉语言模型生成多模态思维链的框架,无需过程监督即可显著提升视觉任务中的推理准确性。
English Summary: VTool-R1 is the first framework that trains vision-language models to produce multimodal chains of thought by integrating text with visual reasoning steps using Python-based tools, significantly improving reasoning accuracy on visual tasks without process supervision.
Authors:Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Abstract:
Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. While prior reviews have addressed these issues, they often focus on individual limitations or consider them within the broader context of evaluating overall model performance. This survey addresses the gap by presenting a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2025, using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we extract 14,648 relevant limitation papers using keyword filtering and LLM-based classification, validated against expert labels. Using topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM), we identify between 7 and 15 prominent types of limitations discussed in recent LLM research across the ACL and arXiv datasets. We find that LLM-related research increases nearly sixfold in ACL and nearly fifteenfold in arXiv between 2022 and 2025, while LLLMs research grows even faster, by a factor of over 12 in ACL and nearly 28 in arXiv. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2025. We offer a quantitative view of trends in LLM limitations research and release a dataset of annotated abstracts and a validated methodology, available at: https://github.com/a-kostikova/LLLMs-Survey.
中文摘要:本综述通过数据驱动方法分析了2022至2025年间大语言模型的局限性,发现推理能力是最受关注的研究短板,同时揭示了相关研究的加速增长态势及不同学术数据集中的研究主题演变。
English Summary: This survey provides a data-driven analysis of large language model limitations from 2022-2025, identifying reasoning as the most studied constraint while documenting accelerated research growth and evolving focus areas across academic datasets.
Authors:Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Abstract:
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.
中文: 视觉语言模型存在高推理成本问题,本研究揭示了令牌与神经元稀疏性之间的相互增强关系,提出的CoreMatching框架通过协同优化显著提升了效率,在多项任务和硬件上实现了计算量削减和加速效果。
English: Vision-Language Models face high inference costs, and this study reveals a mutual reinforcement between token and neuron sparsity, leading to the CoreMatching framework that significantly boosts efficiency with demonstrated speedups and computational reductions.
Authors:Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Abstract:
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
Chinese: 本文提出方差缩减偏好优化(VRPO)框架,通过分析ELBO估计器的方差并引入无偏方差缩减策略,有效解决了掩码扩散模型对齐中的高方差问题,由此开发的LLaDA 1.5模型在数学、代码和对齐基准测试中均展现出显著性能提升。
English: This paper introduces Variance-Reduced Preference Optimization (VRPO), a novel framework that addresses the high variance in ELBO-based likelihood estimates to effectively align Masked Diffusion Models like LLaDA with human preferences, resulting in the enhanced LLaDA 1.5 model which demonstrates significant performance improvements across multiple benchmarks.
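One of the variance-reduction devices named in the abstract, antithetic sampling, is a generic Monte Carlo technique and can be demonstrated in isolation: evaluate the integrand on mirrored noise pairs (z, -z) so the two halves are negatively correlated. The toy integrand below is arbitrary, and this sketch says nothing about how VRPO allocates its Monte Carlo budget over the ELBO's masking ratios.

```python
import numpy as np

def mc_estimate(f, n, rng, antithetic=False):
    """Monte Carlo estimate of E[f(Z)], Z ~ N(0, 1), optionally with antithetic pairs."""
    if antithetic:
        z = rng.standard_normal(n // 2)
        samples = np.concatenate([f(z), f(-z)])   # negatively correlated pairs
    else:
        samples = f(rng.standard_normal(n))
    return samples.mean()

f = lambda z: np.exp(0.5 * z)    # any monotone integrand benefits from antithetic pairs
plain = [mc_estimate(f, 64, np.random.default_rng(s)) for s in range(2000)]
anti = [mc_estimate(f, 64, np.random.default_rng(s), antithetic=True) for s in range(2000)]
print("variance plain:", np.var(plain), "antithetic:", np.var(anti))
```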
Authors:Shiyue Wang, Haozheng Xu, Yuhan Zhang, Jingran Lin, Changhong Lu, Xiangfeng Wang, Wenhao Li
Abstract:
Multi-Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics, requiring the computation of collision-free paths for multiple agents navigating from their start locations to designated goals. As autonomous systems become increasingly prevalent in warehouses, urban transportation, and other complex environments, MAPF has evolved from a theoretical challenge to a critical enabler of real-world multi-robot coordination. This comprehensive survey bridges the long-standing divide between classical algorithmic approaches and emerging learning-based methods in MAPF research. We present a unified framework that encompasses search-based methods (including Conflict-Based Search, Priority-Based Search, and Large Neighborhood Search), compilation-based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data-driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through systematic analysis of experimental practices across 200+ papers, we uncover significant disparities in evaluation methodologies, with classical methods typically tested on larger-scale instances (up to 200 by 200 grids with 1000+ agents) compared to learning-based approaches (predominantly 10-100 agents). We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, highlighting the need for standardized benchmarking protocols. Finally, we outline promising future directions including mixed-motive MAPF with game-theoretic considerations, language-grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves as both a comprehensive reference for researchers and a practical guide for deploying MAPF solutions in increasingly complex real-world applications.
中文: 这篇多智能体路径规划(MAPF)综述弥合了经典算法与新兴学习方法之间的鸿沟,通过系统分析评估实践揭示了标准化基准的必要性,并展望了博弈论规划与神经求解器等未来方向。
English: This comprehensive survey on Multi-Agent Path Finding (MAPF) unifies classical algorithmic approaches with emerging learning-based methods, providing a systematic analysis of evaluation practices while highlighting the need for standardized benchmarking and outlining future directions like game-theoretic planning and neural solver architectures.
Authors:Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
Abstract:
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
中文:提出的I2MoE框架通过专门设计的交互专家显式建模多样化多模态交互,并在局部和全局层面提供可解释性,从而提升多模态融合效果,在多个数据集上验证了其性能优势。
English: The proposed I2MoE framework enhances multimodal fusion by explicitly modeling diverse interactions through specialized experts and providing interpretability at both local and global levels, demonstrating improved performance across various datasets.
Authors:Pradyumna Shyama Prasad, Minh Nhat Nguyen
Abstract:
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles.
Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence
中文摘要:本研究表明大型语言模型在对抗性辩论中表现出系统性过度自信且无法恰当调整置信度,对其在动态多轮任务中的可靠性提出了重要质疑。
English Summary: This study reveals that large language models exhibit systematic overconfidence and fail to properly adjust their confidence in adversarial debates, raising concerns about their reliability in dynamic, multi-turn tasks.
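Finding (3) in the abstract, mutual overestimation, is easy to check from confidence logs: flag any debate where both sides claim at least 75% win probability in the same round. The helper below assumes a simple log format of per-round (side A, side B) confidence pairs, which is an assumption about bookkeeping rather than the authors' actual pipeline.

```python
def mutual_overestimation_rate(debates, threshold=75):
    """Fraction of debates in which both sides simultaneously claim >= threshold
    probability of winning in the same round (a logical impossibility in a
    zero-sum debate). `debates` is a list of per-round (conf_A, conf_B) pairs."""
    flagged = 0
    for rounds in debates:
        if any(a >= threshold and b >= threshold for a, b in rounds):
            flagged += 1
    return flagged / len(debates)

# toy log: two debates, three rounds of (side A, side B) confidences each
logs = [[(70, 60), (80, 78), (85, 90)], [(55, 50), (60, 45), (65, 40)]]
print(mutual_overestimation_rate(logs))   # -> 0.5
```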
Authors:Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Abstract:
We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel Jensen-Shannon divergence based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of updating steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. Code and results are available at https://ericbill21.github.io/JEDI/.
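JEDI's objective is built on the Jensen-Shannon divergence between attention distributions. The utility below computes plain JS divergence between two normalized attention maps; how JEDI embeds it in its entanglement objective (for instance, encouraging different subjects to attend to disjoint image regions) is not specified by the abstract, so treat the framing in the comments as an assumption.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# attention mass that two subject tokens place over image patches (toy numbers);
# a low value means the subjects attend to the same regions, i.e. are entangled
attn_subject_1 = np.array([0.6, 0.3, 0.05, 0.05])
attn_subject_2 = np.array([0.55, 0.35, 0.05, 0.05])
print(js_divergence(attn_subject_1, attn_subject_2))
```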
Authors:Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry
Abstract:
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .
中文摘要:本研究首次实现了使用4位浮点精度对权重、激活值和梯度进行全量化训练的大型语言模型,在保持与标准BF16基准相当性能的同时,证明了该方法在大规模训练中的实用高效性。
English Summary: This study introduces the first fully quantized training of large language models using 4-bit floating-point precision for weights, activations, and gradients, achieving performance comparable to standard BF16 baselines while demonstrating practical efficiency for large-scale training.
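The abstract attributes training stability partly to stochastic rounding in the backward and update passes. The sketch below rounds values onto a small FP4-like grid stochastically so that the rounding is unbiased in expectation; the grid values are an E2M1-style illustration, not the full NVFP4 format with shared E4M3 block scales.

```python
import numpy as np

def stochastic_round_to_grid(x, grid, rng):
    """Round each value to one of its two neighbouring grid points, with
    probability proportional to proximity, so the rounding is unbiased in
    expectation."""
    grid = np.sort(grid)
    hi_idx = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_hi = np.clip((x - lo) / np.maximum(hi - lo, 1e-12), 0.0, 1.0)
    return np.where(rng.random(x.shape) < p_hi, hi, lo)

rng = np.random.default_rng(0)
# E2M1-like positive values, mirrored for the sign (illustrative, not exact NVFP4)
pos = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
grid = np.unique(np.concatenate([-pos, pos]))
x = np.clip(rng.normal(size=10000), grid.min(), grid.max())
print("mean rounding bias:", (stochastic_round_to_grid(x, grid, rng) - x).mean())
```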
Authors:Benjamin Clavié, Florian Brand
Abstract:
Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks...), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench.
中文摘要:ReadBench是一个专门评估大型视觉语言模型在文本密集图像上阅读理解能力的新基准,发现模型在处理长篇内容时表现显著下降,而文本分辨率影响甚微。
English Summary: ReadBench is a new benchmark designed to assess how well Large Vision-Language Models read and reason about text-rich images, revealing significant performance drops with longer textual content despite minimal impact from text resolution.
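ReadBench's core construction step, turning a text-only benchmark context into an image of that text, can be approximated in a few lines of Pillow. The wrapping width, margins, and default font below are arbitrary choices for illustration and will not match the benchmark's actual rendering settings.

```python
from PIL import Image, ImageDraw

def render_text_to_image(text, width=800, margin=20, line_height=18):
    """Render a passage as a plain white image of black text, so a text-only
    benchmark context can be fed to a VLM as pixels instead of tokens."""
    # naive word wrapping by character count; a real pipeline would measure pixel width
    words, lines, line = text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 90:
            lines.append(line)
            line = w
        else:
            line = (line + " " + w).strip()
    lines.append(line)
    height = margin * 2 + line_height * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, ln in enumerate(lines):
        draw.text((margin, margin + i * line_height), ln, fill="black")
    return img

img = render_text_to_image("The quick brown fox jumps over the lazy dog. " * 20)
img.save("context_page.png")
```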
Authors:Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
Abstract:
Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.
中文: 本文提出Jodi框架,通过联合建模图像与多标签域的统一扩散方法,在生成与理解任务中表现优异,并展现出对广泛视觉领域的强大扩展能力。
English: This paper introduces Jodi, a diffusion framework that unifies visual generation and understanding through joint modeling of images and multiple labels, demonstrating strong performance across three task types and extensibility to diverse visual domains.
Authors:Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye
Abstract:
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR.
中文: Universal Reasoner (UniR) 是一种轻量级即插即用推理模块,可与任何冻结的大型语言模型结合,无需重新训练即可增强其专业推理能力,在多项任务中超越现有方法,实现了高效、自适应的推理增强。
English: The Universal Reasoner (UniR) is a lightweight, plug-and-play reasoning module that can be added to any frozen large language model to enhance its specialized reasoning capabilities without retraining, outperforming existing methods and enabling cost-efficient, adaptable reasoning across tasks.
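The additive composition described above amounts to summing next-token logits from the frozen backbone and one or more trained reasoning modules at inference time. The sketch assumes Hugging-Face-style causal LMs that share a tokenizer and expose a `.logits` field; the model names in the commented usage are placeholders, and the uniform weighting is a guess rather than the paper's exact recipe.

```python
import torch

@torch.no_grad()
def combined_next_token_logits(backbone, reasoning_modules, input_ids, weights=None):
    """Add the output logits of one or more small reasoning modules to a frozen
    backbone's logits, as in the additive composition described above. All models
    are assumed to be causal LMs sharing the same tokenizer/vocabulary."""
    logits = backbone(input_ids).logits[:, -1, :]          # backbone next-token logits
    weights = weights or [1.0] * len(reasoning_modules)
    for w, module in zip(weights, reasoning_modules):
        logits = logits + w * module(input_ids).logits[:, -1, :]
    return logits

# usage sketch (model names are placeholders, not paths from the paper):
# backbone = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
# math_unir = AutoModelForCausalLM.from_pretrained("path/to/unir-math")
# next_logits = combined_next_token_logits(backbone, [math_unir], input_ids)
```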
Authors:Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
Abstract:
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which models fail to follow text-generation instructions. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
中文:扩散模型在生成逼真图像方面表现出色,但在图像中生成一致且清晰文本方面存在不足,因此我们开发了STRICT基准来系统评估文本长度、正确性和指令遵循能力,揭示了持续存在的局限并推动了未来研究方向。
English: Diffusion models excel in realistic image generation but fail to produce consistent and legible text, leading to the creation of the STRICT benchmark for systematic evaluation across text length, correctness, and instruction adherence, revealing persistent limitations and motivating future research.
Authors:Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi W. Ma
Abstract:
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm, and its variant FactGraSS, specialized for linear layers, which explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
Chinese: GraSS及其变体FactGraSS通过利用逐样本梯度的稀疏性,在保持数据影响力准确性的同时显著降低了计算和内存成本,实现了大规模模型训练效率的显著提升。
English: GraSS and its variant FactGraSS are gradient compression algorithms that exploit the sparsity of per-sample gradients to reduce computational and memory costs while maintaining data influence accuracy, achieving significant speed improvements in large-scale models.
Authors:Yining Pan, Qiongjie Cui, Xulei Yang, Na Zhao
Abstract:
LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder's ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available.
中文: 提出的图像辅助激光雷达(IAL)框架通过同步数据增强和创新融合模块,有效结合激光雷达与相机数据,解决了多模态错位问题并优化特征整合,从而在三维全景分割中实现了最先进的性能。
English: The proposed Image-Assists-LiDAR (IAL) framework introduces synchronized data augmentation and novel fusion modules to effectively combine LiDAR and camera data, achieving state-of-the-art 3D panoptic segmentation by addressing misalignment and enhancing feature integration.
Authors:Xiping Li, Xiangyu Dong, Xingyi Zhang, Kun Xie, Yuanhao Feng, Bo Wang, Guilin Li, Wuxiong Zeng, Xiujun Shu, Sibo Wang
Abstract:
Graph Anomaly Detection (GAD) in heterogeneous networks presents unique challenges due to node and edge heterogeneity. Existing Graph Neural Network (GNN) methods primarily focus on homogeneous GAD and thus fail to address three key issues: (C1) Capturing abnormal signal and rich semantics across diverse meta-paths; (C2) Retaining high-frequency content in HIN dimension alignment; and (C3) Learning effectively from difficult anomaly samples with class imbalance. To overcome these, we propose ChiGAD, a spectral GNN framework based on a novel Chi-Square filter, inspired by the wavelet effectiveness in diverse domains. Specifically, ChiGAD consists of: (1) Multi-Graph Chi-Square Filter, which captures anomalous information via applying dedicated Chi-Square filters to each meta-path graph; (2) Interactive Meta-Graph Convolution, which aligns features while preserving high-frequency information and incorporates heterogeneous messages by a unified Chi-Square Filter; and (3) Contribution-Informed Cross-Entropy Loss, which prioritizes difficult anomalies to address class imbalance. Extensive experiments on public and industrial datasets show that ChiGAD outperforms state-of-the-art models on multiple metrics. Additionally, its homogeneous variant, ChiGNN, excels on seven GAD datasets, validating the effectiveness of Chi-Square filters. Our code is available at https://github.com/HsipingLi/ChiGAD.
Chinese: 提出的ChiGAD框架通过采用基于卡方检验的谱图滤波器,能够在异构网络中跨元路径捕捉异常信号并保留高频信息,同时通过贡献感知的交叉熵损失解决类别不平衡问题,实验证明其性能优于现有先进模型。
English: The proposed ChiGAD framework addresses key challenges in heterogeneous graph anomaly detection by employing spectral Chi-Square filters to capture anomalies across meta-paths while preserving high-frequency information and handling class imbalance through specialized loss functions, demonstrating superior performance over existing methods.
Authors:Javier Salazar Cavazos, Jeffrey A Fessler, Laura Balzano
Abstract:
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. Various methods have been proposed to extend PCA to the union of subspace (UoS) setting for clustering data that come from multiple subspaces like K-Subspaces (KSS). However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a heteroscedastic-focused subspace clustering method, named ALPCAHUS, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace bases associated with the low-rank structure of the data. This clustering algorithm builds on K-Subspaces (KSS) principles by extending the recently proposed heteroscedastic PCA method, named LR-ALPCAH, for clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing clustering algorithms. Code available at https://github.com/javiersc1/ALPCAHUS.
中文: 本文提出了ALPCAHUS异方差子空间聚类方法,通过估计样本级噪声方差来改进子空间基估计,在处理混合质量数据方面展现出优于现有算法的性能。
English: This paper introduces ALPCAHUS, a heteroscedastic subspace clustering method that estimates sample-wise noise variances to enhance subspace basis estimation, demonstrating superior performance over existing algorithms in handling data with mixed quality.
Authors:Alexander Conzelmann, Robert Bamler
Abstract:
The ever-growing size of neural networks poses serious challenges on resource-constrained devices, such as embedded sensors. Compression algorithms that reduce their size can mitigate these problems, provided that model performance stays close to the original. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding by (1) extending the well-known layer-wise loss by a quadratic rate estimation, and (2) providing locally exact solutions to this modified objective following the Optimal Brain Surgeon (OBS) method. Our method allows for very fast decoding and is compatible with arbitrary quantization grids. We verify our results empirically by testing on various computer-vision networks, achieving a 20-40% decrease in bit rate at the same performance as the popular compression algorithm NNCodec. Our code is available at https://github.com/Conzel/cerwu.
Chinese: 本研究提出了一种新颖的训练后压缩框架,将速率感知量化与熵编码相结合,在多种计算机视觉网络中实现了20-40%的比特率降低,同时保持与NNCodec相当的性能。
English: This study introduces a novel post-training compression framework that integrates rate-aware quantization with entropy coding, achieving a 20-40% reduction in bit rate while maintaining performance comparable to NNCodec across various computer-vision networks.
Authors:Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
Abstract:
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence the DPO loss. To address this limitation, we propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
中文摘要:直接偏好优化(DPO)因对所有令牌同等处理而存在局限,因此提出的OTPO方法通过基于最优传输的令牌加权机制,重点优化语义显著的令牌对,有效提升了偏好优化的性能。
English Summary: Direct Preference Optimization (DPO) faces limitations by treating all tokens equally, so the proposed OTPO method introduces optimal transport-based token weighting to emphasize meaningful tokens and improve preference optimization effectiveness.
Authors:Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang
Abstract:
Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: https://github.com/duguodong7/NPS-Pruning.
中文摘要:该研究提出的神经参数搜索(NPS-Pruning)方法通过利用任务向量子空间有效压缩微调模型,在保持视觉、自然语言处理和多模态任务性能的同时,显著提升了知识迁移、模型融合和存储效率。
English Summary: The proposed Neural Parameter Search (NPS-Pruning) method effectively compresses fine-tuned models by leveraging task vector subspaces, enhancing knowledge transfer, model merging, and storage efficiency while maintaining performance across vision, NLP, and multi-modal tasks.
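For orientation, the snippet below shows the task-vector preprocessing the abstract starts from (fine-tuned minus pretrained weights), combined with plain magnitude pruning of that delta. NPS-Pruning itself searches neural parameters within low-rank subspaces of the task vectors, which is not reproduced here; the keep_ratio and the toy state dicts are illustrative.

```python
import torch

def prune_task_vector(pretrained_sd, finetuned_sd, keep_ratio=0.1):
    """Compute task vectors (finetuned - pretrained), keep only the largest-magnitude
    entries per tensor, and return a slimmed fine-tuned state dict. This is plain
    magnitude pruning on task vectors; NPS-Pruning instead searches within low-rank
    subspaces, so treat this as a baseline-style sketch."""
    slim = {}
    for name, w0 in pretrained_sd.items():
        tau = finetuned_sd[name] - w0                           # task vector
        k = max(1, int(keep_ratio * tau.numel()))
        thresh = tau.abs().flatten().kthvalue(tau.numel() - k + 1).values
        mask = (tau.abs() >= thresh).to(tau.dtype)
        slim[name] = w0 + tau * mask                            # base + sparse delta
    return slim

# toy usage on a single tensor
w0 = {"layer.weight": torch.zeros(4, 4)}
w1 = {"layer.weight": torch.randn(4, 4)}
print(prune_task_vector(w0, w1, keep_ratio=0.25)["layer.weight"])
```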
Authors:Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
Abstract:
Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N}d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
中文: 本文提出MonarchAttention方法,通过使用表现力强的Monarch矩阵实现次二次注意力近似,在降低计算复杂度的同时保持性能与硬件效率,适用于多种任务场景。
English: This paper introduces MonarchAttention, a sub-quadratic attention approximation method using expressive Monarch matrices that reduces computational complexity while maintaining performance and hardware efficiency across various tasks.
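A Monarch matrix applies a block-diagonal factor, a fixed reshape/transpose permutation, a second block-diagonal factor, and the inverse permutation, which is what yields the $\Theta(N\sqrt{N}d)$ cost quoted above. The matvec below follows one common convention for that structure; MonarchAttention's actual contribution, the optimization-based projection of softmax attention onto this class, is not shown.

```python
import numpy as np

def monarch_matvec(L, R, x):
    """Multiply x by a Monarch-structured matrix: block-diagonal R, a
    reshape/transpose permutation, block-diagonal L, and the inverse permutation.
    L and R each hold m blocks of size (m, m), so the full matrix acts on vectors
    of length n = m * m with O(n * sqrt(n)) work."""
    m = L.shape[0]
    z = x.reshape(m, m)                   # split x into m contiguous blocks
    z = np.einsum("bij,bj->bi", R, z)     # apply R block-wise
    z = z.T                               # permutation (stride reshuffle)
    z = np.einsum("bij,bj->bi", L, z)     # apply L block-wise
    return z.T.reshape(-1)                # undo the permutation

rng = np.random.default_rng(0)
m = 16                                    # n = 256
L, R = rng.normal(size=(m, m, m)), rng.normal(size=(m, m, m))
x = rng.normal(size=m * m)
print(monarch_matvec(L, R, x).shape)      # (256,)
```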
Authors:Ziyang Cheng, Zhixun Li, Yuhan Li, Yixin Song, Kangyi Zhao, Dawei Cheng, Jia Li, Jeffrey Xu Yu
Abstract:
Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.
Chinese: 本研究探讨大型语言模型(LLMs)能否缓解图持续学习中的灾难性遗忘,提出了一种简单而有效的SimGCL方法,显著超越先前基于图神经网络的方法,并推出了便于未来研究使用的基准测试平台。
English: This study explores whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL), proposing a simple-yet-effective method called SimGCL that significantly outperforms previous GNN-based approaches and introducing an easy-to-use benchmark for future research.
Authors:Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han
Abstract:
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process, which first identifies potential unsafe concepts and then decides the response. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates, while maintaining strong performance on regular tasks such as math and code generation.
中文: 表征干预旨在定位和修改大语言模型中的概念表征以引导预期行为,但在非线性设置中难以忠实定位概念,因此提出COCA方法,通过重构训练数据简化有害概念的线性擦除,有效提升安全性同时保持模型性能。
English: Representation intervention in LLMs seeks to modify concept representations for aligned behavior, but achieving faithful concept location is challenging in non-linear settings, prompting the proposed COCA method that reframes training data to simplify harmful concept erasure and enhance safety without compromising performance.
Authors:Jian Liang, Wenke Huang, Xianda Guo, Guancheng Wan, Bo Du, Mang Ye
Abstract:
Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in multiple tasks simultaneously, motivating the need for efficient multi-task adaptation. While recent approaches integrate LoRA with mixture-of-experts (MoE) to address this, the use of routers prevents parameter mergeability, which increases inference overhead and hinders unified multi-task adaptation, thereby limiting deployment practicality. In this work, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables multi-task adaptation while preserving the inference efficiency of LoRA. ThanoRA jointly models task heterogeneity and mitigates subspace interference throughout training. Specifically, motivated by inherent differences in complexity and heterogeneity across tasks, ThanoRA constructs task-specific LoRA subspaces at initialization, enabling fine-grained knowledge injection aligned with task heterogeneity. Furthermore, to prevent task interference and subspace collapse during multi-task training, ThanoRA introduces a subspace-preserving regularization that maintains the independence of task-specific representations. With the synergy of both components, ThanoRA enables efficient and unified multi-task adaptation. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently achieves robust and superior performance over strong baselines without introducing additional inference overhead. Our code is publicly available at: https://github.com/LiangJian24/ThanoRA.
中文摘要:ThanoRA是一种任务异质性感知的多任务低秩适配框架,通过在初始化时构建任务特定的LoRA子空间并施加子空间保持正则化,在不增加推理开销的情况下实现统一且高效的多任务适配,性能持续优于强基线。
English Summary: ThanoRA is a task heterogeneity-aware multi-task low-rank adaptation framework that constructs task-specific LoRA subspaces at initialization and applies subspace-preserving regularization, consistently outperforming strong baselines across multi-task mixtures without adding inference overhead.
Authors:Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu
Abstract:
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to the need to align with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose SylAVL-CoT, a Syllable-Constrained Audio-Video LLM with Chain-of-Thought reasoning, which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
中文: MAVL基准和SylAVL-CoT模型通过整合多模态线索与音节约束,显著提升了可唱歌词翻译的自然度与准确性,全面优于纯文本翻译方法。
English: The MAVL benchmark and SylAVL-CoT model enhance singable lyrics translation by integrating multimodal cues and syllabic constraints, significantly outperforming text-only methods in both naturalness and accuracy.
Authors:Faithful Chiagoziem Onwuegbuche, Adelodun Olaoluwa, Anca Delia Jurcut, Liliana Pasquale
Abstract:
Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often have limited sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision and recall of up to 98.7%, 98.9%, 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are available publicly to support replicability and encourage future research, which can be found at https://github.com/faithfulco/mlran.
中文: 本文提出了MLRan行为勒索软件数据集,包含64个家族的4800多个样本,并证明基于该数据集的机器学习模型检测准确率高达98.7%,同时通过可解释性分析揭示了注册表篡改等关键恶意行为特征。
English: This paper introduces MLRan, a comprehensive behavioral ransomware dataset with over 4,800 samples across 64 families, and demonstrates that machine learning models trained on it achieve up to 98.7% accuracy in detection by identifying key malicious behaviors like registry tampering.
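The two-stage feature selection described above (mutual-information filtering followed by recursive feature elimination) maps directly onto standard scikit-learn components. The toy dataset, the intermediate k=50, and the final 10 features are stand-ins for the paper's 6.4 million to 24,162 to 483 pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

# toy data standing in for the behavioural feature matrix
X, y = make_classification(n_samples=500, n_features=200, n_informative=20, random_state=0)

# stage 1: mutual-information filtering down to a manageable feature pool
mi_filter = SelectKBest(mutual_info_classif, k=50).fit(X, y)
X_mi = mi_filter.transform(X)

# stage 2: recursive feature elimination to a compact, highly informative subset
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X_mi, y)
X_final = rfe.transform(X_mi)
print(X_final.shape)   # (500, 10)
```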
Authors:Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Abstract:
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/ .
Authors:Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
Abstract:
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs.
中文: 本研究提出动态分配训练资源和自适应调整温度的策略,以优化大型语言模型的强化学习过程,实现高效训练并保持探索能力。
English: This study introduces a dynamic rollout allocation mechanism and adaptive temperature adjustment strategy to enhance reinforcement learning for large language models, enabling more efficient training and sustained exploratory capacity.
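A minimal sketch of the two ideas above, under simple assumptions: harder questions (lower empirical pass rate) receive a larger share of a fixed rollout budget, and the sampling temperature is nudged toward a target entropy. The difficulty proxy, budget rule, and update step size are illustrative, not the paper's exact schedules.
```python
# Minimal sketch: difficulty-weighted rollout allocation and an entropy-driven
# temperature update (both rules are illustrative assumptions).
import numpy as np

def allocate_rollouts(pass_rates, total_budget, min_rollouts=1):
    # Difficulty proxy: 1 - pass rate; normalize into a share of the budget.
    difficulty = 1.0 - np.asarray(pass_rates, dtype=float)
    weights = difficulty / max(difficulty.sum(), 1e-8)
    # Rounding means the allocation only approximately sums to the budget.
    rollouts = np.maximum(min_rollouts, np.round(weights * total_budget))
    return rollouts.astype(int)

def adjust_temperature(temp, entropy, target_entropy, lr=0.05):
    # Raise temperature when entropy drops below the target, lower it otherwise.
    return float(np.clip(temp + lr * (target_entropy - entropy), 0.1, 2.0))

print(allocate_rollouts([0.9, 0.5, 0.1], total_budget=24))
print(adjust_temperature(1.0, entropy=0.8, target_entropy=1.2))
```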
Authors:Md Ahsanul Haque, Ismail Hossain, Md Mahmuduzzaman Kamol, Md Jahangir Alam, Suresh Kumar Amalapuram, Sajedul Talukder, Mohammad Saidur Rahman
Abstract:
Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift, i.e., distributional shifts in benign and malicious samples, leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection.
To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013-2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at: https://iqsec-lab.github.io/LAMDA/.
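The kind of temporal-drift study LAMDA is built for can be sketched as train-on-one-year, test-on-later-years. The snippet below uses a synthetic drifting dataset as a placeholder for the released loaders and features; only the evaluation pattern is the point.
```python
# Minimal sketch of a temporal concept-drift evaluation: train on one "year",
# score on later "years". The data generator is a synthetic placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def fake_year(year, n=1000, d=50):
    # Shift the feature distribution and label noise a little more each year.
    drift = 0.2 * (year - 2013)
    X = rng.normal(drift, 1.0, size=(n, d))
    y = (X[:, 0] + 0.5 * drift * rng.normal(size=n) > 0).astype(int)
    return X, y

X_train, y_train = fake_year(2013)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for year in range(2014, 2018):
    X_test, y_test = fake_year(year)
    print(year, round(f1_score(y_test, clf.predict(X_test)), 3))
```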
Authors:Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu
Abstract:
The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. This leads to fairness in AUC optimization becoming crucial as biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with fairness theoretical guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups.
Chinese: 本文首次提出在受保护群体标签存在噪声的情况下,采用分布鲁棒优化方法确保AUC公平性的稳健方案,并通过在表格和图像数据集上的大量实验验证了该方法优于现有技术。
English: This paper introduces the first robust approach to ensuring AUC fairness under noisy protected groups, using distributionally robust optimization with theoretical guarantees, and demonstrates its superiority over existing methods through extensive experiments on tabular and image datasets.
Authors:Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
Abstract:
Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials dataset (SCT), specifically designed for this task. CLaDMoP leverages a Large Language Model to encode trials' eligibility criteria, linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction (TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from https://github.com/murai-lab/CLaDMoP.
Chinese: 我们提出了CLaDMoP,一种用于临床试验结果预测的新预训练方法,通过配对匹配任务提升泛化能力,在PR-AUC和ROC-AUC指标上较现有基线取得显著提升。
English: We introduce CLaDMoP, a novel pre-training method for clinical trial outcome prediction that uses a pair matching task to enhance generalizability and achieves significant improvements in PR-AUC and ROC-AUC over existing baselines.
Authors:Taeckyung Lee, Sorn Chottananurak, Junsu Kim, Jinwoo Shin, Taesik Gong, Sung-Ju Lee
Abstract:
Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a new setting of TTA with binary feedback. This setting uses a few binary feedback inputs from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under the setting, we propose BiTTA, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves 13.3%p accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort. The source code is available at https://github.com/taeckyung/BiTTA.
中文:BiTTA提出了一种基于二元反馈的测试时自适应框架,通过强化学习平衡不确定样本的反馈引导优化与置信预测的自主适应,在极端域偏移下以最少标注成本实现了显著性能提升。
English: BiTTA introduces a test-time adaptation framework using binary feedback to guide model updates on uncertain samples while self-adapting confident predictions, achieving significant accuracy gains with minimal labeling effort under severe domain shifts.
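A rough sketch of the dual-path idea above, under assumptions: confident test predictions are reinforced by self-adaptation against the model's own hard labels, while a few uncertain samples receive binary (correct/incorrect) feedback that is turned into a reward-weighted log-likelihood term. BiTTA itself optimizes this with reinforcement learning; the threshold, weighting, and loss form below are simplified stand-ins, not the released objective.
```python
# Simplified dual-path test-time loss (assumed form, not the BiTTA objective).
import torch
import torch.nn.functional as F

def dual_path_loss(logits, feedback, conf_threshold=0.8):
    """feedback: bool tensor, annotator says each prediction is correct/incorrect."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    confident = conf >= conf_threshold
    uncertain = ~confident

    # Path 1: agreement-based self-adaptation on confident predictions
    # (cross-entropy against the model's own hard predictions).
    self_loss = (F.cross_entropy(logits[confident], pred[confident])
                 if confident.any() else logits.sum() * 0.0)

    # Path 2: binary feedback on uncertain samples, turned into a +1/-1 reward
    # on the log-probability of the predicted class.
    reward = feedback[uncertain].float() * 2.0 - 1.0
    logp = probs[uncertain, pred[uncertain]].clamp_min(1e-8).log()
    fb_loss = (-(reward * logp).mean()
               if uncertain.any() else logits.sum() * 0.0)

    return self_loss + fb_loss

logits = torch.randn(8, 10, requires_grad=True)
feedback = torch.rand(8) > 0.5        # simulated binary annotator feedback
dual_path_loss(logits, feedback).backward()
```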
Authors:Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin Raffel, Yiming Yang
Abstract:
Training data attribution (TDA) methods aim to measure how training data impacts a model's predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at https://github.com/sunnweiwei/AirRep.
中文: AirRep是一种基于表示的可扩展方法,通过优化学习任务特定和模型对齐的表征来进行训练数据归因,在显著提高效率的同时实现了与基于梯度方法相媲美的性能。
English: AirRep is a scalable, representation-based method that learns task-specific and model-aligned representations for training data attribution, achieving performance comparable to gradient-based approaches with significantly higher efficiency.
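The attention-based pooling mentioned above can be pictured as follows: per-example embeddings of a candidate training subset are pooled with learned attention weights and scored against the embedding of a target prediction. The dimensions, dot-product scorer, and module layout are assumptions for illustration; AirRep additionally trains the encoder with a ranking objective over subsets, which is omitted here.
```python
# Minimal sketch of attention pooling for group-wise influence scoring
# (architecture details are illustrative assumptions).
import torch
import torch.nn as nn

class GroupInfluenceScorer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.Linear(dim, 1)   # scores each example inside the group

    def forward(self, group_emb, query_emb):
        # group_emb: (n_examples, dim), query_emb: (dim,)
        weights = self.attn(group_emb).softmax(dim=0)   # attention over the group
        pooled = (weights * group_emb).sum(dim=0)       # pooled group representation
        return pooled @ query_emb                       # scalar influence score

scorer = GroupInfluenceScorer()
group = torch.randn(16, 64)   # embeddings of a candidate training subset
query = torch.randn(64)       # embedding of the target example/prediction
print(scorer(group, query).item())
```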
Authors:Guodong Du, Xuanning Zhou, Junlin Li, Zhuo Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li
Abstract:
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.
Chinese: GraftLLM提出了一种新颖的跨能力迁移方法,通过SkillPack格式在异构模型间高效存储和传递知识,既能防止灾难性遗忘,又能实现可扩展的持续学习。
English: GraftLLM introduces a novel cross-capability transfer method using SkillPack format to efficiently store and transfer knowledge between heterogeneous models while preventing catastrophic forgetting and enabling scalable continual learning.
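One way to picture a compact, transferable carrier of task-specific parameter updates is sketched below: per-module deltas (target minus source weights) are compressed with a truncated SVD and re-applied on demand. The rank choice, module granularity, and dictionary layout are illustrative assumptions, not GraftLLM's actual module-aware adaptive compression or SkillPack format.
```python
# Minimal sketch: store a per-module parameter delta as low-rank SVD factors
# and re-apply it later (compression scheme is an illustrative assumption).
import numpy as np

def compress_delta(w_src, w_tgt, rank):
    u, s, vt = np.linalg.svd(w_tgt - w_src, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]          # low-rank factors

def apply_delta(w_src, factors):
    a, b = factors
    return w_src + a @ b

rng = np.random.default_rng(0)
w_src = rng.normal(size=(64, 64))
w_tgt = w_src + rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))  # low-rank change

skillpack = {"layer.0.weight": compress_delta(w_src, w_tgt, rank=8)}
w_restored = apply_delta(w_src, skillpack["layer.0.weight"])
print(np.allclose(w_restored, w_tgt, atol=1e-6))
```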
Authors:Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, Yisen Wang
Abstract:
Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date, comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training samples, and 5k test samples, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully. Our implementation is open-sourced at https://github.com/PKU-ML/G1, with models and datasets hosted on Hugging Face collections https://huggingface.co/collections/PKU-ML/g1-683d659e992794fc99618cf2 for broader accessibility.
中文: G1方法通过在合成的Erdős数据集上进行强化学习,显著提升了大语言模型的图推理能力,使仅30亿参数的模型性能超越庞大模型,并能良好泛化且不损害通用推理能力。
English: The G1 approach uses reinforcement learning on the synthetic Erdős dataset to significantly enhance LLMs' graph reasoning, enabling a compact 3B model to outperform much larger models and generalize well without compromising general abilities.
Authors:Junlin Wang, Zhiyun Lin
Abstract:
Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: https://github.com/HenryWJL/icon
中文: 本文提出ICon方法,通过对视觉变换器的令牌级表征进行对比学习,分离智能体与环境相关令牌,形成具身化视觉表征,从而提升机器人操作策略的学习效果与跨机器人迁移能力。
English: The paper introduces ICon, a contrastive learning method for Vision Transformers that separates agent-specific and environment-specific tokens to create body-relevant visual representations, enhancing robotic manipulation policy learning and transferability.
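A minimal sketch of an inter-token contrastive objective in the spirit described above: ViT tokens sharing the same agent/environment label are pulled together and the two groups are pushed apart. The masking source, temperature, and exact loss form are assumptions; this is not the released ICon loss.
```python
# Sketch of a token-level contrastive loss separating agent and environment
# tokens (supervised-contrastive style; details are illustrative assumptions).
import torch
import torch.nn.functional as F

def inter_token_contrast(tokens, agent_mask, temperature=0.1):
    # tokens: (n_tokens, dim); agent_mask: (n_tokens,) bool, True for agent tokens
    z = F.normalize(tokens, dim=-1)
    sim = z @ z.t() / temperature                      # pairwise similarities
    same_group = agent_mask[:, None] == agent_mask[None, :]
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = same_group & ~eye                            # positives: same group, not self
    log_prob = sim.masked_fill(eye, float("-inf")).log_softmax(dim=-1)
    # Average log-likelihood of positives for each anchor token.
    pos_count = pos.sum(dim=-1).clamp_min(1)
    return -(log_prob.masked_fill(~pos, 0.0).sum(dim=-1) / pos_count).mean()

tokens = torch.randn(12, 32, requires_grad=True)
agent_mask = torch.tensor([True] * 4 + [False] * 8)
inter_token_contrast(tokens, agent_mask).backward()
```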
Authors:Mengran Li, Pengyu Zhang, Wenbin Xing, Yijia Zheng, Klim Zaporojets, Junzhou Chen, Ronghui Zhang, Yong Zhang, Siyuan Gong, Jia Hu, Xiaolei Ma, Zhiyuan Liu, Paul Groth, Marcel Worring
Abstract:
Graphs are a widely used paradigm for representing non-Euclidean data, with applications ranging from social network analysis to biomolecular prediction. While graph learning has achieved remarkable progress, real-world graph data presents a number of challenges that significantly hinder the learning process. In this survey, we focus on four fundamental data-centric challenges: (1) Incompleteness: real-world graphs have missing nodes, edges, or attributes; (2) Imbalance: the distribution of node or edge labels and structures in real-world graphs is highly skewed; (3) Cross-domain Heterogeneity: graphs from different domains exhibit incompatible feature spaces or structural patterns; and (4) Dynamic Instability: graphs evolve over time in unpredictable ways. Recently, Large Language Models (LLMs) have offered the potential to tackle these challenges by leveraging rich semantic reasoning and external knowledge. This survey focuses on how LLMs can address these four fundamental data-centric challenges in graph-structured data, thereby improving the effectiveness of graph learning. For each challenge, we review both traditional solutions and modern LLM-driven approaches, highlighting how LLMs contribute unique advantages. Finally, we discuss open research questions and promising future directions in this emerging interdisciplinary field. To support further exploration, we have curated a repository of recent advances on graph learning challenges: https://github.com/limengran98/Awesome-Literature-Graph-Learning-Challenges.
Chinese: 本综述探讨大型语言模型如何利用语义推理和外部知识应对图学习中的四个核心数据挑战——不完整性、不平衡性、跨域异质性和动态不稳定性,从而提升图学习效能。
English: This survey explores how Large Language Models (LLMs) can address four key data-centric challenges in graph learning—incompleteness, imbalance, cross-domain heterogeneity, and dynamic instability—by leveraging semantic reasoning and external knowledge to enhance learning effectiveness.
Authors:Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu
Abstract:
The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data Storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
中文: 本综述探讨了大型语言模型与数据管理的双向融合,既涵盖数据系统通过处理、存储和服务支持模型开发,也涉及模型如何优化数据操作、分析和系统管理。
English: This survey explores the bidirectional integration between large language models (LLMs) and data management, covering how data systems support LLM development through processing, storage, and serving, while LLMs enhance data tasks like manipulation, analysis, and system optimization.
Authors:Zhining Liu, Ze Yang, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Abstract:
Time-series forecasting plays a critical role in many real-world applications. Although increasingly powerful models have been developed and achieved superior results on benchmark datasets, through a fine-grained sample-level inspection, we find that (i) no single model consistently outperforms others across different test samples, but instead (ii) each model excels in specific cases. These findings prompt us to explore how to adaptively leverage the distinct strengths of various forecasting models for different samples. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models. TimeFuse utilizes meta-features to characterize input time series and trains a learnable fusor to predict optimal model fusion weights for any given input. The fusor can leverage samples from diverse datasets for joint training, allowing it to adapt to a wide variety of temporal patterns and thus generalize to new inputs, even from unseen datasets. Extensive experiments demonstrate the effectiveness of TimeFuse in various long-/short-term forecasting tasks, achieving near-universal improvement over the state-of-the-art individual models. Code is available at https://github.com/ZhiningLiu1998/TimeFuse.
中文: TimeFuse是一种新颖的时序预测框架,通过元特征和可学习的融合器对不同样本自适应地融合多种预测模型,在各类预测任务中均实现了优于现有最优模型的性能提升。
English: TimeFuse is a novel framework that adaptively fuses multiple time-series forecasting models at the sample level using meta-features and a learnable fusor, achieving universal improvements over state-of-the-art models across various forecasting tasks.
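The sample-level fusion above can be sketched as a small "fusor" that maps meta-features of each input window to softmax weights over frozen base forecasters. The hand-picked meta-features, network size, and random base predictions below are placeholders, not TimeFuse's actual components.
```python
# Minimal sketch of sample-level model fusion via a learnable fusor
# (meta-features and base forecasters are illustrative placeholders).
import torch
import torch.nn as nn

def meta_features(x):                        # x: (batch, length)
    return torch.stack([x.mean(-1), x.std(-1), x[:, -1] - x[:, 0]], dim=-1)

class Fusor(nn.Module):
    def __init__(self, n_meta, n_models):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_meta, 32), nn.ReLU(),
                                 nn.Linear(32, n_models))

    def forward(self, x, base_preds):        # base_preds: (batch, n_models, horizon)
        w = self.net(meta_features(x)).softmax(dim=-1)    # per-sample model weights
        return (w.unsqueeze(-1) * base_preds).sum(dim=1)  # fused forecast

x = torch.randn(8, 96)                       # input windows
base_preds = torch.randn(8, 3, 24)           # forecasts from 3 frozen base models
fused = Fusor(n_meta=3, n_models=3)(x, base_preds)
print(fused.shape)                           # torch.Size([8, 24])
```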
Authors:Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer
Abstract:
Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings, however, requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this structure is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling embeddings of the Gemma-2-2B model and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) that traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We provide an implementation of DB-KSVD at https://github.com/RomeoV/KSVD.jl.
Chinese: 本研究提出可扩展的字典学习算法Double-Batch KSVD,在解构Transformer嵌入表示的任务中与稀疏自编码器取得相当性能,既验证了现有方法的有效性,也展现了传统优化方法在大规模机制可解释性研究中的应用潜力。
English: This study introduces Double-Batch KSVD, a scalable dictionary learning algorithm that achieves competitive performance with sparse autoencoders in disentangling transformer embeddings, demonstrating both the effectiveness of existing methods and the potential of traditional optimization approaches for large-scale mechanistic interpretability.
Authors:Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong
Abstract:
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
中文: DanmakuTPPBench推出了一个多模态基准,整合了来自B站弹幕的时间事件数据和问答数据集,以推动时序点过程建模的发展,揭示了现有方法在处理时序-文本-视觉推理方面的不足。
English: DanmakuTPPBench introduces a multi-modal benchmark combining temporal event data from Bilibili's bullet comments with a QA dataset to advance Temporal Point Process modeling, revealing current methods' limitations in handling temporal-textual-visual reasoning.
Authors:Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin
Abstract:
Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: https://github.com/lin-zhao-resoLve/D3HR.
中文: 本文提出D^3HR框架,通过DDIM反演和高效采样方案解决当前基于扩散的数据集蒸馏方法中的关键问题,生成高代表性数据集,在不同模型架构下均比现有方法获得更高准确率。
English: This paper introduces D^3HR, a diffusion-based framework that addresses key issues in dataset distillation by using DDIM inversion and an efficient sampling scheme to generate highly representative datasets, achieving superior accuracy across various model architectures compared to existing methods.
Authors:Pingchuan Ma, Ziang Yin, Qi Jing, Zhengqi Gao, Nicholas Gangi, Boyang Zhang, Tsung-Wei Huang, Zhaoran Huang, Duane S. Boning, Yu Yao, Jiaqi Gu
Abstract:
Diffractive optical neural networks (DONNs) leverage light propagation for efficient analog AI and signal processing. Advances in nanophotonic fabrication and metasurface-based wavefront engineering have opened new pathways to realize high-capacity DONNs across various spectral regimes. Training such DONN systems to determine the metasurface structures remains challenging. Heuristic methods are fast but oversimplify metasurface modulation, often resulting in physically unrealizable designs and significant performance degradation. Simulation-in-the-loop optimizes implementable metasurfaces via adjoint methods, but is computationally prohibitive and unscalable. To address these limitations, we propose SP2RINT, a spatially decoupled, progressive training framework that formulates DONN training as a PDE-constrained learning problem. Metasurface responses are first relaxed into freely trainable transfer matrices with a banded structure. We then progressively enforce physical constraints by alternating between transfer matrix training and adjoint-based inverse design, avoiding per-iteration PDE solves while ensuring final physical realizability. To further reduce runtime, we introduce a physics-inspired, spatially decoupled inverse design strategy based on the natural locality of field interactions. This approach partitions the metasurface into independently solvable patches, enabling scalable and parallel inverse design with system-level calibration. Evaluated across diverse DONN training tasks, SP2RINT achieves digital-comparable accuracy while being 1825 times faster than simulation-in-the-loop approaches. By bridging the gap between abstract DONN models and implementable photonic hardware, SP2RINT enables scalable, high-performance training of physically realizable meta-optical neural systems. Our code is available at https://github.com/ScopeX-ASU/SP2RINT
中文: SP2RINT是一种创新的训练框架,通过空间解耦和渐进式物理约束实施,实现了可扩展且高效的衍射光学神经网络设计,在保持与数字系统相当精度的同时,比传统方法提速1825倍。
English: SP2RINT is a novel training framework that enables scalable and efficient design of physically realizable diffractive optical neural networks by decoupling spatial constraints and progressively enforcing physical realizability, achieving digital-comparable accuracy with 1825× speedup over conventional methods.
Authors:Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, Zhixuan Chu
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries--a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models' safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and construct the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets are available at https://github.com/Master-PLC/RASS.
中文摘要:大语言模型常因保守的安全对齐而过度拒绝合理查询,但提出的RASS框架通过策略性地识别边界提示来缓解此问题,同时保持跨语言场景下的安全性。
English Summary: Large language models often overrefuse legitimate queries due to conservative safety alignment, but the proposed RASS framework strategically identifies boundary prompts to mitigate this issue while maintaining safety across languages.
Authors:Roy Elkayam
Abstract:
This study presents a novel approach for decomposing urban water demand patterns using Skewed Gaussian Distributions (SGD) to derive behavioral insights and support operational planning. Hourly demand profiles contain critical information for both long-term infrastructure design and daily operations, influencing network pressures, water quality, energy consumption, and overall reliability. By breaking down each daily demand curve into a baseline component and distinct peak components, the proposed SGD method characterizes each peak with interpretable parameters, including peak amplitude, timing (mean), spread (duration), and skewness (asymmetry), thereby reconstructing the observed pattern and uncovering latent usage dynamics. This detailed peak-level decomposition enables both operational applications (e.g., anomaly and leakage detection, real-time demand management) and strategic analyses (e.g., identifying behavioral shifts, seasonal influences, or policy impacts on consumption patterns). Unlike traditional symmetric Gaussian or purely statistical time-series models, SGDs explicitly capture asymmetric peak shapes such as sharp morning surges followed by gradual declines, improving the fidelity of synthetic pattern generation and enhancing the detection of irregular consumption behavior. The method is demonstrated on several real-world datasets, showing that SGD outperforms symmetric Gaussian models in reconstruction accuracy, reducing root-mean-square error by over 50% on average, while maintaining physical interpretability. The SGD framework can also be used to construct synthetic demand scenarios by designing daily peak profiles with chosen characteristics. All implementation code is publicly available at: https://github.com/Relkayam/water-demand-decomposition-sgd
中文: 本研究提出了一种利用偏态高斯分布分解城市用水需求模式的新方法,通过精确捕捉非对称峰值特征,显著提高了重构精度和可解释性,从而获取行为洞察并支持运营规划。
English: This study introduces a novel method using Skewed Gaussian Distributions to decompose urban water demand patterns, enabling behavioral insights and operational planning by accurately capturing asymmetric peak characteristics with improved reconstruction accuracy and interpretability.
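A minimal sketch of the peak decomposition described above: one daily peak is modelled as a skew-normal bump on a constant baseline, and its amplitude, timing, spread, and skewness are fitted by least squares. Real demand profiles would use a baseline plus several such peaks; the synthetic curve and initial guesses are illustrative.
```python
# Minimal sketch: fit a single skewed-Gaussian peak plus baseline to a
# synthetic daily demand curve (illustrative data and starting values).
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import skewnorm

def one_peak(t, baseline, amp, loc, scale, skew):
    return baseline + amp * skewnorm.pdf(t, skew, loc=loc, scale=scale)

hours = np.linspace(0, 24, 96)
true = one_peak(hours, baseline=5.0, amp=40.0, loc=7.0, scale=2.0, skew=4.0)
observed = true + np.random.default_rng(0).normal(0, 0.3, hours.size)

p0 = [4.0, 30.0, 8.0, 2.0, 1.0]   # rough initial guess
params, _ = curve_fit(one_peak, hours, observed, p0=p0)
print(dict(zip(["baseline", "amp", "loc", "scale", "skew"], np.round(params, 2))))
```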
Authors:Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Abstract:
In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
中文: 本文主张将标记约简从传统的效率策略提升为生成式建模的核心原则,强调其在促进多模态整合、减少幻觉、保持长输入连贯性及增强训练稳定性等方面的关键作用,并展望了其在算法设计和跨领域应用中的潜力。
English: This paper repositions token reduction from a mere efficiency strategy to a fundamental principle in generative modeling, arguing it enhances multimodal integration, reduces hallucinations, maintains coherence in long inputs, and improves training stability across vision, language, and multimodal systems.
Authors:Beck LaBash, Shahriar Khushrushahi, Fabian Ruehle
Abstract:
We propose a two-stage deep learning framework for the inverse design of rectangular patch antennas. Our approach leverages generative modeling to learn a latent representation of antenna frequency response curves and conditions a subsequent generative model on these responses to produce feasible antenna geometries. We further demonstrate that leveraging search and optimization techniques at test-time improves the accuracy of the generated designs and enables consideration of auxiliary objectives such as manufacturability. Our approach generalizes naturally to different design criteria, and can be easily adapted to more complex geometric design spaces.
中文: 该两阶段深度学习框架通过生成式建模学习天线频率响应的潜在表征并据此生成可行几何结构,结合测试阶段的搜索优化提升设计精度并兼顾可制造性等辅助目标,能自然适应不同设计标准并扩展至复杂几何空间。
English: The proposed two-stage deep learning framework utilizes generative modeling to design rectangular patch antennas by learning frequency response representations and generating corresponding geometries, with test-time optimization enhancing design accuracy and incorporating manufacturability considerations.
Authors:Min Namgung, Yijun Lin, JangHyeon Lee, Yao-Yi Chiang
Abstract:
With the increasing availability of geospatial datasets, researchers have explored region representation learning (RRL) to analyze complex region characteristics. Recent RRL methods use contrastive learning (CL) to capture shared information between two modalities but often overlook task-relevant unique information specific to each modality. Such modality-specific details can explain region characteristics that shared information alone cannot capture. Bringing information factorization to RRL can address this by factorizing multimodal data into shared and unique information. However, existing factorization approaches focus on two modalities, whereas RRL can benefit from various geospatial data. Extending factorization beyond two modalities is non-trivial because modeling high-order relationships introduces a combinatorial number of learning objectives, increasing model complexity. We introduce Cross-modal Knowledge Injected Embedding (CooKIE), an information factorization approach for RRL that captures both shared and unique representations. CooKIE uses a pairwise inter-view learning approach that captures high-order information without modeling high-order dependency, avoiding exhaustive combinations. We evaluate CooKIE on three regression tasks and a land use classification task in New York City and Delhi, India. Results show that CooKIE outperforms existing RRL methods and a factorized RRL model, capturing multimodal information with fewer training parameters and floating-point operations per second (FLOPs). We release the code: https://github.com/MinNamgung/CooKIE.
中文摘要:CooKIE提出了一种高效的区域表征学习信息分解方法,无需建模复杂高阶依赖即可捕获多模态共享与独特信息,在降低计算成本的同时展现出优越性能。
English Summary: CooKIE introduces an efficient information factorization method for region representation learning that captures both shared and unique multimodal information without modeling complex high-order dependencies, demonstrating superior performance with reduced computational costs.
Authors:Natia Kukhilava, Tatia Tsmindashvili, Rapael Kalandadze, Anchit Gupta, Sofio Katamadze, François Brémond, Laura M. Ferrari, Philipp Müller, Benedikt Emanuel Wirth
Abstract:
Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches and to track the field's progress. We report the main inconsistencies between the used evaluation protocols, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (https://github.com/EmotionLab/EEGain), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare - and thus define - state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
中文: EEG-ER研究领域缺乏统一的评估标准,为此开发了EEGain开源框架,通过标准化数据处理和评估方法,确保研究结果的可复现性与可比性。
English: EEG-ER research lacks standardized evaluation protocols, prompting the development of EEGain, an open-source framework that ensures reproducible and comparable results by unifying data processing and assessment methods.
Authors:Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, Soheil Feizi
Abstract:
Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use--a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 17 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources. Our code is publicly available at https://github.com/kazemf78/llm-unreliable-tool-preferences.
中文: 大型语言模型选择工具时易受描述篡改的影响,某些编辑后的描述可使工具使用率激增十倍以上,这凸显了建立更可靠协议的必要性。
English: Large language models' tool selection is vulnerable to manipulated descriptions, with edited versions increasing usage over tenfold in some cases, highlighting the need for more reliable protocols.
Authors:Georgios Kementzidis, Erin Wong, John Nicholson, Ruichen Xu, Yuefan Deng
Abstract:
The techniques of data-driven backmapping from coarse-grained (CG) to fine-grained (FG) representation often struggle with accuracy, unstable training, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework by using conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle the challenges associated with such large-scale biomolecules. Our method enables stepwise refinement from CG beads to full atomistic details. We outline the theory of iterative generative backmapping and demonstrate via numerical experiments the advantages of multistep schemes by applying them to proteins of vastly different structures with very coarse representations. This multistep approach not only improves the accuracy of reconstructions but also makes the training process more computationally efficient for proteins with ultra-CG representations.
中文: 本研究提出了一种创新的迭代框架,利用条件变分自编码器和图神经网络,有效提升了复杂蛋白质系统中从粗粒度到细粒度表示的数据驱动反向映射的准确性、训练稳定性和物理真实性。
English: This study introduces an innovative iterative framework using conditional Variational Autoencoders and graph neural networks to enhance the accuracy, training stability, and physical realism of data-driven backmapping from coarse-grained to fine-grained representations in complex protein systems.
Authors:Zizhao Chen, Yoav Artzi
Abstract:
We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at https://github.com/lil-lab/knotgym.
中文: KnotGym是一个用于空间推理和操作的交互式环境,通过基于绳结交叉数量的可量化复杂度任务,为评估不同人工智能方法提供了测试平台。
English: KnotGym is an interactive environment for spatial reasoning and manipulation that features goal-oriented rope tasks with scalable complexity based on knot crossings, enabling evaluation of various AI methods.
Authors:Shashank Agnihotri, David Schader, Jonas Jakubassa, Nico Sharei, Simon Kral, Mehmet Ege Kaçar, Ruben Weber, Margret Keuper
Abstract:
Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of 6,139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
中文摘要:本研究提出SEMSEGBENCH和DETECBENCH基准测试工具,通过大规模评估揭示了当前最优语义分割和物体检测模型在分布偏移和对抗攻击下存在的系统性缺陷,旨在推动超越分类任务的模型可靠性研究。
English Summary: This research introduces SEMSEGBENCH and DETECBENCH benchmarking tools to evaluate the robustness of semantic segmentation and object detection models against distribution shifts and adversarial attacks, revealing systematic weaknesses in state-of-the-art models through extensive testing.
Authors:Honghao Li, Yiwen Zhang, Yi Zhang, Lei Sang, Jieming Zhu
Abstract:
Hadamard Product (HP) has long been a cornerstone in click-through rate (CTR) prediction tasks due to its simplicity, effectiveness, and ability to capture feature interactions without additional parameters. However, the underlying reasons for its effectiveness remain unclear. In this paper, we revisit HP from the perspective of Quadratic Neural Networks (QNN), which leverage quadratic interaction terms to model complex feature relationships. We further reveal QNN's ability to expand the feature space and provide smooth nonlinear approximations without relying on activation functions. Meanwhile, we find that traditional post-activation does not further improve the performance of the QNN. Instead, mid-activation is a more suitable alternative. Through theoretical analysis and empirical evaluation of 25 QNN neuron formats, we identify a good-performing variant and make further enhancements on it. Specifically, we propose the Multi-Head Khatri-Rao Product as a superior alternative to HP and a Self-Ensemble Loss with dynamic ensemble capability within the same network to enhance computational efficiency and performance. Ultimately, we propose a novel neuron format, QNN-alpha, which is tailored for CTR prediction tasks. Experimental results show that QNN-alpha achieves new state-of-the-art performance on six public datasets while maintaining low inference latency, good scalability, and excellent compatibility. The code, running logs, and detailed hyperparameter configurations are available at: https://github.com/salmon1802/QNN.
Chinese: 本文通过二次神经网络重新审视哈达玛积在点击率预测中的有效性,提出增强的QNN-alpha模型,该模型在保持高效和兼容性的同时实现了最先进的性能。
English: This paper re-examines the Hadamard Product's effectiveness in CTR prediction through Quadratic Neural Networks, proposing the enhanced QNN-alpha model that achieves state-of-the-art performance with improved efficiency and compatibility.
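The link between the Hadamard product and quadratic neurons discussed above can be sketched as the element-wise product of two linear projections, which yields second-order feature interactions; a "mid-activation" is applied between the product and the output projection rather than after it. The dimensions and layer layout are illustrative, not the QNN-alpha neuron.
```python
# Sketch of a Hadamard-product quadratic neuron with mid-activation
# (an illustrative construction, not the paper's QNN-alpha format).
import torch
import torch.nn as nn

class HadamardQuadraticNeuron(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.u = nn.Linear(d_in, d_hidden)
        self.v = nn.Linear(d_in, d_hidden)
        self.out = nn.Linear(d_hidden, d_out)
        self.mid_act = nn.ReLU()              # mid-activation, not post-activation

    def forward(self, x):
        quad = self.u(x) * self.v(x)          # Hadamard product = quadratic terms
        return self.out(self.mid_act(quad))

x = torch.randn(4, 16)                        # e.g. concatenated feature embeddings
print(HadamardQuadraticNeuron(16, 32, 1)(x).shape)
```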
Authors:Yutong Chen, Jiandong Gao, Ji Wu
Abstract:
R1-style Reinforcement Learning (RL) significantly enhances Large Language Models' reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has substantial influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring the sample effect. Our hypothetical analysis shows the potential to improve SFT efficiency. Guided by our analysis, we propose Re-distillation, a technique that aims to boost the effectiveness of small-scale distillation by sampling from the RL-trained policy. Re-distillation shows consistently surprising efficiency on three datasets and both Qwen and Llama models: re-distilled models matched RL performance with far fewer samples and less computation. As a result, on the K&K dataset, our re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. We demonstrate that re-distillation can be used to efficiently balance multiple goals in RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning.
中文:R1式强化学习提升了大型语言模型的推理能力,而提出的再蒸馏技术通过利用强化学习训练的策略提高了小规模蒸馏的效率,以更少的样本和计算量实现了更优的性能。
English: R1-style Reinforcement Learning boosts reasoning in Large Language Models, and the proposed Re-distillation technique enhances small-scale distillation efficiency by leveraging RL-trained policies, achieving superior performance with fewer samples and computation.
Authors:Simone Gaisbauer, Prabin Gyawali, Qilin Zhang, Olaf Wysocki, Boris Jutzi
Abstract:
Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets, where traditional methods yield only zero to 12 RANSAC inliers and an area under the curve between zero and 0.16. We believe that this work will foster the development of model-based visual localization methods. Link to the code: https://github.com/simBauer/To_Glue_or_not_to_Glue
中文: 本研究系统比较了基于三维建筑模型的视觉定位中传统与可学习特征匹配方法,发现在具有挑战性的数据集上,可学习方法在精度和鲁棒性方面显著优于传统技术。
English: This study systematically compares classical and learnable feature matching methods for visual localization using 3D building models, finding that learnable approaches significantly outperform traditional techniques in accuracy and robustness on challenging datasets.
Authors:Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh
Abstract:
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to 25% across different model sizes. Our code is available at https://github.com/IST-DASLab/ISTA-DASLab-Optimizers.
Chinese: 本研究提出了一种利用离散余弦变换(DCT)矩阵的计算高效方法,通过近似低秩梯度投影来训练大语言模型,在实现与SVD/QR方法相当性能的同时,将运行时间和内存使用量降低了最高达25%。
English: This work introduces a computationally efficient method using Discrete Cosine Transform (DCT) matrices to approximate low-rank gradient projections for training large language models, achieving performance comparable to SVD/QR methods while reducing runtime and memory usage by up to 25%.
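A minimal sketch of the projection idea described above: take a fixed orthonormal DCT basis, rank its columns by alignment with a layer's gradient, and project the gradient onto the top-k columns. The layer size, rank, and alignment score are illustrative assumptions, not the released optimizer's exact procedure.
```python
# Sketch of DCT-based low-rank gradient projection (sizes and the alignment
# heuristic are illustrative assumptions).
import numpy as np
from scipy.fft import dct

n, k = 256, 32
# Orthonormal DCT-II basis: transform the identity; columns form the basis.
dct_basis = dct(np.eye(n), axis=0, norm="ortho")

grad = np.random.default_rng(0).normal(size=(n, 128))   # gradient of a linear layer

# Alignment of each basis vector with the gradient (summed over the output dim).
alignment = np.abs(dct_basis.T @ grad).sum(axis=1)
top = np.argsort(alignment)[-k:]                         # keep the k best-aligned columns
P = dct_basis[:, top]                                    # (n, k) projection matrix

low_rank_grad = P.T @ grad                               # (k, 128) projected gradient
reconstructed = P @ low_rank_grad                        # back to the original space
print(low_rank_grad.shape, reconstructed.shape)
```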
Authors:Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, Xinchao Wang
Abstract:
Large Reasoning Models (LRMs) excel at complex tasks using Chain-of-Thought (CoT) reasoning. However, their tendency to overthinking leads to unnecessarily lengthy reasoning chains, dramatically increasing inference costs. To mitigate this issue, we introduce VeriThinker, a novel approach for CoT compression. Unlike conventional methods that fine-tune LRMs directly on the original reasoning task using synthetic concise CoT data, we innovatively fine-tune the model solely through an auxiliary verification task. By training LRMs to accurately verify the correctness of CoT solutions, the LRMs inherently become more discerning about the necessity of subsequent self-reflection steps, thereby effectively suppressing overthinking. Extensive experiments validate that VeriThinker substantially reduces reasoning chain lengths while maintaining or even slightly improving accuracy. When applied to DeepSeek-R1-Distill-Qwen-7B, our approach reduces reasoning tokens on MATH500 from 3790 to 2125 while improving accuracy by 0.8% (94.0% to 94.8%), and on AIME25, tokens decrease from 14321 to 10287 with a 2.1% accuracy gain (38.7% to 40.8%). Additionally, our experiments demonstrate that VeriThinker can also be zero-shot generalized to speculative reasoning. Code is available at https://github.com/czg1225/VeriThinker
中文: VeriThinker是一种创新的思维链压缩方法,通过仅基于辅助验证任务微调大型推理模型,有效抑制过度思考,在保持或提升准确率的同时显著缩短推理链长度。
English: VeriThinker is a novel CoT compression method that reduces overthinking in Large Reasoning Models by fine-tuning them solely through an auxiliary verification task, effectively shortening reasoning chains while maintaining or improving accuracy.
Authors:Nayoung Kim, Seongsu Kim, Sungsoo Ahn
Abstract:
Designing metal-organic frameworks (MOFs) with novel chemistries is a longstanding challenge due to their large combinatorial space and complex 3D arrangements of the building blocks. While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known local 3D coordinates of building blocks. However, this limits their ability to (1) design novel MOFs and (2) generate the structure using novel building blocks. We propose a two-stage MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train an SMILES-based autoregressive model to generate metal and organic building blocks, paired with a cheminformatics toolkit for 3D structure initialization. Second, we introduce a flow matching model that predicts translations, rotations, and torsional angles to assemble the blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability to create novel building blocks. Our code is available at https://github.com/nayoung10/MOFFlow-2.
Chinese: 该研究提出的两阶段框架通过化学构建块生成和几何结构组装,克服了现有深度生成模型的局限,实现了新型金属有机框架材料的创新设计,并提高了生成精度和独特性。
English: The proposed two-stage framework overcomes the limitations of existing deep generative models by enabling the generation of novel metal-organic frameworks through chemical building block creation and geometric assembly, resulting in improved accuracy and novel MOF designs.
Authors:Bram Grooten, Farid Hasanov, Chenxiang Zhang, Qiao Xiao, Boqian Wu, Zahra Atashgahi, Ghada Sokar, Shiwei Liu, Lu Yin, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu
Abstract:
Model ensembles have long been a cornerstone for improving generalization and robustness in deep learning. However, their effectiveness often comes at the cost of substantial computational overhead. To address this issue, state-of-the-art methods aim to replicate ensemble-class performance without requiring multiple independently trained networks. Unfortunately, these algorithms often still demand considerable compute at inference. In response to these limitations, we introduce $\textbf{NeuroTrails}$, a sparse multi-head architecture with dynamically evolving topology. This unexplored model-agnostic training paradigm improves ensemble performance while reducing the required resources. We analyze the underlying reason for its effectiveness and observe that the various neural trails induced by dynamic sparsity attain a $\textit{Goldilocks zone}$ of prediction diversity. NeuroTrails displays efficacy with convolutional and transformer-based architectures on computer vision and language tasks. Experiments on ResNet-50/ImageNet, LLaMA-350M/C4, among many others, demonstrate increased accuracy and stronger robustness in zero-shot generalization, while requiring significantly fewer parameters.
中文摘要:NeuroTrails采用稀疏多头架构和动态拓扑结构,在提升模型集成性能的同时显著减少资源需求,在多种架构和任务中实现了更高的准确性和更强的鲁棒性。
English Summary: NeuroTrails introduces a sparse multi-head architecture with dynamic topology to enhance ensemble performance efficiently, achieving greater accuracy and robustness with fewer parameters across various models and tasks.
Authors:Hongshu Guo, Zeyuan Ma, Yining Ma, Xinglin Zhang, Wei-Neng Chen, Yue-Jiao Gong
Abstract:
Designing effective black-box optimizers is hampered by limited problem-specific knowledge and manual control that spans months for almost every detail. In this paper, we present DesignX, the first automated algorithm design framework that generates an effective optimizer specific to a given black-box optimization problem within seconds. Rooted in the first principles, we identify two key sub-tasks: 1) algorithm structure generation and 2) hyperparameter control. To enable systematic construction, a comprehensive modular algorithmic space is first built, embracing hundreds of algorithm components collected from decades of research. We then introduce a dual-agent reinforcement learning system that collaborates on structural and parametric design through a novel cooperative training objective, enabling large-scale meta-training across 10k diverse instances. Remarkably, through days of autonomous learning, the DesignX-generated optimizers continuously surpass human-crafted optimizers by orders of magnitude, either on synthetic testbed or on realistic optimization scenarios such as Protein-docking, AutoML and UAV path planning. Further in-depth analysis reveals DesignX's capability to discover non-trivial algorithm patterns beyond expert intuition, which, conversely, provides valuable design insights for the optimization community. We provide DesignX's inference code at https://github.com/MetaEvo/DesignX.
中文摘要:DesignX是一个自动化算法设计框架,通过双智能体强化学习在数秒内生成针对特定黑盒优化问题的有效优化器,其性能在多种实际应用场景中显著超越人工设计的算法。
English Summary: DesignX is an automated framework that rapidly generates specialized black-box optimizers through dual-agent reinforcement learning, consistently outperforming human-designed algorithms across various complex scenarios.
Authors:Wenning Xu, Shiyu Fan, Paul Henderson, Edmond S. L. Ho
Abstract:
Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions still remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneous single-person motion generation conditioned on one other's motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.
Authors:Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract:
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.
中文: Trinity-RFT 是一个通用、统一且易用的强化微调框架,采用模块化设计,整合了多种RFT模式,高效处理智能体与环境交互,并提供优化的数据管道,适用于广泛的应用场景和研究开发。
English: Trinity-RFT is a versatile and user-friendly framework for reinforcement fine-tuning of large language models, featuring a modular design that unifies various RFT modes, integrates agent-environment interactions efficiently, and provides optimized data pipelines for diverse applications and research.
Authors:Patrick Leask, Neel Nanda, Noura Al Moubayed
Abstract:
Sparse autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents. However, due to their substantial training cost, most academic research uses open-source SAEs which are only available for a restricted set of models of up to 27B parameters. SAE latents are also learned from a dataset of activations, which means they do not transfer between models. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models, an alternative method for decomposing language model activations. To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary. ITDAs can be trained in just 1% of the time required for SAEs, using 1% of the data. This allowed us to train ITDAs on Llama-3.1 70B and 405B on a single consumer GPU. ITDAs can achieve similar reconstruction performance to SAEs on some target LLMs, but generally incur a performance penalty. However, ITDA dictionaries enable cross-model comparisons, and a simple Jaccard similarity index on ITDA dictionaries outperforms existing methods like CKA, SVCCA, and relative representation similarity metrics. ITDAs provide a cheap alternative to SAEs where computational resources are limited, or when cross-model comparisons are necessary. Code available at https://github.com/pleask/itda.
中文: ITDA模型作为稀疏自编码器的高效替代方案,能以极低的计算成本分解大语言模型激活,并支持跨模型比较。
English: ITDA models offer a computationally efficient alternative to sparse autoencoders for decomposing LLM activations, enabling cross-model comparisons with minimal training time and data.
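A toy sketch of the greedy dictionary construction described above, assuming unit-norm atoms, a fixed number of matching-pursuit steps, and a residual-norm threshold (all illustrative simplifications rather than the released ITDA code):

```python
# Minimal illustration: greedily keep the activations that the current
# dictionary explains worst under matching pursuit.
import numpy as np

def matching_pursuit(x, D, n_steps=8):
    """Greedy sparse approximation of x using (unit-norm) dictionary rows D."""
    residual, approx = x.copy(), np.zeros_like(x)
    for _ in range(min(n_steps, len(D))):
        scores = D @ residual                 # correlation with each atom
        k = np.argmax(np.abs(scores))
        approx += scores[k] * D[k]
        residual -= scores[k] * D[k]
    return approx, residual

def build_dictionary(activations, dict_size, n_steps=8):
    """Add an activation as a new atom when it is poorly approximated."""
    D = [activations[0] / np.linalg.norm(activations[0])]
    for a in activations[1:]:
        _, residual = matching_pursuit(a, np.stack(D), n_steps)
        if len(D) < dict_size and np.linalg.norm(residual) > 0.5 * np.linalg.norm(a):
            D.append(a / np.linalg.norm(a))
    return np.stack(D)

acts = np.random.randn(500, 64)               # stand-in for LLM activations
print(build_dictionary(acts, dict_size=128).shape)
```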
Authors:M. Emre Sahin, Edoardo Altamura, Oscar Wallis, Stephen P. Wood, Anton Dekusar, Declan A. Millar, Takashi Imamichi, Atsushi Matsuo, Stefano Mensa
Abstract:
We present Qiskit Machine Learning (ML), a high-level Python library that combines elements of quantum computing with traditional machine learning. The API abstracts Qiskit's primitives to facilitate interactions with classical simulators and quantum hardware. Qiskit ML started as a proof-of-concept code in 2019 and has since been developed to be a modular, intuitive tool for non-specialist users while allowing extensibility and fine-tuning controls for quantum computational scientists and developers. The library is available as a public, open-source tool and is distributed under the Apache version 2.0 license.
中文: Qiskit机器学习是一个Python库,它将量子计算与传统机器学习相结合,为普通用户和专家提供了直观的API,以便与模拟器和量子硬件进行交互。
English: Qiskit Machine Learning is a Python library that integrates quantum computing with classical machine learning, offering an intuitive API for both general users and experts to work with simulators and quantum hardware.
Authors:Zeyuan Ma, Yue-Jiao Gong, Hongshu Guo, Wenjie Qiu, Sijie Ma, Hongqiao Lian, Jiajun Zhan, Kaixu Chen, Chen Wang, Zhiyang Huang, Zechuan Huang, Guojun Peng, Ran Cheng, Yining Ma
Abstract:
Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keeps pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (https://github.com/MetaEvo/MetaBox) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, by which we reproduce 23 up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by 10-40x; 3) a comprehensive benchmark suite of 18 synthetic/realistic tasks (1900+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and integrating with external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of the optimization performance, generalization ability and learning efficiency. Valuable insights are drawn from thorough and detailed analysis for practitioners and those new to the field.
中文:MetaBox-v2作为升级版开源框架,通过统一架构支持多种优化方法,显著提升并行效率,并扩展了涵盖多类优化场景的基准测试,为实践提供全面分析工具。
English: MetaBox-v2 is an upgraded open-source framework that introduces a unified architecture supporting multiple optimization approaches, significantly improves efficiency with faster parallelization, and expands benchmarking to diverse optimization scenarios for comprehensive analysis.
Authors:Dong-Hee Kim, Hyunjee Song, Donghyun Kim
Abstract:
Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at https://github.com/UTLLab/SynRES.
Chinese: WildRES作为一个新基准,通过引入复杂查询和多领域场景严格评估指代表达分割模型,揭示了现有模型的性能局限,并推动了SynRES自动化合成训练数据流程的开发,从而显著提升了模型的精确度。
English: WildRES is a new benchmark that introduces complex queries and diverse domains to rigorously evaluate referring expression segmentation models, revealing their performance limitations and prompting the development of SynRES, an automated pipeline that generates synthetic training data to significantly enhance model accuracy.
Authors:Tianheng Ling, Chao Qian, Lukas Johannes Haßler, Gregor Schiele
Abstract:
Transformer-based models have shown strong performance across diverse time-series tasks, but their deployment on resource-constrained devices remains challenging due to high memory and computational demand. While prior work targeting Microcontroller Units (MCUs) has explored hardware-specific optimizations, such approaches are often task-specific and limited to 8-bit fixed-point precision. Field-Programmable Gate Arrays (FPGAs) offer greater flexibility, enabling fine-grained control over data precision and architecture. However, existing FPGA-based deployments of Transformers for time-series analysis typically focus on high-density platforms with manual configuration. This paper presents a unified and fully automated deployment framework for Tiny Transformers on embedded FPGAs. Our framework supports a compact encoder-only Transformer architecture across three representative time-series tasks (forecasting, classification, and anomaly detection). It combines quantization-aware training (down to 4 bits), hardware-aware hyperparameter search using Optuna, and automatic VHDL generation for seamless deployment. We evaluate our framework on six public datasets across two embedded FPGA platforms. Results show that our framework produces integer-only, task-specific Transformer accelerators achieving as low as 0.033 mJ per inference with millisecond latency on AMD Spartan-7, while also providing insights into deployment feasibility on Lattice iCE40. All source code will be released in the GitHub repository (https://github.com/Edwina1030/TinyTransformer4TS).
中文摘要:本文提出了一种自动化框架,可在嵌入式FPGA上部署精简量化Transformer,通过硬件感知优化在多种时序任务中实现低功耗推理与高效性能。
English Summary: This paper introduces an automated framework for deploying compact, quantized Transformers on embedded FPGAs, achieving low-energy inference across multiple time-series tasks with optimized hardware performance.
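As a rough illustration of the hardware-aware search step, the sketch below runs an Optuna study over model width, head count, and quantization bit-width; `train_and_estimate` is a hypothetical stand-in for the framework's quantization-aware training and FPGA resource estimation, and the energy budget is an assumed value.

```python
# Hedged sketch of hardware-aware hyperparameter search with Optuna.
import optuna

def train_and_estimate(d_model, n_heads, bits):
    # Placeholder: would train a quantized Tiny Transformer and estimate
    # (validation_loss, energy_per_inference_mJ) on the target FPGA.
    return 1.0 / (d_model * bits), 0.0005 * d_model * n_heads * bits

def objective(trial):
    d_model = trial.suggest_categorical("d_model", [16, 32, 64])
    n_heads = trial.suggest_int("n_heads", 1, 4)
    bits = trial.suggest_int("bits", 4, 8)          # quantization precision
    loss, energy = train_and_estimate(d_model, n_heads, bits)
    # Penalize configurations exceeding an assumed energy budget of 0.05 mJ.
    return loss + 10.0 * max(0.0, energy - 0.05)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```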
Authors:Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Abstract:
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2\% accuracy on MATH500 when pruned to 8/128 configuration (50\% expert reduction), and still achieves 72.0\% with aggressive 8/32 pruning (87.5\% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15\% on MATH500 and 81.3\% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95\% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
Chinese: PreMoe框架通过概率性专家剪枝和任务自适应专家检索技术,仅动态加载任务关键专家,在显著降低内存占用的同时保持模型精度,实现了大规模混合专家模型在内存受限环境中的高效部署。
English: PreMoe introduces probabilistic expert pruning and task-adaptive expert retrieval to enable efficient deployment of large MoE models in memory-constrained environments by dynamically loading only task-critical experts, achieving near-original accuracy with dramatically reduced memory usage.
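The abstract defines expert importance through a task-conditioned expected selection score computed from router logits. The sketch below shows one plausible reading of that idea (average top-k routing mass per expert over a task's calibration tokens); the exact TCESS formula is given in the paper, so treat this purely as an illustration.

```python
# Illustrative task-conditioned expert scoring from router logits.
import torch

def expert_scores(router_logits: torch.Tensor, top_k: int = 8) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) collected on a task's calibration set."""
    probs = torch.softmax(router_logits, dim=-1)
    topk = torch.topk(probs, k=top_k, dim=-1)
    mask = torch.zeros_like(probs).scatter_(-1, topk.indices, topk.values)
    return mask.mean(dim=0)                       # (num_experts,) importance per expert

logits = torch.randn(4096, 128)                   # toy router logits for one MoE layer
scores = expert_scores(logits)
keep = torch.topk(scores, k=8).indices            # experts to load for this task (8/128)
print(keep)
```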
Authors:Joakim Edin, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe
Abstract:
Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at https://github.com/JoakimEdin/gim.
中文: 该摘要介绍了梯度交互修正(GIM)这一新技术,它通过解决注意力机制中的自我修复现象,显著提升了大型语言模型可解释方法的可靠性,并在多种模型和任务中验证了其有效性。
English: The abstract introduces Gradient Interaction Modifications (GIM), a novel technique that addresses self-repair within attention mechanisms to enhance the faithfulness of interpretability methods in large language models, demonstrating significant improvements across various models and tasks.
Authors:Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan
Abstract:
As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose \textbf{Evo}lutionary \textbf{Search} (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.
中文: 测试时缩放(TTS)通过在推理阶段增加计算来经济地提升生成模型性能,而提出的EvoSearch方法无需额外训练即可有效增强扩散和流模型的图像与视频生成效果。
English: Test-time scaling (TTS) offers a cost-effective way to boost generative model performance by adding computation during inference, and the proposed EvoSearch method effectively enhances image and video generation for diffusion and flow models without extra training.
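A generic sketch of evolutionary test-time search over initial noise, with selection and Gaussian mutation; `denoise_from` and `reward` are hypothetical placeholders for a diffusion/flow sampler and a quality scorer, and the real EvoSearch applies SDE-tailored operators along the denoising trajectory, which this toy loop does not capture.

```python
# Toy evolutionary search over noise candidates for a generative sampler.
import torch

def denoise_from(noise):                  # placeholder sampler: noise -> sample
    return torch.tanh(noise)

def reward(samples):                      # placeholder scorer (e.g. an aesthetic model)
    return -((samples - 0.5) ** 2).mean(dim=(1, 2, 3))

def evo_search(pop_size=16, n_gens=5, elite=4, mutation_scale=0.3, shape=(3, 32, 32)):
    population = torch.randn(pop_size, *shape)                   # initial noise candidates
    for _ in range(n_gens):
        fitness = reward(denoise_from(population))
        parents = population[fitness.topk(elite).indices]        # selection
        children = parents.repeat(pop_size // elite, 1, 1, 1)
        children += mutation_scale * torch.randn_like(children)  # mutation keeps diversity
        population = children
    best = population[reward(denoise_from(population)).argmax()]
    return denoise_from(best.unsqueeze(0))

print(evo_search().shape)                 # torch.Size([1, 3, 32, 32])
```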
Authors:Vendi Ardianto Nugroho, Byung Moo Lee
Abstract:
Millimeter-wave (mmWave) communication enables high data rates for cellular-connected Unmanned Aerial Vehicles (UAVs). However, a robust beam management remains challenging due to significant path loss and the dynamic mobility of UAVs, which can destabilize the UAV-base station (BS) link. This research presents a GPS-aided deep learning (DL) model that simultaneously predicts current and future optimal beams for UAV mmWave communications, maintaining a Top-1 prediction accuracy exceeding 70% and an average power loss below 0.6 dB across all prediction steps. These outcomes stem from a proposed data set splitting method ensuring balanced label distribution, paired with a GPS preprocessing technique that extracts key positional features, and a DL architecture that maps sequential position data to beam index predictions. The model reduces overhead by approximately 93% (requiring the training of 2 ~ 3 beams instead of 32 beams) with 95% beam prediction accuracy guarantees, and ensures 94% to 96% of predictions exhibit mean power loss not exceeding 1 dB.
Chinese: 本研究提出一种GPS辅助的深度学习模型,用于预测无人机毫米波通信的最佳波束,实现了超过70%的准确率,在将开销降低93%的同时保持了极低的功率损耗。
English: This study introduces a GPS-assisted deep learning model that predicts optimal beams for UAV mmWave communications, achieving over 70% accuracy and reducing overhead by 93% while maintaining minimal power loss.
Authors:Rafał Karczewski, Markus Heinonen, Alison Pouplin, Søren Hauberg, Vikas Garg
Abstract:
We present a novel perspective on diffusion models using the framework of information geometry. We show that the set of noisy samples, taken across all noise levels simultaneously, forms a statistical manifold -- a family of denoising probability distributions. Interpreting the noise level as a temporal parameter, we refer to this manifold as spacetime. This manifold naturally carries a Fisher-Rao metric, which defines geodesics -- shortest paths between noisy points. Notably, this family of distributions is exponential, enabling efficient geodesic computation even in high-dimensional settings without retraining or fine-tuning. We demonstrate the practical value of this geometric viewpoint in transition path sampling, where spacetime geodesics define smooth sequences of Boltzmann distributions, enabling the generation of continuous trajectories between low-energy metastable states. Code is available at: https://github.com/Aalto-QuML/diffusion-spacetime-geometry.
中文摘要:本研究提出了一种基于信息几何框架的扩散模型新视角,揭示了噪声样本构成具有Fisher-Rao度量的统计流形,可实现高效测地线计算,并在过渡路径采样等应用中展现实用价值。
English Summary: This study introduces a novel information geometry framework for diffusion models, revealing that noisy samples form a statistical manifold with a Fisher-Rao metric that enables efficient geodesic computation for practical applications like transition path sampling.
Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew C Yao
Abstract:
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.
中文: 策略梯度算法提升了大型语言模型的推理能力,本研究提出的正则化策略梯度(RPG)框架统一了KL正则化变体,修正了离策略加权问题,并在数学推理基准测试中显著提高了训练稳定性和准确性。
English: Policy gradient algorithms enhance large language models' reasoning, and this study introduces the Regularized Policy Gradient (RPG) framework to unify KL regularization variants, correct off-policy weighting issues, and improve training stability and accuracy on benchmarks.
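To illustrate the kind of objective discussed above, here is a minimal PyTorch sketch of a REINFORCE-style loss with a KL penalty estimated from per-token log-probabilities, switchable between a reverse-KL estimator and an importance-weighted forward-KL estimator. The sequence-level averaging and the coefficient are simplifying assumptions, not the RPG surrogate losses themselves.

```python
# Hedged sketch of a KL-regularized policy-gradient loss.
import torch

def kl_regularized_pg_loss(logp, logp_ref, advantages, beta=0.05, kl_type="reverse"):
    """logp, logp_ref: (batch, seq) token log-probs under current and reference policies."""
    diff = logp - logp_ref                                  # log pi_theta - log pi_ref
    if kl_type == "reverse":                                # KL(pi_theta || pi_ref), k1 estimator
        kl = diff.mean(dim=-1)
    else:                                                   # KL(pi_ref || pi_theta) via importance weights
        r = torch.exp(-diff)                                # pi_ref / pi_theta
        kl = (r * (-diff)).mean(dim=-1)
    pg = -(advantages.detach() * logp.mean(dim=-1))         # REINFORCE term on sampled responses
    return (pg + beta * kl).mean()

logp = torch.randn(4, 32, requires_grad=True)
loss = kl_regularized_pg_loss(logp, torch.randn(4, 32), torch.randn(4))
loss.backward()
```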
Authors:Zhining Liu, Zihao Li, Ze Yang, Tianxin Wei, Jian Kang, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Abstract:
Class-imbalanced learning (CIL) on tabular data is important in many real-world applications where the minority class holds the critical but rare outcomes. In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data. CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms. Built on a high-quality open-source Python package with unified API designs, detailed documentation, and rigorous code quality controls, CLIMB supports easy implementation and comparison between different CIL algorithms. Through extensive experiments, we provide practical insights on method accuracy and efficiency, highlighting the limitations of naive rebalancing, the effectiveness of ensembles, and the importance of data quality. Our code, documentation, and examples are available at https://github.com/ZhiningLiu1998/imbalanced-ensemble.
中文: 本文提出了CLIMB,一个针对表格数据类别不平衡学习的全面基准,包含73个数据集和29种算法,提供了方法有效性和效率的实用见解,同时揭示了简单重平衡的局限性和数据质量的重要性。
English: This paper introduces CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data that includes 73 datasets and 29 algorithms, offering practical insights into method effectiveness and efficiency while highlighting the limitations of naive rebalancing and the importance of data quality.
Authors:Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du
Abstract:
Modern industry-scale data centers need to manage a large number of virtual machines (VMs). Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs, a practice commonly referred to as VM rescheduling. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, due to dynamic VM state changes during this period. This causes existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VM2RL, which incorporates a set of customized techniques, such as a two-stage framework that accommodates diverse constraints and workload conditions, a feature extraction module that captures relational information specific to rescheduling, as well as a risk-seeking evaluation enabling users to optimize the trade-off between latency and accuracy. We conduct extensive experiments with data from an industry-scale data center. Our results show that VM2RL can achieve a performance comparable to the optimal solution but with a running time of seconds. Code and datasets are open-sourced: https://github.com/zhykoties/VMR2L_eurosys, https://drive.google.com/drive/folders/1PfRo1cVwuhH30XhsE2Np3xqJn2GpX5qy.
中文: 现代数据中心因虚拟机管理导致资源碎片化,为此开发了VM2RL强化学习系统,通过高效的重调度优化性能并显著缩短运行时间。
English: Modern data centers face challenges with resource fragmentation from virtual machine (VM) management, prompting the development of VM2RL, a reinforcement learning system that efficiently reschedules VMs to optimize performance and reduce runtime.
Authors:Hitesh Laxmichand Patel, Amit Agarwal, Arion Das, Bhargava Kumar, Srikant Panda, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Dong-Kyu Chae
Abstract:
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
中文: 企业采用大语言模型处理通信任务时需确保其理解文化背景并安全回应,为此推出SweEval基准测试模型对不当指令的遵循情况,以评估伦理对齐并降低风险。
English: Enterprises are adopting LLMs for communication tasks but need them to handle cultural contexts safely, so SweEval benchmark tests model compliance with inappropriate instructions to ensure ethical alignment and reduce risks.
Authors:Amit Agarwal, Srikant Panda, Kulbhushan Pachauri
Abstract:
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : https://github.com/oracle-samples/fs-dag
中文: FS-DAG是一种针对少样本场景下视觉丰富文档理解的可扩展高效模型,通过模块化领域专用架构以不足9000万参数处理OCR错误和领域偏移,在信息抽取任务中展现出卓越性能和更快收敛速度。
English: FS-DAG is a scalable and efficient model for visually rich document understanding in few-shot settings, leveraging modular domain-specific backbones to handle OCR errors and domain shifts with under 90M parameters, demonstrating superior performance and faster convergence in information extraction tasks.
Authors:Harim Kim, Yuhan Wang, Minkyu Ahn, Heeyoul Choi, Yuyin Zhou, Charmgil Hong
Abstract:
Unsupervised anomaly detection (UAD) in medical imaging is crucial for identifying pathological abnormalities without requiring extensive labeled data. However, existing diffusion-based UAD models rely solely on imaging features, limiting their ability to distinguish between normal anatomical variations and pathological anomalies. To address this, we propose Diff3M, a multi-modal diffusion-based framework that integrates chest X-rays and structured Electronic Health Records (EHRs) for enhanced anomaly detection. Specifically, we introduce a novel image-EHR cross-attention module to incorporate structured clinical context into the image generation process, improving the model's ability to differentiate normal from abnormal features. Additionally, we develop a static masking strategy to enhance the reconstruction of normal-like images from anomalies. Extensive evaluations on CheXpert and MIMIC-CXR/IV demonstrate that Diff3M achieves state-of-the-art performance, outperforming existing UAD methods in medical imaging. Our code is available at https://github.com/nth221/Diff3M
中文摘要:Diff3M是一种创新的多模态框架,通过将胸部X光与电子健康记录相结合,并利用交叉注意力模块增强医学影像中的无监督异常检测,在基准数据集上实现了最先进的性能。
English Summary: Diff3M is a novel multi-modal framework that enhances unsupervised anomaly detection in medical imaging by integrating chest X-rays with Electronic Health Records through a cross-attention module, achieving state-of-the-art performance on benchmark datasets.
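The image-EHR cross-attention idea can be sketched as below; layer sizes, the residual-plus-norm arrangement, and the module name are assumptions for illustration rather than the published Diff3M block.

```python
# Illustrative cross-attention block letting image tokens attend to EHR embeddings.
import torch
import torch.nn as nn

class ImageEHRCrossAttention(nn.Module):
    def __init__(self, dim=256, ehr_dim=64, n_heads=4):
        super().__init__()
        self.ehr_proj = nn.Linear(ehr_dim, dim)          # lift EHR features to image dim
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, ehr_feats):
        """img_tokens: (B, N, dim) image patches; ehr_feats: (B, M, ehr_dim) clinical fields."""
        ehr = self.ehr_proj(ehr_feats)
        attended, _ = self.attn(query=img_tokens, key=ehr, value=ehr)
        return self.norm(img_tokens + attended)          # residual fusion

block = ImageEHRCrossAttention()
out = block(torch.randn(2, 196, 256), torch.randn(2, 12, 64))
print(out.shape)  # torch.Size([2, 196, 256])
```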
Authors:Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Abstract:
Large language models suffer issues when operated on long contexts that are larger than their training context length due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart will rarely have an effect on each other and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution of grouping consecutive tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances. Our model had an increase in performance of up to 12% compared to the LongLM extension method in LEval (specifically on the Qwen model). On summarization related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than the LongLM. Our code is available at https://github.com/alexeipc/SELF-LLM.
中文:SELF方法通过逻辑函数对连续标记进行分组来扩展语言模型的有效上下文长度,在多项长文本任务中相比现有方法最高实现了12%的性能提升。
English: The SELF method extends language models' effective context length by grouping tokens with a logistic function, achieving performance improvements of up to 12% over existing methods on various long-context tasks.
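The abstract describes constant group sizes at small relative distances and logistically growing group sizes farther away. The sketch below shows the shape of such a rule; the cutoff, capacity, growth rate, and midpoint are made-up constants, not the paper's fitted logistic capacity equation.

```python
# Illustrative distance-dependent token grouping with logistic growth.
import math

def group_size(distance, base=1, capacity=64, growth=0.002, midpoint=4096):
    """Return how many consecutive tokens are merged at a given relative distance."""
    if distance <= 512:                       # near tokens: no grouping (assumed cutoff)
        return base
    logistic = capacity / (1 + math.exp(-growth * (distance - midpoint)))
    return max(base, int(logistic))

for d in [128, 1024, 4096, 16384, 65536]:
    print(d, group_size(d))
```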
Authors:Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
Abstract:
Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA
中文摘要:JanusDNA是一种创新的双向DNA基础模型,通过结合Mamba-注意力-专家混合的混合架构,将高效的自回归训练与双向理解能力相融合,能够处理长达100万个碱基对,并在基因组表征基准测试中取得了最先进的性能表现。
English Summary: JanusDNA is a novel bidirectional DNA foundation model that combines efficient autoregressive training with bidirectional comprehension using a hybrid Mamba-Attention-MoE architecture, enabling processing of up to 1 million base pairs while achieving state-of-the-art results on genomic benchmarks.
Authors:Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, Liangming Pan
Abstract:
Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.
中文: 本文提出一种无超参数简洁度评分方法,通过强化学习框架优化推理路径的准确性与简洁性,在多个数据集上实现最优效率-准确率平衡,并能根据问题难度动态调整推理长度。
English: This paper introduces a hyperparameter-free conciseness score integrated into a reinforcement learning framework to optimize reasoning traces for both correctness and brevity, achieving superior efficiency-accuracy trade-offs across multiple datasets while dynamically adapting to problem difficulty.
Authors:Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin
Abstract:
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50\% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
中文: OCR-Reasoning基准测试旨在系统评估多模态大语言模型在文本丰富图像推理任务中的表现,结果显示即使最先进的模型也面临巨大挑战,准确率均未超过50%,凸显了解决这一问题的紧迫性。
English: The OCR-Reasoning benchmark is introduced to systematically evaluate Multimodal Large Language Models on text-rich image reasoning tasks, revealing that even state-of-the-art models struggle significantly with accuracy below 50%, highlighting an urgent need for improvement in this area.
Authors:Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Abstract:
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5\% and 8.0\% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4\% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23\%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
中文: RAVEN通过其核心组件QuART实现了跨模态查询条件门控机制,能动态评估各模态标记的相关性以增强有效信号并抑制干扰,在多模态问答基准测试中显著提升准确率,并在模态受损情况下保持优异鲁棒性。
English: RAVEN introduces QuART, a query-conditioned cross-modal gating module that dynamically weights tokens across modalities to enhance relevant signals and suppress distractors, achieving significant accuracy improvements on multimodal QA benchmarks while maintaining robustness against modality corruption.
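A minimal sketch of query-conditioned token gating for one modality stream: each token receives a scalar relevance score from a bilinear comparison with the question embedding and is scaled before fusion. Dimensions, the bilinear scorer, and the sigmoid gate are illustrative assumptions, not the QuART module.

```python
# Toy query-conditioned gating of modality tokens prior to fusion.
import torch
import torch.nn as nn

class QueryConditionedGate(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)             # token-vs-query relevance

    def forward(self, tokens, query):
        """tokens: (B, T, dim) from one modality; query: (B, dim) question embedding."""
        q = query.unsqueeze(1).expand(-1, tokens.size(1), -1)
        gate = torch.sigmoid(self.score(tokens, q))       # (B, T, 1) in [0, 1]
        return gate * tokens                              # amplify or suppress tokens

gate = QueryConditionedGate()
video = torch.randn(2, 50, 256)
question = torch.randn(2, 256)
print(gate(video, question).shape)   # torch.Size([2, 50, 256])
```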
Authors:Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Abstract:
Large Language Model (LLM) agents can automate cybersecurity tasks and can adapt to the evolving cybersecurity landscape without re-engineering. While LLM agents have demonstrated cybersecurity capabilities on Capture-The-Flag (CTF) competitions, they have two key limitations: accessing latest cybersecurity expertise beyond training data, and integrating new knowledge into complex task planning. Knowledge-based approaches that incorporate technical understanding into the task-solving automation can tackle these limitations. We present CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task-critical information, iterative self-reflected knowledge retrieval, and knowledge-hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different configurations show CRAKEN's effectiveness in multi-stage vulnerability detection and exploitation compared to previous approaches. Our extensible architecture establishes new methodologies for embedding new security knowledge into LLM-driven cybersecurity agentic systems. With a knowledge database of CTF writeups, CRAKEN obtained an accuracy of 22% on NYU CTF Bench, outperforming prior works by 3% and achieving state-of-the-art results. On evaluation of MITRE ATT&CK techniques, CRAKEN solves 25-30% more techniques than prior work, demonstrating improved cybersecurity capabilities via knowledge-based execution. We make our framework open source to public https://github.com/NYU-LLM-CTF/nyuctf_agents_craken.
中文: CRAKEN作为一种基于知识的大型语言模型代理框架,通过整合情境分解、迭代知识检索和自适应策略注入,提升了网络安全自动化能力,在漏洞检测和攻击技术执行方面实现了顶尖性能。
English: CRAKEN, a knowledge-based LLM agent framework, enhances cybersecurity automation by integrating contextual decomposition, iterative knowledge retrieval, and adaptive strategy injection, achieving state-of-the-art performance in vulnerability detection and attack technique execution.
Authors:Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen
Abstract:
Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
中文: 合成数据强化学习提出了一种仅通过任务定义生成合成数据来微调模型的框架,在多个基准测试中取得显著性能提升,同时减少了对人工标注数据的依赖。
English: Synthetic Data RL introduces a framework that fine-tunes models using only synthetic data generated from task definitions, achieving significant performance improvements across various benchmarks while reducing reliance on human-labeled data.
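One way to picture the pass-rate-based question selection is the toy loop below, which samples the model several times per synthetic question and keeps questions whose average pass rate lands in a mid-range band; the band thresholds and exact-match scoring are assumptions, not the paper's selection rule.

```python
# Hedged sketch of selecting synthetic questions by average pass rate.
import random

def pass_rate(model_answers, reference):
    return sum(a == reference for a in model_answers) / len(model_answers)

def select_questions(questions, sample_fn, n_samples=8, low=0.1, high=0.9):
    selected = []
    for q in questions:
        answers = [sample_fn(q["prompt"]) for _ in range(n_samples)]
        rate = pass_rate(answers, q["answer"])
        if low <= rate <= high:               # keep questions that are neither trivial nor hopeless
            selected.append(q)
    return selected

# Toy usage with a random "model".
toy_questions = [{"prompt": f"q{i}", "answer": "42"} for i in range(5)]
picked = select_questions(toy_questions, lambda p: random.choice(["42", "7"]))
print(len(picked))
```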
Authors:Jingzhi Hu, Geoffrey Ye Li
Abstract:
Future networks are envisioned to connect massive artificial intelligence (AI) agents, enabling their extensive collaboration on diverse tasks. Compared to traditional entities, these agents naturally suit the semantic communication (SC), which can significantly enhance the bandwidth efficiency. Nevertheless, SC requires the knowledge among agents to be aligned, while agents have distinct expert knowledge for their individual tasks in practice. In this paper, we propose a distillation-enabled knowledge alignment protocol (DeKAP), which distills the expert knowledge of each agent into parameter-efficient low-rank matrices, allocates them across the network, and allows agents to simultaneously maintain aligned knowledge for multiple tasks. We formulate the joint minimization of alignment loss, communication overhead, and storage cost as a large-scale integer linear programming problem and develop a highly efficient greedy algorithm. From computer simulation, the DeKAP establishes knowledge alignment with the lowest communication and computation resources compared to conventional approaches.
中文: 未来网络将连接大量AI代理以协作完成任务,得益于语义通信提升带宽效率,但需解决知识对齐问题,本文提出的DeKAP协议通过知识蒸馏实现资源最小化的高效对齐。
English: Future networks will connect numerous AI agents for collaborative tasks, benefiting from semantic communication that enhances bandwidth efficiency, but require knowledge alignment which is addressed by the proposed DeKAP protocol using distillation to minimize resources while maintaining performance.
Authors:Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Abstract:
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
中文摘要:GoT-R1是一个强化学习框架,通过让模型自主开发复杂文本提示的推理策略,并采用统一奖励机制评估语义对齐与空间精度,显著提升了多对象空间关系和属性绑定的图像生成能力。
English Summary: GoT-R1 is a reinforcement learning framework that enhances visual generation by enabling models to autonomously develop reasoning strategies for complex text prompts, achieving superior performance in spatial relationships and attribute binding through a unified reward system.
Authors:Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Abstract:
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文: 本研究首次对自回归图像生成中的GRPO和DPO强化学习算法进行全面分析,揭示了它们各自的优势,并证明具有更强泛化能力的奖励模型能够同时提升域内性能和跨域泛化能力。
English: This study provides the first comprehensive analysis of GRPO and DPO reinforcement learning algorithms in autoregressive image generation, revealing their distinct advantages and demonstrating how reward models with stronger generalization capabilities can enhance both in-domain performance and out-of-domain generalization.
Authors:Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
Abstract:
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation.
RIPT-VLA has the following characteristics. First, it applies to various VLA models, improving the lightweight QueST model by 21.2% and bringing the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA raises an unworkable SFT model (4% success rate) to a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
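The leave-one-out advantage estimation mentioned above can be written in a few lines: for K rollouts of the same task, each rollout's baseline is the mean reward of the other K-1 rollouts. This is a minimal sketch of that estimator, not the full RIPT-VLA optimization.

```python
# Minimal leave-one-out advantage estimation over K rollouts of one task.
import torch

def leave_one_out_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (K,) sparse binary success rewards for K rollouts of one task."""
    K = rewards.numel()
    baseline = (rewards.sum() - rewards) / (K - 1)   # mean of the other rollouts
    return rewards - baseline

print(leave_one_out_advantages(torch.tensor([1.0, 0.0, 1.0, 1.0])))
# tensor([ 0.3333, -1.0000,  0.3333,  0.3333])
```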
Authors:Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen
Abstract:
Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
Authors:Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar
Abstract:
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at https://github.com/neuraloperator/FunDPS
中文摘要:本文提出FunDPS框架,通过函数空间扩散模型和即插即用引导机制,在仅有3%观测数据的极端条件下实现PDE反问题的高精度求解,其独立于离散化的特性为PDE正问题和反问题提供了首个实用且灵活的解决方案。
English Summary: This paper introduces FunDPS, a discretization-agnostic diffusion framework for solving PDE inverse problems that accurately recovers full solutions from minimal data using neural operators and gradient guidance, achieving significant accuracy improvements with fewer sampling steps.
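The plug-and-play guidance described above is in the spirit of diffusion posterior sampling; a minimal sketch of one guided denoising step is shown below, with placeholder `denoiser` and `observe` callables standing in for the paper's neural-operator and function-space machinery:

```python
import torch

def guided_denoise_step(x_t, t, denoiser, observe, y_obs, guidance_scale=1.0):
    """One DPS-style guided step: denoise, then nudge the sample so that the
    denoised estimate better matches the sparse observations y_obs.

    `denoiser(x_t, t)` returns an estimate x0_hat of the clean solution field;
    `observe(x0)` extracts the values at the sparse measurement locations.
    Both are placeholders for the trained model and the masking operator.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)
    data_loss = ((observe(x0_hat) - y_obs) ** 2).sum()
    grad = torch.autograd.grad(data_loss, x_t)[0]
    return x0_hat, x_t - guidance_scale * grad
```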
Authors:Aleksandra Franz, Hao Wei, Luca Guastoni, Nils Thuerey
Abstract:
Despite decades of advancements, the simulation of fluids remains one of the most challenging areas in scientific computing. Driven by the need for gradient information in deep learning, differentiable simulators have emerged as an effective tool for optimization and learning in physics simulations. In this work, we present our fluid simulator PICT, a differentiable pressure-implicit solver coded in PyTorch with Graphics-processing-unit (GPU) support. We first verify the accuracy of both the forward simulation and our derived gradients in various established benchmarks like lid-driven cavities and turbulent channel flows before we show that the gradients provided by our solver can be used to learn complicated turbulence models in 2D and 3D. We apply both supervised and unsupervised training regimes using physical priors to match flow statistics. In particular, we learn a stable sub-grid scale (SGS) model for a 3D turbulent channel flow purely based on reference statistics. The low-resolution corrector trained with our solver runs substantially faster than the highly resolved references, while keeping or even surpassing their accuracy. Finally, we give additional insights into the physical interpretation of different solver gradients, and motivate a physically informed regularization technique. To ensure that the full potential of PICT can be leveraged, it is published as open source: https://github.com/tum-pbs/PICT.
中文: PICT是一种基于PyTorch的GPU加速可微分流体模拟器,通过监督和无监督训练有效学习湍流模型,在保持甚至超越高分辨率基准精度的同时大幅提升计算速度,并已开源发布。
English: PICT is a GPU-accelerated differentiable fluid simulator in PyTorch that enables efficient learning of turbulence models through supervised and unsupervised training, achieving higher speed and accuracy than high-resolution references while being open-sourced.
Authors:Moru Liu, Hao Dong, Jessica Kelly, Olga Fink, Mario Trapp
Abstract:
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.
Chinese: 本文提出特征混合法,一种快速通用的多模态异常合成方法,有效提升多数据集上的OOD检测性能,在实现最优表现的同时大幅提升运算速度。
English: This paper introduces Feature Mixing, a fast and versatile method for multimodal outlier synthesis that enhances OOD detection across various datasets, achieving top performance with significant speed improvements.
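As a rough illustration of the general idea of synthesizing outliers by mixing in-distribution features (the paper's exact multimodal formulation and theoretical treatment differ), one might write:

```python
import torch

def feature_mixing(feat_a: torch.Tensor, feat_b: torch.Tensor, alpha: float = 0.5):
    """Synthesize pseudo-outlier features by convexly mixing features drawn
    from two modalities (or two shuffled in-distribution batches).

    feat_a, feat_b: (batch, dim) feature tensors. The mixed features fall off
    the in-distribution manifold and can serve as OOD surrogates when training
    the model to lower its confidence on them.
    """
    perm = torch.randperm(feat_b.shape[0])
    return alpha * feat_a + (1 - alpha) * feat_b[perm]
```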
Authors:Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Abstract:
Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
中文: 尽管大型语言模型的安全性有所进步,对抗性攻击仍能有效引发有害输出,而提出的MixAT方法在训练中结合离散和连续攻击,以最小计算开销显著提升模型鲁棒性。
English: Despite advancements in Large Language Model (LLM) safety, adversarial attacks still effectively induce harmful outputs, and the proposed MixAT method combines discrete and continuous attacks during training to significantly enhance robustness with minimal computational overhead.
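The proposed ALO-ASR metric has a straightforward reading: a prompt counts as compromised if at least one attack in the suite succeeds. A small sketch, assuming a boolean success matrix:

```python
import numpy as np

def alo_asr(success: np.ndarray) -> float:
    """At Least One Attack Success Rate.

    success: boolean array of shape (num_prompts, num_attacks), where
    success[i, j] is True if attack j produced a harmful completion for
    prompt i. A prompt counts as broken if *any* attack succeeds on it.
    """
    return success.any(axis=1).mean()

# Example: 3 prompts evaluated under 2 attacks
success = np.array([[True, False], [False, False], [False, True]])
print(alo_asr(success))  # 0.666...
```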
Authors:Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Abstract:
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.
中文: LLaDA-V是一种基于扩散的多模态模型,通过将视觉指令调整与掩码扩散模型相结合,在文本任务能力较弱的情况下仍展现出竞争力的多模态性能,并在同类扩散模型中实现了最先进的效果。
English: LLaDA-V is a diffusion-based multimodal model that integrates visual instruction tuning with masked diffusion models, demonstrating competitive performance in multimodal tasks despite weaker text-only capabilities and achieving state-of-the-art results among diffusion-based MLLMs.
Authors:Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao
Abstract:
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.
Chinese: 本文提出了用于评估教育领域基础模型视觉推理能力的基准EduVisBench,并开发了多智能体框架EduVisAgent,该框架通过协同合作显著提升了符合教学需求的可视化内容生成效果。
English: This paper introduces EduVisBench, a benchmark for evaluating the visual reasoning capabilities of foundation models in education, and proposes EduVisAgent, a multi-agent framework that significantly enhances the generation of pedagogically effective visualizations.
Authors:Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Abstract:
Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
中文: 当前大语言模型遗忘效果评估依赖的词汇级指标存在误导性,因为模型可能只是表面遗忘而信息仍可恢复,这凸显了需要建立表征分析框架来区分可逆与不可逆遗忘的必要性。
English: Current token-level metrics for evaluating unlearning in LLMs can be misleading, as models may only superficially forget information that remains recoverable, prompting the need for a new representational analysis framework to distinguish between reversible and irreversible forgetting.
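One of the representation-level tools mentioned, centered kernel alignment, can be computed in its standard linear form as follows; the paper's full toolkit also includes PCA-based similarity/shift and Fisher information, which are not shown here:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear centered kernel alignment between two representation matrices.

    X: (n_samples, d1), Y: (n_samples, d2) hidden states of the same inputs
    taken before and after unlearning. CKA close to 1 means the latent
    geometry is preserved (suggesting reversible, surface-level forgetting).
    """
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2          # Frobenius norm of the cross-covariance
    norm_x = (X.T @ X).norm()
    norm_y = (Y.T @ Y).norm()
    return hsic / (norm_x * norm_y)
```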
Authors:Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
Abstract:
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.
中文摘要:针对特定领域数据微调大型语言模型可能意外引入脆弱性,这些脆弱性受数据集的语言特征和毒性等因素影响,从而削弱模型鲁棒性,并凸显了策略性数据集设计对防御的重要性。
English Summary: Fine-tuning large language models on domain-specific data can inadvertently introduce accidental vulnerabilities, which are influenced by factors like linguistic features and toxicity in the datasets, ultimately affecting model robustness and highlighting the importance of strategic dataset design for defense.
Authors:Florentin Beck, William Rudman, Carsten Eickhoff
Abstract:
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
中文:TRIM提出了一种针对性的迭代剪枝方法,通过为各层内部维度分配差异化稀疏度,在多类大语言模型压缩中实现了最优性能与稳定性。
English: TRIM introduces a targeted, iterative pruning method that applies varying sparsity to individual dimensions within layers, achieving state-of-the-art performance and stability in LLM compression across multiple models.
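A simplified sketch of the row-wise idea, assigning a different sparsity ratio to each output dimension and scoring weights by magnitude, is shown below; TRIM itself adjusts the per-row ratios iteratively with quality metrics, which this sketch omits:

```python
import torch

def rowwise_prune(weight: torch.Tensor, row_sparsity: torch.Tensor) -> torch.Tensor:
    """Apply a different sparsity ratio to each output dimension (row).

    weight: (out_dim, in_dim) linear layer weight.
    row_sparsity: (out_dim,) fraction of entries to zero in each row.
    Entries are scored by magnitude here for simplicity.
    """
    pruned = weight.clone()
    for i, s in enumerate(row_sparsity):
        k = int(s * weight.shape[1])              # number of weights to drop in this row
        if k == 0:
            continue
        idx = weight[i].abs().topk(k, largest=False).indices
        pruned[i, idx] = 0.0
    return pruned
```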
Authors:Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun
Abstract:
The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at https://github.com/ChengcanWu/SAP.
Chinese: 本文提出了一种安全感知探测(SAP)框架,通过在梯度传播过程中引入安全感知探针,有效减轻大型语言模型在微调过程中的安全性退化,在保持性能的同时显著降低有害性。
English: The paper introduces a safety-aware probing (SAP) framework that mitigates safety degradation in large language models during fine-tuning by incorporating safety-aware probes into gradient propagation, effectively reducing harmfulness while maintaining performance.
Authors:Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Abstract:
This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent linear stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves competitive performance on various image-conditioned (e.g., image restoration) and unconditional generation tasks, demonstrating its effectiveness in generative modelling. Our code is available at https://github.com/Algolzw/FoD.
Chinese: 本文提出了一种仅前向扩散(FoD)模型,通过采用具有均值回复特性的随机微分方程的单向前向扩散过程简化了生成建模,并在图像条件生成(如图像修复)和无条件生成任务中取得了有竞争力的性能。
English: This paper introduces a forward-only diffusion (FoD) model that simplifies generative modelling by using a single forward diffusion process with a mean-reverting stochastic differential equation, achieving competitive performance on image-conditioned (e.g., image restoration) and unconditional generation tasks.
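To illustrate the mean-reversion idea only (not FoD's exact state-dependent coefficients or its stochastic flow matching objective), a toy Euler-Maruyama step of a mean-reverting SDE might look like:

```python
import torch

def mean_reverting_step(x, target, dt, theta=1.0, sigma=0.5):
    """One Euler-Maruyama step of a generic mean-reverting SDE
        dx = theta * (target - x) dt + sigma * sqrt(|target - x|) dW,
    whose drift (and noise magnitude) pulls the state toward the target.

    This is only a toy illustration of the mean-reversion property; the
    paper's drift/diffusion functions and training objective differ.
    """
    drift = theta * (target - x)
    diffusion = sigma * (target - x).abs().sqrt()
    noise = torch.randn_like(x) * dt ** 0.5
    return x + drift * dt + diffusion * noise
```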
Authors:Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
Abstract:
While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands maximum sequence length from 1K to 16K tokens, while SpaCO demonstrates accelerated training speed -- achieving up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at https://github.com/wenhaoli-xmu/seco.
Chinese: 本文提出SeCO和SpaCO两种内存高效的训练方法,通过分块处理长输入来优化大语言模型的长上下文能力,在降低计算成本的同时实现了更长序列处理和更快训练速度。
English: This paper introduces SeCO and SpaCO, two memory-efficient training methods that optimize long-context LLMs by processing input in chunks, enabling longer sequence handling and faster training speeds while reducing computational costs.
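A minimal sketch of the chunk-wise training loop follows, assuming a hypothetical `chunk_forward(chunk, memory)` that runs the model on one chunk and returns a loss plus reusable (detachable) state; it is meant to show why peak activation memory stays bounded by one chunk, not to reproduce SeCO/SpaCO exactly:

```python
import torch

def chunkwise_update(chunk_forward, chunks, optimizer):
    """Sequential chunk-wise optimization sketch.

    Each chunk builds its own computation graph and is back-propagated
    immediately, so only one chunk's activations are alive at any time.
    `memory` carries detached key/value (or recurrent) state between chunks,
    which is a simplification of the paper's exact scheme.
    """
    optimizer.zero_grad()
    memory, total = None, 0.0
    for chunk in chunks:
        loss, memory = chunk_forward(chunk, memory)
        loss.backward()                               # gradients accumulate chunk by chunk
        memory = tuple(m.detach() for m in memory)    # cut the graph to free activations
        total += loss.item()
    optimizer.step()
    return total
```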
Authors:Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng
Abstract:
The Forward-Forward (FF) algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise "goodness" function to guide learning. Existing goodness functions, inspired by energy-based learning (EBL), are typically defined as the sum of squared post-synaptic activations, neglecting the correlations between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for clamped inputs when noise is considered while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples. We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared outputs, which is equivalent to making predictions based on the energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward
中文: 本研究为前向-前向算法提出了一种新颖的“维度压缩”优度函数,通过有效维度捕捉神经元相关性,无需负样本即可实现有竞争力的性能,同时揭示了噪声在增强泛化能力和推理中的建设性作用。
English: The study introduces a novel "dimensionality compression" goodness function for the Forward-Forward algorithm, using effective dimensionality to capture neuron correlations and achieve competitive performance without negative samples, while highlighting noise's constructive role in enhancing generalization and inference.
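The effective dimensionality underlying the proposed goodness can be computed as a participation ratio of the response covariance spectrum; a minimal sketch is below (the paper's objective additionally contrasts clamped-input ED against sample-distribution ED):

```python
import torch

def effective_dimensionality(responses: torch.Tensor) -> torch.Tensor:
    """Participation-ratio effective dimensionality of fluctuating responses.

    responses: (n_trials, n_neurons) activations of one layer under noise.
    ED = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are eigenvalues
    of the response covariance; ED is low when variance is concentrated in a
    few directions and high when it is spread out.
    """
    centered = responses - responses.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (responses.shape[0] - 1)
    eig = torch.linalg.eigvalsh(cov)
    return eig.sum() ** 2 / (eig ** 2).sum()
```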
Authors:Benjamin Herdeanu, Juan Nathaniel, Carla Roesch, Jatan Buch, Gregor Ramien, Johannes Haux, Pierre Gentine
Abstract:
Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on https://kausable.github.io/CausalDynamics.
Authors:Jisu Han, Jaemin Na, Wonjun Hwang
Abstract:
Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continuous scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results. Our code is available at https://github.com/pilsHan/rem
中文: 本研究提出排序熵最小化方法,通过渐进式结构化预测难度并保持熵的排序,有效防止持续测试时适应中的模型崩溃,在多个基准测试中验证了其有效性。
English: The study introduces ranked entropy minimization to prevent model collapse in continual test-time adaptation by progressively structuring prediction difficulty while maintaining entropy rank order, demonstrating effectiveness across benchmarks.
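For reference, the plain entropy minimization objective that the method builds on is only a few lines; the proposed approach adds a progressive masking schedule and a rank-order constraint on entropy across masking levels, which are not shown here:

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits: torch.Tensor) -> torch.Tensor:
    """Plain test-time entropy minimization: push each prediction toward a
    confident (low-entropy) distribution. Used alone, this objective is prone
    to the collapse described above, which ranked entropy minimization mitigates.
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()
```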
Authors:Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Abstract:
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models. Our code is available at https://github.com/ruizheliUOA/ARC_JSD
English Summary: This paper introduces ARC-JSD, a novel method that uses Jensen-Shannon Divergence to efficiently attribute generated responses to specific context segments in Retrieval-Augmented Generation systems, eliminating the need for fine-tuning while demonstrating superior accuracy and computational efficiency across multiple benchmarks.
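The attribution score reduces to a Jensen-Shannon divergence between output distributions computed with and without a candidate context sentence; a minimal sketch of the divergence itself (the paper defines the exact aggregation over answer tokens) is:

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two next-token distributions.

    In an ARC-JSD-style attribution, p_logits come from generating the answer
    with the full retrieved context and q_logits with one context sentence
    removed; a large divergence marks that sentence as important for the answer.
    """
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * (kl_pm + kl_qm)
```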
Authors:Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu
Abstract:
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
Chinese: SPIN是一种新颖的注意力引导头抑制方法,通过在推理过程中选择性抑制低注意力头来减少大型视觉语言模型的幻觉现象,在实现幻觉分数降低2.7倍的同时保持F1分数,并将吞吐量提升1.8倍且不增加延迟。
English: SPIN is a novel attention-guided head suppression method that reduces hallucinations in large vision-language models by selectively suppressing low-attention heads during inference, achieving up to 2.7x lower hallucination scores with 1.8x higher throughput and no latency overhead.
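A rough sketch of the head-selection step, keeping only the top-K heads per query token by attention mass on image tokens, is given below; tensor names and shapes are illustrative, not the released implementation:

```python
import torch

def spin_head_mask(attn: torch.Tensor, image_token_ids, top_k: int) -> torch.Tensor:
    """Build a per-query-token head mask from attention to image tokens.

    attn: (num_heads, num_queries, num_keys) attention weights of one layer.
    image_token_ids: indices of the image tokens among the keys.
    For each query token, heads are scored by their total attention mass on
    image tokens; only the top_k heads are kept (mask 1), others suppressed (0).
    """
    img_mass = attn[:, :, image_token_ids].sum(dim=-1)   # (heads, queries)
    topk = img_mass.topk(top_k, dim=0).indices           # (top_k, queries)
    mask = torch.zeros_like(img_mass)
    mask.scatter_(0, topk, 1.0)
    return mask  # multiply the corresponding heads' outputs by this mask
```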
Authors:Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
Abstract:
Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.
中文:Tool-Star是一个基于强化学习的框架,通过两阶段训练和分层奖励设计,使大语言模型能够在推理过程中自主调用多种外部工具,在多项基准测试中展现出卓越性能。
English: Tool-Star is a reinforcement learning framework that enables large language models to autonomously use multiple external tools during reasoning through a two-stage training process and hierarchical reward design, demonstrating superior performance across various benchmarks.
Authors:Huazi Pan, Yanjun Zhang, Leo Yu Zhang, Scott Adams, Abbas Kouzani, Suiyang Khoo
Abstract:
Manipulation of local training data and local updates, i.e., the poisoning attack, is the main threat arising from the collaborative nature of the federated learning (FL) paradigm. Most existing poisoning attacks aim to manipulate local data/models in a way that causes denial-of-service (DoS) issues. In this paper, we introduce a novel attack method, named Federated Learning Sliding Attack (FedSA) scheme, aiming at precisely introducing the extent of poisoning in a subtle controlled manner. It operates with a predefined objective, such as reducing global model's prediction accuracy by 10%. FedSA integrates robust nonlinear control-Sliding Mode Control (SMC) theory with model poisoning attacks. It can manipulate the updates from malicious clients to drive the global model towards a compromised state, achieving this at a controlled and inconspicuous rate. Additionally, leveraging the robust control properties of FedSA allows precise control over the convergence bounds, enabling the attacker to set the global accuracy of the poisoned model to any desired level. Experimental results demonstrate that FedSA can accurately achieve a predefined global accuracy with fewer malicious clients while maintaining a high level of stealth and adjustable learning rates.
Chinese: 联邦学习滑动攻击(FedSA)是一种新型投毒方法,它利用滑模控制理论精细操控恶意客户端更新,以隐蔽方式将全局模型准确率精准降低至预设水平,且只需较少恶意客户端即可实现。
English: The Federated Learning Sliding Attack (FedSA) is a novel poisoning method that subtly manipulates malicious client updates using Sliding Mode Control to precisely degrade the global model's accuracy to a predefined level while remaining stealthy and requiring fewer malicious clients.
Authors:Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin
Abstract:
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models enable to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models when learned on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
中文: 本文提出CACTI和CACTIF两种基于扩散模型的语义一致性风格迁移方法,通过类别自适应归一化和注意力过滤机制,在保持语义边界和结构连贯性的同时有效缩小合成数据与真实数据的领域差距,仅需少量真实数据即可显著提升视觉模型性能。
English: This paper introduces CACTI and CACTIF, two diffusion-based techniques that perform semantically consistent style transfer to bridge the synthetic-to-real domain gap, improving vision model performance with minimal real data by preserving semantic boundaries and structural coherence.
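The class-wise adaptive instance normalization at the heart of CACTI can be sketched as re-normalizing content features per semantic class to the style statistics of the same class; the snippet below is a simplified single-scale version with hypothetical tensor shapes:

```python
import torch

def classwise_adain(content_feat, style_feat, content_seg, style_seg, num_classes, eps=1e-5):
    """Class-wise adaptive instance normalization on feature maps.

    content_feat, style_feat: (C, H, W) features of the synthetic and real image.
    content_seg, style_seg:   (H, W) semantic label maps aligned with the features.
    For each class present in both maps, content features inside that class
    region are re-normalized to the style features' per-class mean and std.
    """
    out = content_feat.clone()
    for c in range(num_classes):
        cm, sm = content_seg == c, style_seg == c
        if cm.sum() < 2 or sm.sum() < 2:
            continue
        c_region = content_feat[:, cm]                       # (C, Nc)
        s_region = style_feat[:, sm]                         # (C, Ns)
        c_mean = c_region.mean(1, keepdim=True)
        c_std = c_region.std(1, keepdim=True) + eps
        s_mean = s_region.mean(1, keepdim=True)
        s_std = s_region.std(1, keepdim=True)
        out[:, cm] = (c_region - c_mean) / c_std * s_std + s_mean
    return out
```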
Authors:Qian Tan, Dongzhan Zhou, Peng Xia, Wanhao Liu, Wanli Ouyang, Lei Bai, Yuqiang Li, Tianfan Fu
Abstract:
Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. Also, we design five multimodal tasks across text, molecular SMILES strings, and image, and curate the datasets. We benchmark ChemMLLM against a range of general leading MLLMs and Chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 116.75\% (4.27 vs 1.97 property improvement). The code is publicly available at https://github.com/bbsbz/ChemMLLM.git.
Chinese: ChemMLLM是一种创新的化学多模态大语言模型,专为分子理解和生成而设计,在多项任务中表现卓越,如在分子图像优化任务上比GPT-4o提升了116.75%,代码已公开。
English: ChemMLLM is a novel multimodal large language model designed for chemical applications that excels in molecule understanding and generation, significantly outperforming existing models across multiple tasks, such as achieving a 116.75% improvement in molecule image optimization over GPT-4o.
Authors:Yangyang Wang, Jiawei Gu, Li Long, Xin Li, Li Shen, Zhouyu Fu, Xiangjun Zhou, Xu Jiang
Abstract:
Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K) and code (https://github.com/Dingdong-Inc/frn-50k-baseline) are openly released.
中文摘要:FreshRetailNet-50K作为首个包含5万条时序数据的大规模基准数据集,通过精确标注缺货事件实现两阶段需求建模,将系统性需求低估从7.37%降至接近零偏差,同时提升预测精度2.73%。
English Summary: FreshRetailNet-50K introduces the first large-scale benchmark with 50,000 hourly sales series and precise stockout annotations, enabling two-stage demand modeling that reduces prediction bias from 7.37% to near-zero while improving accuracy by 2.73%.
Authors:Arjhun Swaminathan, Mete Akgün
Abstract:
Deep neural networks for image classification remain vulnerable to adversarial examples -- small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70\% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
Chinese: 针对图像分类的深度神经网络易受对抗样本攻击,本文提出的目标边缘信息攻击(TEA)通过利用目标图像边缘信息,在黑盒设置中高效生成对抗样本,显著减少查询次数并提升误分类效果。
English: Deep neural networks for image classification are vulnerable to adversarial examples, and the proposed Targeted Edge-informed Attack (TEA) effectively crafts these in black-box settings by using target image edges to reduce queries and enhance misclassification.
Authors:Sampanna Yashwant Kahu, Naman Ahuja
Abstract:
Social media and online forums are increasingly becoming popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep learning based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate speech. Our best perturbation attack successfully evades hate-speech detection for 86.8% of hateful text.
Chinese: 本文设计了黑盒技术,通过生成扰动来规避最先进的仇恨言论检测模型,在保持原意基本不变的同时,成功使86.8%的仇恨文本逃过检测。
English: This paper develops black-box techniques to generate perturbations that evade state-of-the-art hate speech detection models, successfully bypassing detection for 86.8% of hateful text while minimally altering the original meaning.
Authors:Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji
Abstract:
Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for complicated conditioning mechanisms widely used in prior arts. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley could even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/
Chinese Summary: SpecMaskFoley通过引入频率感知时序特征对齐器,将预训练的SpecMaskGIT模型适配于ControlNet框架,显著提升了基于视频的拟音合成性能,甚至优于从头训练的基线模型。
English Summary: SpecMaskFoley bridges the performance gap in video-synchronized foley synthesis by adapting the pretrained SpecMaskGIT model through ControlNet with a frequency-aware temporal feature aligner, outperforming from-scratch models.
Authors:Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
Abstract:
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with faithfulness or hallucination, extracting more precise and disentangled hallucination-related representations. Our analysis demonstrates that interventions along the identified faithful direction can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
中文摘要:本文提出SSL方法,通过稀疏自编码器识别并调控大视觉语言模型中的潜在语义方向,在可忽略的额外时间成本下有效缓解幻觉现象,同时保持语义完整性和跨模型架构的迁移能力。
English Summary: This paper introduces SSL, a plug-and-play method using sparse autoencoders to identify and steer latent directions in large vision-language models, effectively reducing hallucinations while maintaining semantic integrity and transferability across architectures with minimal computational overhead.
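The intervention itself amounts to shifting hidden states along an SAE-derived direction; a minimal sketch follows, where the direction, the layer choice, and the scale `alpha` are placeholders rather than the released configuration:

```python
import torch

def steer_hidden_states(hidden: torch.Tensor, faithful_dir: torch.Tensor, alpha: float = 4.0):
    """Steer a layer's hidden states along an SAE-derived 'faithful' direction.

    hidden: (batch, seq_len, d_model) residual-stream activations.
    faithful_dir: (d_model,) decoder direction of an SAE latent associated with
    faithful (non-hallucinated) generations; adding it nudges decoding toward
    image-consistent text, and subtracting a 'hallucinatory' direction works analogously.
    """
    direction = faithful_dir / faithful_dir.norm()
    return hidden + alpha * direction
```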
Authors:Yuke Zhang
Abstract:
This study introduces an interpretable machine learning (ML) framework to extract macroeconomic alpha from global news sentiment. We process the Global Database of Events, Language, and Tone (GDELT) Project's worldwide news feed using FinBERT -- a Bidirectional Encoder Representations from Transformers (BERT) based model pretrained on finance-specific language -- to construct daily sentiment indices incorporating mean tone, dispersion, and event impact. These indices drive an XGBoost classifier, benchmarked against logistic regression, to predict next-day returns for EUR/USD, USD/JPY, and 10-year U.S. Treasury futures (ZN). Rigorous out-of-sample (OOS) backtesting (5-fold expanding-window cross-validation, OOS period: c. 2017-April 2025) demonstrates exceptional, cost-adjusted performance for the XGBoost strategy: Sharpe ratios achieve 5.87 (EUR/USD), 4.65 (USD/JPY), and 4.65 (Treasuries), with respective compound annual growth rates (CAGRs) exceeding 50% in Foreign Exchange (FX) and 22% in bonds. Shapley Additive Explanations (SHAP) affirm that sentiment dispersion and article impact are key predictive features. Our findings establish that integrating domain-specific Natural Language Processing (NLP) with interpretable ML offers a potent and explainable source of macro alpha.
中文: 本研究通过结合领域特定的自然语言处理与可解释机器学习,构建了一个基于新闻情感的宏观经济预测框架,在汇率和债券市场中实现了卓越且可解释的投资回报。
English: This study develops an interpretable machine learning framework using news sentiment analysis to predict financial market returns, demonstrating exceptional performance in foreign exchange and bond markets through rigorous testing.
Authors:Nathan Brady, David Tennyson, Thomas Vandermeulen
Abstract:
In this paper, we apply both supervised and unsupervised machine learning algorithms to the study of the string landscape and swampland in 6-dimensions. Our data are the (almost) anomaly-free 6-dimensional $\mathcal{N} = (1,0)$ supergravity models, characterised by the Gram matrix of anomaly coefficients. Our work demonstrates the ability of machine learning algorithms to efficiently learn highly complex features of the landscape and swampland. Employing an autoencoder for unsupervised learning, we provide an auto-classification of these models by compressing the Gram matrix data to 2-dimensions. Through compression, similar models cluster together, and we identify prominent features of these clusters. The autoencoder also identifies outlier models which are difficult to reconstruct. One of these outliers proves to be incredibly difficult to combine with other models such that the $\text{tr}R^{4}$ anomaly vanishes, making its presence in the landscape extremely rare. Further, we utilise supervised learning to build two classifiers predicting (1) model consistency under probe string insertion (precision: 0.78, predicting consistency for 214,837 models with reasonable certainty) and (2) inconsistency under anomaly inflow (precision: 0.91, predicting inconsistency for 1,909,359 models). Notably, projecting these predictions onto the autoencoder's 2-dimensional latent layer shows consistent models clustering together, further indicating that the autoencoder has learnt interesting and complex features of the set of models and potentially offers a novel approach to mapping the landscape and swampland of 6-dimensional supergravity theories.
中文: 本研究运用监督和无监督机器学习分析六维超引力模型,通过降维和预测聚类展示了算法如何有效分类景观特征并识别罕见异常值。
English: This study employs supervised and unsupervised machine learning to analyze 6D supergravity models, demonstrating how algorithms can classify landscape features and identify rare outliers through dimensionality reduction and predictive clustering.
Authors:Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Abstract:
Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.
中文: 研究者提出G$^2$PM生成式Transformer框架,通过将图表示为子结构序列来克服图神经网络中消息传递的局限性,在多项任务中展现出卓越的可扩展性和性能优势。
English: The authors propose G$^2$PM, a generative Transformer framework that overcomes message-passing limitations in graph neural networks by representing graphs as sequences of substructures, demonstrating superior scalability and performance across diverse tasks.
Authors:Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou
Abstract:
Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets, however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the $E^3$ metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in $E^3$. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B)-demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.
中文: 大语言模型常因过度思考导致推理效率低下,而提出的Plan-and-Budget框架通过分解查询并自适应分配计算资源,显著提升了任务处理的效率和准确性。
English: Large Language Models often suffer from overthinking, leading to inefficient reasoning, but the proposed Plan-and-Budget framework addresses this by decomposing queries and adaptively allocating token budgets, significantly improving efficiency and accuracy across tasks.
Authors:Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, Kaize Ding
Abstract:
Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language, symbolic notations, with emerging extensions to incorporate multi-modal inputs. To advance the new field of LLM for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at https://github.com/REAL-Lab-NU/Awesome-LLM-Centric-Molecular-Discovery.
中文: 大语言模型通过文本引导探索化学空间,并支持分子生成与优化等核心任务,正在彻底改变分子发现领域,本综述对此进行了系统梳理与前瞻展望。
English: Large language models are revolutionizing molecular discovery by enabling text-guided exploration of chemical spaces and supporting key tasks like molecule generation and optimization, as detailed in this comprehensive survey.
Authors:Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Abstract:
Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .
中文: 本文提出了两种衡量混合专家模型局部路由一致性的指标,发现每层都采用MoE且无共享专家的模型具有最高一致性,通过缓存大小约为活跃专家两倍即可实现内存高效部署。
English: This paper introduces two metrics to measure local routing consistency in Mixture-of-Experts models, revealing that models with MoE on every layer and no shared experts show the highest consistency, enabling memory-efficient deployment with cache sizes about twice the active experts.
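One way to read the Segment Routing Best Performance (SRP) metric is: for a segment of tokens, how much of the routed expert traffic could a single fixed expert group have served? The sketch below is an approximate paraphrase under that assumption, not the paper's exact formulation.

```python
from collections import Counter

def segment_routing_best_performance(routed_experts, group_size):
    """Approximate SRP for one segment.

    `routed_experts` holds one list of activated expert ids per token (top-k routing).
    We pick the `group_size` most frequently used experts in the segment and report
    the fraction of expert activations that this fixed group would cover.
    """
    counts = Counter(e for token_experts in routed_experts for e in token_experts)
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(group_size))
    return covered / total if total else 0.0

# Example: 4 tokens with top-2 routing; a 2-expert cache covers 6 of 8 activations.
print(segment_routing_best_performance([[0, 3], [0, 5], [3, 5], [0, 3]], group_size=2))  # 0.75
```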
Authors:Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Abstract:
Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
Chinese: 大记忆语言模型(LMLM)在预训练期间将事实知识外部化存储到数据库,既实现了与大型模型相媲美的性能,又提供了可编辑和可验证的知识库。
English: Large Memory Language Models (LMLM) externalize factual knowledge to an external database during pre-training, enabling competitive performance with larger models while providing editable and verifiable knowledge bases.
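The central training trick described above is to exclude externally retrieved factual values from the next-token loss, so the model learns to issue lookups instead of memorizing those values. A minimal PyTorch sketch of such span masking, assuming the positions of retrieved-value tokens are already known:

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, retrieved_mask):
    """Next-token cross-entropy that ignores externally retrieved factual spans.

    logits: (batch, seq, vocab); targets: (batch, seq);
    retrieved_mask: (batch, seq) bool, True where the target token was supplied
    by the external database and should not contribute to the loss.
    """
    targets = targets.masked_fill(retrieved_mask, -100)  # -100 = ignore_index
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)

# Toy example: batch of 1, sequence of 4, vocabulary of 10.
logits = torch.randn(1, 4, 10)
targets = torch.randint(0, 10, (1, 4))
mask = torch.tensor([[False, True, True, False]])  # middle tokens came from the database
print(masked_lm_loss(logits, targets, mask).item())
```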
Authors:Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Abstract:
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.
中文: MoRE-Brain提出了一种基于脑网络原理的混合专家框架,通过分层处理和双阶段路由机制,实现了从fMRI信号到图像的高保真、可适应且可解释的视觉重建,显著提升了重建质量与神经机制的可解释性。
English: MoRE-Brain introduces a neuro-inspired Mixture-of-Experts framework that achieves high-fidelity, adaptable, and interpretable visual reconstruction from fMRI through hierarchical processing and dual-stage routing, advancing both reconstruction quality and mechanistic insight.
Authors:Danna Zheng, Mirella Lapata, Jeff Z. Pan
Abstract:
Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
中文: 本文提出MontageLie基准测试,通过组合真实陈述构建欺骗性叙述来揭示现有事实核查方法的脆弱性,并推出DoveScore框架,通过联合验证事实准确性和事件顺序一致性,显著提升了评估鲁棒性。
English: This paper introduces MontageLie, a benchmark revealing vulnerabilities in current fact-checking methods by creating deceptive narratives from truthful statements, and proposes DoveScore, a robust framework that improves evaluation accuracy by verifying both factual accuracy and event-order consistency.
Authors:Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Shang Wu, Yu Cao, Caiwen Ding, Yang Zhao
Abstract:
Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both code graph view and hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks by a task-specific retrieval finetuning. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. Experimental results demonstrate that HDLxGraph significantly improves average search accuracy, debugging efficiency and completion quality by 12.04%, 12.22% and 5.04% compared to similarity-based RAG, respectively. The code of HDLxGraph and collected HDLSearch benchmark are available at https://github.com/Nick-Zheng-Q/HDLxGraph.
中文摘要:HDLxGraph是一个创新框架,将图检索增强生成与大语言模型相结合,通过硬件专用图表示显著提升了现实硬件设计任务中的代码搜索、调试和完成性能。
English Summary: HDLxGraph is a novel framework that integrates Graph Retrieval Augmented Generation with Large Language Models, using hardware-specific graph representations to significantly enhance performance in real-world hardware design tasks such as code search, debugging, and completion.
Authors:Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
Abstract:
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
中文: 本文提出LASER及其增强版LASER-D方法,通过基于长度的奖励塑造机制减少大推理模型的输出冗余,在显著压缩推理过程的同时实现了更优的性能与效率平衡。
English: This paper introduces LASER and its enhanced version LASER-D, RL-based methods that use length-based reward shaping to reduce redundancy in Large Reasoning Models, achieving superior efficiency and performance with significantly shorter reasoning traces.
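The length-based step reward can be illustrated in a few lines. The sketch below is a hedged reading of the idea: a correct answer earns an extra bonus only if it stays under a target length; the actual coefficients and the dynamic, difficulty-aware extension (LASER-D) are not reproduced.

```python
def laser_step_reward(is_correct, response_length, target_length, bonus=1.0):
    """Toy length-based step reward: base correctness reward plus a step bonus
    granted only when the response stays within the target length."""
    base = 1.0 if is_correct else 0.0
    within_budget = response_length <= target_length
    return base + (bonus if (is_correct and within_budget) else 0.0)

print(laser_step_reward(True, 800, 1024))   # concise and correct -> 2.0
print(laser_step_reward(True, 4000, 1024))  # correct but verbose -> 1.0
print(laser_step_reward(False, 300, 1024))  # wrong               -> 0.0
```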
Authors:Haocheng Ju, Bin Dong
Abstract:
Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.
中文: 本文提出MIRB基准,通过四项任务和12个数据集统一评估数学信息检索系统,填补了标准化测评空白,并对13种模型进行分析以推动数学领域检索技术的发展。
English: This paper introduces MIRB, a unified benchmark for evaluating Mathematical Information Retrieval across four tasks and 12 datasets, addressing the lack of standardized assessment and analyzing 13 models to advance domain-specific retrieval systems.
Authors:Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Abstract:
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
中文: 提出的自适应难负样本扰动学习(AHNPL)方法通过生成基于图像的难负样本并采用自适应对比学习,有效提升了视觉语言模型在组合推理任务上的性能。
English: The proposed Adaptive Hard Negative Perturbation Learning (AHNPL) method enhances Vision-Language Models by generating image-based hard negatives and employing adaptive contrastive learning to improve performance on compositional reasoning tasks.
Authors:Naiqi Li, Yuqiu Xie, Peiyuan Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Abstract:
Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten-$p$ norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at https://github.com/naiqili/EDLRR.
中文: 本文提出了一种高效可微分的广义低秩正则化近似方法,能够以即插即用方式整合到损失函数中,具备GPU友好操作和严格收敛性证明,实验结果表明该方法兼具多功能性和高效性。
English: This paper introduces an efficient differentiable approximation for generalized low-rank regularization, enabling plug-and-play integration into loss functions with GPU-friendly operations and proven convergence, while experimental results validate its versatility and efficiency.
Authors:Jacob E. Kooi, Zhao Yang, Vincent François-Lavet
Abstract:
Neural network architectures have a large impact in machine learning. In reinforcement learning, network architectures have remained notably simple, as changes often lead to small gains in performance. This work introduces a novel encoder architecture for pixel-based model-free reinforcement learning. The Hadamax (\textbf{Hada}mard \textbf{max}-pooling) encoder achieves state-of-the-art performance by max-pooling Hadamard products between GELU-activated parallel hidden layers. Based on the recent PQN algorithm, the Hadamax encoder achieves state-of-the-art model-free performance in the Atari-57 benchmark. Specifically, without applying any algorithmic hyperparameter modifications, Hadamax-PQN achieves an 80\% performance gain over vanilla PQN and significantly surpasses Rainbow-DQN. For reproducibility, the full code is available on \href{https://github.com/Jacobkooi/Hadamax}{GitHub}.
中文: 本研究提出Hadamax编码器,这是一种用于像素强化学习的新型神经网络架构,通过对并行隐藏层间的Hadamard乘积进行最大池化,在Atari-57基准测试中实现了最先进的性能。
English: This work introduces the Hadamax encoder, a novel neural network architecture for pixel-based reinforcement learning that achieves state-of-the-art performance on the Atari-57 benchmark by max-pooling Hadamard products between parallel hidden layers.
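The Hadamax construction (max-pooling Hadamard products of GELU-activated parallel layers) is concrete enough to sketch. The PyTorch block below is illustrative, with assumed channel counts and kernel sizes rather than the authors' exact encoder.

```python
import torch
import torch.nn as nn

class HadamaxBlock(nn.Module):
    """Two parallel conv branches -> GELU -> Hadamard product -> max-pooling."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.branch_a = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.branch_b = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.act = nn.GELU()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        h = self.act(self.branch_a(x)) * self.act(self.branch_b(x))  # elementwise product
        return self.pool(h)

# Example: an 84x84 Atari-style frame stack.
x = torch.randn(1, 4, 84, 84)
print(HadamaxBlock(4, 32)(x).shape)  # torch.Size([1, 32, 42, 42])
```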
Authors:Yuxuan Shu, Vasileios Lampos
Abstract:
Multivariable time series forecasting methods can integrate information from exogenous variables, leading to significant prediction accuracy gains. Transformer architecture has been widely applied in various time series forecasting models due to its ability to capture long-range sequential dependencies. However, a naïve application of transformers often struggles to effectively model complex relationships among variables over time. To mitigate against this, we propose a novel architecture, namely the Spectral Operator Neural Network (Sonnet). Sonnet applies learnable wavelet transformations to the input and incorporates spectral analysis using the Koopman operator. Its predictive skill relies on the Multivariable Coherence Attention (MVCA), an operation that leverages spectral coherence to model variable dependencies. Our empirical analysis shows that Sonnet yields the best performance on $34$ out of $47$ forecasting tasks with an average mean absolute error (MAE) reduction of $1.1\%$ against the most competitive baseline (different per task). We further show that MVCA -- when put in place of the naïve attention used in various deep learning models -- can remedy its deficiencies, reducing MAE by $10.7\%$ on average in the most challenging forecasting tasks.
Chinese: 提出的谱算子神经网络(Sonnet)通过引入可学习小波变换和谱相干注意力机制,有效提升了多变量时间序列预测的准确性,在多项任务中显著降低了平均绝对误差。
English: The proposed Spectral Operator Neural Network (Sonnet) enhances multivariable time series forecasting by integrating learnable wavelet transformations and spectral coherence attention, achieving superior accuracy with reduced mean absolute error compared to existing methods.
Authors:Om Khangaonkar, Hamed Pirsiavash
Abstract:
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
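中文: 本研究将Stable Diffusion与MAE微调用于类别无关的实例分割,仅在少数物体类型上使用实例着色损失训练,却表现出强大的零样本泛化能力,在未见类别与风格上接近重监督的SAM,表明生成式预训练学到了可跨类别迁移的分组机制。
English: This work finetunes generative models (Stable Diffusion and MAE) for category-agnostic instance segmentation using an instance coloring loss on a narrow set of object types, and finds strong zero-shot generalization that approaches the heavily supervised SAM on unseen categories and styles, suggesting generative pretraining learns a transferable grouping mechanism.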
Authors:Sampanna Yashwant Kahu
Abstract:
Efficient task scheduling is paramount in the Linux kernel, where the Completely Fair Scheduler (CFS) meticulously manages CPU resources to balance high utilization with interactive responsiveness. This research pioneers the use of deep learning techniques to predict the sequence of tasks selected by CFS, aiming to evaluate the feasibility of a more generalized and potentially more adaptive task scheduler for diverse workloads. Our core contributions are twofold: first, the systematic generation and curation of a novel scheduling dataset from a running Linux kernel, capturing real-world CFS behavior; and second, the development, training, and evaluation of a Long Short-Term Memory (LSTM) network designed to accurately forecast the next task to be scheduled. This paper further discusses the practical pathways and implications of integrating such a predictive model into the kernel's scheduling framework. The findings and methodologies presented herein open avenues for data-driven advancements in kernel scheduling, with the full source code provided for reproducibility and further exploration.
中文: 本研究开创性地利用深度学习技术,通过LSTM网络预测Linux内核中完全公平调度器的任务选择序列,构建了新型调度数据集并探讨了实现自适应任务调度的可行路径。
English: This research introduces a deep learning approach using an LSTM network to predict the Completely Fair Scheduler's task selection in the Linux kernel, creating a novel dataset and exploring integration possibilities for adaptive scheduling.
Authors:Bo-Han Lai, Pin-Han Huang, Bo-Han Kung, Shang-Tse Chen
Abstract:
Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers on constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet - a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines. The implementation is available at https://github.com/ntuaislab/BRONet.
Chinese: 本文提出了一种新颖的块反射正交(BRO)层和基于退火机制的新损失函数,增强了Lipschitz神经网络的表达能力,由此设计的BRONet在多个数据集上实现了最先进的认证鲁棒性。
English: This paper introduces a novel Block Reflector Orthogonal (BRO) layer and a new annealing-based loss function to enhance Lipschitz neural networks, resulting in BRONet, which achieves state-of-the-art certified robustness across multiple datasets.
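A block reflector of the form Q = I - 2V(V^T V)^{-1}V^T is orthogonal for any full-rank V, which appears to be the building block behind the BRO layer. The module below is a naive sketch of that construction only; the paper's parameterization and efficiency tricks are not reproduced.

```python
import torch
import torch.nn as nn

class BlockReflectorLinear(nn.Module):
    """Naive orthogonal layer: y = Q x with Q = I - 2 V (V^T V)^{-1} V^T."""

    def __init__(self, dim, rank):
        super().__init__()
        self.V = nn.Parameter(torch.randn(dim, rank) / dim ** 0.5)

    def weight(self):
        gram = self.V.T @ self.V                              # (rank, rank)
        proj = self.V @ torch.linalg.solve(gram, self.V.T)    # V (V^T V)^{-1} V^T
        return torch.eye(self.V.size(0), device=self.V.device) - 2.0 * proj

    def forward(self, x):
        return x @ self.weight().T

layer = BlockReflectorLinear(dim=8, rank=3)
Q = layer.weight()
print(torch.allclose(Q @ Q.T, torch.eye(8), atol=1e-4))  # orthogonality check -> True
```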
Authors:Yuante Li, Xu Yang, Xiao Yang, Minrui Xu, Xisen Wang, Weiqing Liu, Jiang Bian
Abstract:
Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short RD-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. RD-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, RD-Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor-model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.
中文: 本文提出RD-Agent(Q)这一首创的多智能体框架,通过迭代式的因子-模型协同优化实现量化策略全栈自动化研发,在真实市场中以更少因子获得更高收益,并超越现有先进模型。
English: This paper introduces RD-Agent(Q), a pioneering multi-agent framework that automates full-stack quantitative strategy development through iterative factor-model co-optimization, achieving superior returns with fewer factors and outperforming existing models in real-market tests.
Authors:Tong Cheng, Jie Fu, Xinpeng Ling, Huifa Li, Zhili Chen, Haifeng Qian, Junqing Gong
Abstract:
Graph Neural Networks (GNNs) have been widely used for graph analysis. Federated Graph Learning (FGL) is an emerging learning framework to collaboratively train graph data from various clients. Although FGL allows client data to remain localized, a malicious server can still infer private client data from the uploaded gradients. In this paper, we for the first time propose label distribution attacks (LDAs) on FGL that aim to infer the label distributions of the client-side data. Firstly, we observe that the effectiveness of LDA is closely related to the variance of node embeddings in GNNs. Next, we analyze the relation between them and propose a new attack named EC-LDA, which significantly improves the attack effectiveness by compressing node embeddings. Then, extensive experiments on node classification and link prediction tasks across six widely used graph datasets show that EC-LDA outperforms the SOTA LDAs. Specifically, EC-LDA can achieve a cosine similarity (Cos-sim) as high as 1.0 in almost all cases. Finally, we explore the robustness of EC-LDA under differential privacy protection and discuss potential effective defenses against EC-LDA. Our code is available at https://github.com/cheng-t/EC-LDA.
中文: 本文提出EC-LDA,一种通过压缩节点嵌入来提升联邦图学习中标签分布推断效果的新型攻击方法,在多个数据集和任务中显著优于现有技术。
English: This paper introduces EC-LDA, a novel label distribution attack that enhances privacy inference in Federated Graph Learning by compressing node embeddings, achieving superior performance over existing methods across multiple datasets and tasks.
Authors:Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Abstract:
Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Our code is available at https://github.com/haiduo/DeepKD.
Chinese: DeepKD提出了一种双级解耦与自适应去噪的训练框架,通过独立动量更新器和动态掩码机制解决知识流冲突并过滤噪声信号,在多个数据集上验证了其有效性。
English: DeepKD introduces a dual-level decoupling framework with adaptive denoising to address conflicts between knowledge components and filter noisy signals in knowledge distillation, achieving superior performance across multiple datasets.
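The dynamic top-k mask (DTM) is the most self-contained piece to illustrate: keep the target class plus the k most confident non-target teacher logits, with k growing over training in a curriculum fashion. The scheduling constants below are invented for the sketch.

```python
import torch

def dynamic_topk_mask(teacher_logits, target, epoch, total_epochs, k_min=5, k_max=100):
    """Boolean mask keeping the target class plus the teacher's top-k non-target
    classes, where k grows linearly over training (curriculum-style)."""
    num_classes = teacher_logits.size(-1)
    k = int(k_min + (min(k_max, num_classes - 1) - k_min) * epoch / max(1, total_epochs - 1))
    ranked = teacher_logits.clone()
    ranked.scatter_(-1, target.unsqueeze(-1), float("-inf"))  # exclude target from the ranking
    topk_idx = ranked.topk(k, dim=-1).indices
    keep = torch.zeros_like(teacher_logits, dtype=torch.bool)
    keep.scatter_(-1, topk_idx, True)
    keep.scatter_(-1, target.unsqueeze(-1), True)
    return keep

logits, target = torch.randn(2, 100), torch.tensor([3, 7])
print(dynamic_topk_mask(logits, target, epoch=0, total_epochs=90).sum(dim=-1))   # 6 classes kept
print(dynamic_topk_mask(logits, target, epoch=89, total_epochs=90).sum(dim=-1))  # 100 classes kept
```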
Authors:Muniba Noreen, Furqan Shaukat
Abstract:
Lung cancer remains among the deadliest types of cancer in recent decades, and early lung nodule detection is crucial for improving patient outcomes. The limited availability of annotated medical imaging data remains a bottleneck in developing accurate computer-aided diagnosis (CAD) systems. Self-supervised learning can help leverage large amounts of unlabeled data to develop more robust CAD systems. With the recent advent of transformer-based architectures and their ability to generalize to unseen tasks, there has been an effort within the healthcare community to adapt them to various medical downstream tasks. Thus, we propose a novel "LungNodule-SSM" method, which utilizes self-supervised learning with DINOv2 as a backbone to enhance lung nodule detection and classification without annotated data. Our methodology has two stages: firstly, the DINOv2 model is pre-trained on unlabeled CT scans to learn robust feature representations; secondly, these features are fine-tuned using transformer-based architectures for lesion-level detection and accurate lung nodule diagnosis. The proposed method has been evaluated on the challenging LUNA 16 dataset, consisting of 888 CT scans, and compared with SOTA methods. Our experimental results show the superiority of our proposed method with an accuracy of 98.37%, demonstrating its effectiveness in lung nodule detection. The source code, datasets, and pre-processed data can be accessed at: https://github.com/EMeRALDsNRPU/Lung-Nodule-SSM-Self-Supervised-Lung-Nodule-Detection-and-Classification/tree/main
中文: 提出的"LungNodule-SSM"方法采用基于DINOv2的自监督学习,在LUNA 16数据集上实现了98.37%的肺结节检测准确率,性能优于现有最先进方法。
English: The proposed "LungNodule-SSM" method leverages self-supervised learning with DINOv2 to achieve 98.37% accuracy in lung nodule detection on the LUNA 16 dataset, outperforming existing state-of-the-art approaches.
Authors:Zehong Wang, Zheyuan Liu, Tianyi Ma, Jiazheng Li, Zheyuan Zhang, Xingbo Fu, Yiyang Li, Zhengqing Yuan, Wei Song, Yijun Ma, Qingkai Zeng, Xiusi Chen, Jianan Zhao, Jundong Li, Meng Jiang, Pietro Lio, Nitesh Chawla, Chuxu Zhang, Yanfang Ye
Abstract:
Graph-structured data pervades domains such as social networks, biological systems, knowledge graphs, and recommender systems. While foundation models have transformed natural language processing, vision, and multimodal learning through large-scale pretraining and generalization, extending these capabilities to graphs -- characterized by non-Euclidean structures and complex relational semantics -- poses unique challenges and opens new opportunities. To this end, Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to structured data, enabling broad transfer across graph-centric tasks and domains. This survey provides a comprehensive overview of GFMs, unifying diverse efforts under a modular framework comprising three key components: backbone architectures, pretraining strategies, and adaptation mechanisms. We categorize GFMs by their generalization scope -- universal, task-specific, and domain-specific -- and review representative methods, key innovations, and theoretical insights within each category. Beyond methodology, we examine theoretical foundations including transferability and emergent capabilities, and highlight key challenges such as structural alignment, heterogeneity, scalability, and evaluation. Positioned at the intersection of graph learning and general-purpose AI, GFMs are poised to become foundational infrastructure for open-ended reasoning over structured data. This survey consolidates current progress and outlines future directions to guide research in this rapidly evolving field. Resources are available at https://github.com/Zehong-Wang/Awesome-Foundation-Models-on-Graphs.
中文摘要:图基础模型(GFMs)通过骨干架构、预训练策略和适应机制,旨在为图结构数据提供可扩展的通用智能,在解决独特挑战的同时实现跨图中心任务和领域的广泛迁移。
English Summary: Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to graph-structured data through backbone architectures, pretraining strategies, and adaptation mechanisms, addressing unique challenges while enabling broad transfer across graph-centric tasks and domains.
Authors:Jeremy Qin
Abstract:
Time series forecasting plays a crucial role in various applications, particularly in healthcare, where accurate predictions of future health trajectories can significantly impact clinical decision-making. Ensuring transparency and explainability of the models responsible for these tasks is essential for their adoption in critical settings. Recent work has explored a top-down approach to bi-level transparency, focusing on understanding trends and properties of predicted time series using static features. In this work, we extend this framework by incorporating exogenous time series features alongside static features in a structured manner, while maintaining cohesive interpretation. Our approach leverages the insights of trajectory comprehension to introduce an encoding mechanism for exogenous time series, where they are decomposed into meaningful trends and properties, enabling the extraction of interpretable patterns. Through experiments on several synthetic datasets, we demonstrate that our approach remains predictive while preserving interpretability and robustness. This work represents a step towards developing robust, and generalized time series forecasting models. The code is available at https://github.com/jeremy-qin/TIMEVIEW
Chinese: 本研究通过将外生时间序列特征与静态特征结构化结合,在保持可解释性和鲁棒性的同时改进了时间序列预测,并在合成数据集上得到验证。
English: This study enhances time series forecasting by integrating exogenous time series features with static features in a structured way to maintain interpretability and robustness, as validated on synthetic datasets.
Authors:Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang
Abstract:
Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks. Our code and data are available at https://github.com/Tonyzhou98/disco_grpo.
中文: 提出的DISCO方法通过引入领域感知和难度感知的奖励缩放机制,有效解决GRPO在多领域数据中的群体不平衡问题,显著提升模型的泛化能力与公平性,并在多领域对齐基准测试中创下最新最优表现。
English: The proposed DISCO method enhances GRPO by incorporating domain-aware and difficulty-aware reward scaling to address inter-group imbalance, improving generalization and fairness across multi-domain datasets while achieving state-of-the-art performance.
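The two reweighting ideas, domain-aware and difficulty-aware reward scaling, can be sketched as multiplicative factors applied to a prompt's reward. The functional forms and constants below are placeholders chosen for illustration, not DISCO's exact equations.

```python
from collections import Counter

def disco_style_scale(domain, batch_domains, self_consistency, alpha=1.0, beta=1.0):
    """Toy reward scale: upweight rare domains (inverse frequency) and uncertain
    prompts (low agreement among sampled answers, i.e., low self-consistency)."""
    freq = Counter(batch_domains)
    domain_w = (len(batch_domains) / (len(freq) * freq[domain])) ** alpha
    difficulty_w = 1.0 + (1.0 - self_consistency) ** beta  # in [1, 2]
    return domain_w * difficulty_w

batch = ["math"] * 8 + ["law"] * 2
print(disco_style_scale("law", batch, self_consistency=0.3))   # rare + uncertain -> large scale
print(disco_style_scale("math", batch, self_consistency=0.9))  # common + easy    -> small scale
```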
Authors:Yingming Pu, Tao Lin, Hongyu Chen
Abstract:
Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering systematic uncertainty reduction. Overcoming these limitations fundamentally requires systematic uncertainty reduction. We introduce \texttt{PiFlow}, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). In evaluations across three distinct scientific domains -- discovering nanomaterial structures, bio-molecules, and superconductor candidates with targeted properties -- our method significantly improves discovery efficiency, reflected by a 73.55\% increase in the Area Under the Curve (AUC) of property values versus exploration steps, and enhances solution quality by 94.06\% compared to a vanilla agent system. Overall, \texttt{PiFlow} serves as a Plug-and-Play method, establishing a novel paradigm shift in highly efficient automated scientific discovery, paving the way for more robust and accelerated AI-driven research. Code is publicly available at our \href{https://github.com/amair-lab/PiFlow}{GitHub}.
Chinese: PiFlow提出了一种信息理论框架,将自动化科学发现视为结构化的不确定性减少过程,在多个科学领域中相比现有方法显著提高了发现效率和解决方案质量。
English: PiFlow introduces an information-theoretic framework that treats automated scientific discovery as structured uncertainty reduction, significantly improving efficiency and solution quality across multiple scientific domains compared to existing methods.
Authors:Ze Wang, Jingang Qu, Zhenyu Gao, Pascal Morin
Abstract:
This work demonstrates an airflow-inertial odometry system with multi-sensor data fusion, including a thermal anemometer, IMU, ESC, and barometer. This is challenging because low-cost IMUs and barometers have significant bias, and anemometer measurements are very susceptible to interference from spinning propellers and ground effects. We employ a GRU-based deep neural network to estimate relative air speed from noisy and disturbed anemometer measurements, and an observer with a bias model to fuse the sensor data and thus estimate the state of the aerial vehicle. Complete flight data, including takeoff and landing on the ground, show that the approach is able to decouple the downwash-induced wind speed caused by propellers and the ground effect, and accurately estimate the flight speed in a wind-free indoor environment. IMU and barometer bias are effectively estimated, which significantly reduces position integration drift: only 5.7 m over a 203 s manual random flight. The source code is available at https://github.com/SyRoCo-ISIR/Flight-Speed-Estimation-Airflow.
中文: 本研究提出了一种基于多传感器融合的气流惯性里程系统,采用GRU神经网络解耦螺旋桨引起的干扰,在无风环境下将203秒飞行的位置漂移降至5.7米,实现了精确的速度估计。
English: This study presents an airflow inertial odometry system using multi-sensor fusion and a GRU-based network to accurately estimate flight speed by decoupling propeller-induced disturbances and reducing position drift to 5.7m over 203 seconds in wind-free conditions.
Authors:Ivan Smirnov, Shangding Gu
Abstract:
Reinforcement learning (RL) has seen significant advancements through the application of various neural network architectures. In this study, we systematically investigate the performance of several neural networks in RL tasks, including Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Mamba/Mamba-2, Transformer-XL, Gated Transformer-XL, and Gated Recurrent Unit (GRU). Through comprehensive evaluation across continuous control, discrete decision-making, and memory-based environments, we identify architecture-specific strengths and limitations. Our results reveal that: (1) MLPs excel in fully observable continuous control tasks, providing an optimal balance of performance and efficiency; (2) recurrent architectures like LSTM and GRU offer robust performance in partially observable environments with moderate memory requirements; (3) Mamba models achieve a 4.5x higher throughput compared to LSTM and a 3.9x increase over GRU, all while maintaining comparable performance; and (4) only Transformer-XL, Gated Transformer-XL, and Mamba-2 successfully solve the most challenging memory-intensive tasks, with Mamba-2 requiring 8x less memory than Transformer-XL. These findings provide insights for researchers and practitioners, enabling more informed architecture selection based on specific task characteristics and computational constraints. Code is available at: https://github.com/SafeRL-Lab/RLBenchNet
中文: 本研究系统评估了多种神经网络在强化学习任务中的表现,发现MLP在完全可观测任务中表现卓越,循环网络在部分可观测环境中稳健高效,Mamba模型吞吐量显著提升,而仅Transformer-XL变体和Mamba-2能解决最高内存需求任务且效率优势突出。
English: This study systematically evaluates various neural network architectures in reinforcement learning tasks, revealing that MLPs excel in continuous control, recurrent networks handle partial observability efficiently, Mamba models offer superior throughput, and only Transformer-XL variants and Mamba-2 solve the most memory-intensive challenges with significant efficiency gains.
Authors:Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Abstract:
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
中文摘要:Tango提出了一种新颖的强化学习框架,通过协同训练LLM生成器和生成式验证器,在无需过程级标注的情况下实现两者相互增强,并在复杂推理任务上取得了最先进的性能。
English Summary: Tango introduces a novel reinforcement learning framework that co-trains an LLM generator and a generative verifier, achieving state-of-the-art performance on complex reasoning tasks through their mutual reinforcement without requiring process-level annotations.
Authors:Qingyu Song, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan
Abstract:
The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, encompassing models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW: larger models subjected to low-bit quantization consistently outperform smaller models utilizing higher bit-precision. 3) Quantization with low BPW incurs marginal accuracy loss but significant memory savings. 4) Power consumption on CPU is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
中文: 本研究提出了针对设备端大语言模型的系统评估方法,发现采用低比特量化的大型模型性能优于高精度小型模型,并为资源受限的边缘设备部署提供了实用指导。
English: The study introduces a systematic evaluation methodology for on-device LLMs, revealing that larger models with low-bit quantization outperform smaller high-precision ones and offering practical deployment guidelines for edge devices.
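The 'effective bits-per-weight' quantity that the system-level metrics scale with is easy to compute once a quantization layout is fixed. The helper below assumes a generic group-wise PTQ layout; the group size and scale/zero-point widths are illustrative, not taken from the paper.

```python
def effective_bpw(weight_bits, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight for group-wise quantization: payload bits plus
    the amortized overhead of one scale (and zero point) per group."""
    overhead = (scale_bits + zero_bits) / group_size
    return weight_bits + overhead

for bits in (2, 3, 4, 8):
    print(bits, "->", round(effective_bpw(bits), 3))
```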
Authors:Alvin Heng, Harold Soh
Abstract:
Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the Neyman--Pearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman--Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts. Our code is publicly available at https://github.com/clear-nus/sc-likelihood-ratios.
中文: 本研究通过奈曼-皮尔逊引理重新审视选择性分类,提出的似然比方法在协变量偏移场景下的视觉和语言任务中优于现有基线。
English: This study revisits selective classification through the Neyman-Pearson lemma, proposing likelihood ratio-based methods that outperform existing baselines under covariate shift scenarios in vision and language tasks.
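Viewed through the Neyman-Pearson lemma, the optimal selection function reduces to a likelihood ratio test: accept a prediction only when the evidence for 'correct' outweighs the evidence for 'incorrect' by a chosen factor. A minimal numpy sketch, assuming density estimates for both groups are already available:

```python
import numpy as np

def likelihood_ratio_select(p_correct, p_incorrect, threshold):
    """Accept when p(x | correct) / p(x | incorrect) >= threshold; abstain otherwise.

    `p_correct` and `p_incorrect` are (estimated) densities of some confidence
    feature under correctly vs. incorrectly classified examples.
    """
    ratio = p_correct / np.clip(p_incorrect, 1e-12, None)
    return ratio >= threshold

# Toy densities for three test points.
p_c = np.array([0.30, 0.05, 0.20])
p_i = np.array([0.05, 0.25, 0.19])
print(likelihood_ratio_select(p_c, p_i, threshold=1.5))  # [ True False False]
```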
Authors:Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy
Abstract:
Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly more expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to \(2.2\times\) end-to-end speedups for models like OPT, LLaMA-2 \& 3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
中文: Polar Sparsity通过将稀疏计算重点转向注意力层,实现了大规模语言模型推理加速,在不同批处理规模下达到2.2倍提速且保持精度无损。
English: Polar Sparsity accelerates large language model inference by shifting focus to attention layer sparsity, achieving up to 2.2× speedup across various models and batch sizes without accuracy loss.
Authors:So Won Jeong, Claire Donnat
Abstract:
Graph Neural Networks (GNNs) are increasingly used in conjunction with unsupervised learning techniques to learn powerful node representations, but their deployment is hindered by their high sensitivity to hyperparameter tuning and the absence of established methodologies for selecting the optimal models. To address these challenges, we propose LOBSTUR-GNN ({\bf Lo}cal {\bf B}oot{\bf s}trap for {\bf T}uning {\bf U}nsupervised {\bf R}epresentations in GNNs), a novel framework designed to adapt bootstrapping techniques for unsupervised graph representation learning. LOBSTUR-GNN tackles two main challenges: (a) adapting the bootstrap edge and feature resampling process to account for local graph dependencies in creating alternative versions of the same graph, and (b) establishing robust metrics for evaluating learned representations without ground-truth labels. Using locally bootstrapped resampling and leveraging Canonical Correlation Analysis (CCA) to assess embedding consistency, LOBSTUR provides a principled approach for hyperparameter tuning in unsupervised GNNs. We validate the effectiveness and efficiency of our proposed method through extensive experiments on established academic datasets, showing a 65.9\% improvement in classification accuracy compared to an uninformed selection of hyperparameters. Finally, we deploy our framework on a real-world application, thereby demonstrating its validity and practical utility in various settings. The code is available at https://github.com/sowonjeong/lobstur-graph-bootstrap.
中文: LOBSTUR-GNN框架通过局部引导重采样和典型相关分析,解决了无监督图神经网络对超参数敏感的问题,相比随机参数选择实现了65.9%的分类准确率提升。
English: The LOBSTUR-GNN framework addresses hyperparameter sensitivity in unsupervised graph neural networks by employing local bootstrap resampling and canonical correlation analysis for robust model evaluation, achieving 65.9% higher classification accuracy than random parameter selection.
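The embedding-consistency check can be sketched directly: compute canonical correlations between node embeddings obtained from two locally bootstrapped versions of the graph and use their mean as a stability score. The numpy CCA below is generic and written under my own assumptions, not the authors' implementation.

```python
import numpy as np

def cca_consistency(X, Y, eps=1e-8):
    """Mean canonical correlation between two embedding matrices (n x d1, n x d2)."""
    X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.clip(w, eps, None) ** -0.5) @ V.T

    T = inv_sqrt(X.T @ X / n) @ (X.T @ Y / n) @ inv_sqrt(Y.T @ Y / n)
    corrs = np.linalg.svd(T, compute_uv=False)
    return float(np.clip(corrs, 0.0, 1.0).mean())

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))
print(cca_consistency(Z, Z + 0.1 * rng.normal(size=Z.shape)))  # close to 1: stable embeddings
print(cca_consistency(Z, rng.normal(size=(200, 16))))          # much lower: unrelated embeddings
```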
Authors:Juan Nathaniel, Carla Roesch, Jatan Buch, Derek DeSantis, Adam Rupe, Kara Lamb, Pierre Gentine
Abstract:
We use a deep Koopman operator-theoretic formalism to develop a novel causal discovery algorithm, Kausal. Causal discovery aims to identify cause-effect mechanisms for better scientific understanding, explainable decision-making, and more accurate modeling. Standard statistical frameworks, such as Granger causality, lack the ability to quantify causal relationships in nonlinear dynamics due to the presence of complex feedback mechanisms, timescale mixing, and nonstationarity. This presents a challenge in studying many real-world systems, such as the Earth's climate. Meanwhile, Koopman operator methods have emerged as a promising tool for approximating nonlinear dynamics in a linear space of observables. In Kausal, we propose to leverage this powerful idea for causal analysis where optimal observables are inferred using deep learning. Causal estimates are then evaluated in a reproducing kernel Hilbert space, and defined as the distance between the marginal dynamics of the effect and the joint dynamics of the cause-effect observables. Our numerical experiments demonstrate Kausal's superior ability in discovering and characterizing causal signals compared to existing approaches of prescribed observables. Lastly, we extend our analysis to observations of El Niño-Southern Oscillation highlighting our algorithm's applicability to real-world phenomena. Our code is available at https://github.com/juannat7/kausal.
中文摘要:Kausal算法采用深度Koopman算子方法,通过深度学习推断最优观测量并在再生核希尔伯特空间中评估因果关系,在非线性系统中展现出优于现有方法的因果发现能力,并成功应用于厄尔尼诺-南方涛动等实际气候现象分析。
English Summary: The Kausal algorithm employs a deep Koopman operator approach to discover causal relationships in nonlinear systems by learning optimal observables through deep learning and evaluating causality in a reproducing kernel Hilbert space, demonstrating superior performance in both simulations and real-world climate data analysis.
Authors:Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal
Abstract:
We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10$\times$ larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and on established general purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all available as open source under the Apache 2.0 License available at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.
中文: Toto是一个拥有1.51亿参数的时间序列预测基础模型,在可观测性和通用基准测试中均达到最优性能,其模型和评估代码已开源。
English: Toto is a 151-million-parameter time series forecasting foundation model that achieves state-of-the-art performance on observability and general benchmarks, with its model and evaluation code being open-sourced.
Authors:Chih-Yu Chang, Milad Azvar, Chinedum Okwudire, Raed Al Kontar
Abstract:
Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.
中文: 本文提出LLINBO混合框架,通过结合大语言模型的语境探索能力和高斯过程等统计代理模型的原理化开发,实现了贝叶斯优化的理论保障,并提供了3D打印的实际应用验证。
English: The paper introduces LLINBO, a hybrid Bayesian optimization framework that integrates Large Language Models for contextual exploration and statistical surrogate models like Gaussian Processes for principled exploitation, with theoretical guarantees and a 3D printing application.
Authors:Hong Huang, Dapeng Wu
Abstract:
Large language models (LLMs) have made exciting achievements across various domains, yet their deployment on resource-constrained personal devices remains hindered by the prohibitive computational and memory demands of task-specific fine-tuning. While quantization offers a pathway to efficiency, existing methods struggle to balance performance and overhead, either incurring high computational/memory costs or failing to address activation outliers, a critical bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (OSSH): During fine-tuning, certain activation outlier channels retain stable spatial positions across training iterations. Building on OSSH, we propose Quaff, a Quantized parameter-efficient fine-tuning framework for LLMs, optimizing low-precision activation representations through targeted momentum scaling. Quaff dynamically suppresses outliers exclusively in invariant channels using lightweight operations, eliminating full-precision weight storage and global rescaling while reducing quantization errors. Extensive experiments across ten benchmarks validate OSSH and demonstrate Quaff's efficacy. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73x latency reduction and 30% memory savings over full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, reconciling the triple trade-off between efficiency, performance, and deployability. By enabling consumer-grade GPU fine-tuning (e.g., RTX 2080 Super) without sacrificing model utility, Quaff democratizes personalized LLM deployment. The code is available at https://github.com/Little0o0/Quaff.git.
中文: 提出的Quaff框架基于异常值空间稳定性假说,实现了大型语言模型的高效量化微调,在显著降低延迟和内存占用的同时保持或提升精度,从而推动模型在资源受限设备上的部署。
English: The proposed Quaff framework leverages the Outlier Spatial Stability Hypothesis to enable efficient quantized fine-tuning of large language models, achieving significant latency reduction and memory savings while maintaining or improving accuracy, thus facilitating deployment on resource-constrained devices.
Authors:Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Chen Jiang, Tan Pan, Xingmeng Zhang, Cenyu Liu, Zeyun Miao, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Yichi Zhang, Wenbo Zhang, Fengping Zhu, Limei Han, Yuan Qi, Chensen Lin, Yuan Cheng
Abstract:
Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5\% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: https://github.com/Xigui-Li/Aneumo.
Chinese: 本研究通过合成3D形状进行流体动力学模拟,构建了一个大规模颅内动脉瘤血流动力学数据集,旨在支持机器学习应用,以促进临床风险评估和生物流体研究的高效发展。
English: This study presents a large-scale hemodynamic dataset of intracranial aneurysms, created through computational fluid dynamics simulations on synthetic 3D shapes, to enable machine learning applications for efficient clinical risk assessment and biofluid research.
Authors:Xiaojie Gu, Guangxu Chen, Jungang Li, Jia-Chen Gu, Xuming Hu, Kai Zhang
Abstract:
Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT, a fundamentally new editing solution that is training-, subject-, and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method (which was also the fastest known approach) while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: https://github.com/XiaojieGu/UltraEdit.
中文: UltraEdit提出了一种高效且可扩展的终身模型编辑方法,通过单步参数调整和持续归一化策略,在显著提升编辑速度的同时降低内存消耗,使大型语言模型能在消费级硬件上实现大规模知识更新。
English: UltraEdit introduces a highly efficient and scalable lifelong model editing approach that significantly accelerates editing speeds while reducing memory usage, enabling extensive updates to large language models on consumer-grade hardware.
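For intuition, a minimal sketch of a training-free, closed-form parameter shift combined with running feature statistics. The ridge-regression update and the `RunningNorm` bookkeeping are generic assumptions in the spirit of the abstract, not UltraEdit's exact formulation.

```python
# Closed-form editing shift with lifelong-style running normalization, simplified sketch.
import numpy as np

class RunningNorm:
    """Accumulates per-dimension feature statistics across editing turns."""
    def __init__(self, dim, eps=1e-5):
        self.s = np.zeros(dim)    # running sum
        self.ss = np.zeros(dim)   # running sum of squares
        self.count = 0
        self.eps = eps

    def update(self, feats):      # feats: (n, dim), features from one editing turn
        self.s += feats.sum(axis=0)
        self.ss += (feats ** 2).sum(axis=0)
        self.count += feats.shape[0]

    def __call__(self, feats):
        mean = self.s / self.count
        var = self.ss / self.count - mean ** 2
        return (feats - mean) / np.sqrt(var + self.eps)

def parameter_shift(K, R, lam=1e-2):
    # K: (n, d_in) normalized keys, R: (n, d_out) desired output residuals.
    # Ridge least-squares: delta_W minimizes ||K delta_W.T - R||^2 + lam ||delta_W||^2.
    G = K.T @ K + lam * np.eye(K.shape[1])
    return np.linalg.solve(G, K.T @ R).T          # delta_W: (d_out, d_in)
```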
Authors:Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
Abstract:
Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
中文摘要:本文提出Quartet方法,通过硬件支持的FP4精度实现大语言模型的端到端训练,在保持与FP16和FP8相当性能的同时显著提升计算效率。
English Summary: This paper introduces Quartet, a hardware-supported method for accurate end-to-end FP4 training of large language models, enabling competitive performance with FP16 and FP8 while improving computational efficiency.
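A minimal sketch of FP4 (E2M1) fake quantization with a straight-through estimator, useful only for building intuition about the representable grid; Quartet itself relies on native Blackwell FP4 kernels rather than this emulation, and its scaling and rounding choices differ.

```python
# Emulated FP4 (E2M1) quantization with a straight-through gradient, simplified sketch.
import torch

_FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4_GRID = torch.cat([-_FP4_POS.flip(0)[:-1], _FP4_POS])             # symmetric 15-value grid

def fake_quant_fp4(x: torch.Tensor) -> torch.Tensor:
    grid = FP4_GRID.to(x.device, x.dtype)
    scale = x.abs().amax().clamp(min=1e-8) / 6.0            # map max |x| onto the largest code
    idx = (x / scale).unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
    q = grid[idx] * scale                                    # nearest representable FP4 value
    return x + (q - x).detach()                              # straight-through estimator
```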
Authors:Yilin Ye, Junchao Huang, Xingchen Zeng, Jiazhi Xia, Wei Zeng
Abstract:
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
Chinese Summary: 本文提出AKRMap这一新型降维技术,通过学习投影空间中度量景观的核回归,能够更准确地可视化跨模态嵌入,在定量实验中超越了PCA和t-SNE等传统方法。
English Summary: This paper introduces AKRMap, a novel dimensionality reduction technique that visualizes cross-modal embeddings more accurately by learning kernel regression of the metric landscape, outperforming traditional methods like PCA and t-SNE.
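A minimal sketch of kernel regression of a metric landscape over 2D projections, using fixed-bandwidth Nadaraya-Watson smoothing; AKRMap instead learns the projection network and adaptive generalized kernels jointly, so this is only the post-projection regression idea.

```python
# Nadaraya-Watson smoothing of a per-sample metric (e.g., CLIPScore) over a 2D projection.
import numpy as np

def kernel_regress(proj_2d, metric, grid, bandwidth=0.5):
    # proj_2d: (n, 2) projected embeddings; metric: (n,) per-sample scores; grid: (m, 2) query points
    d2 = ((grid[:, None, :] - proj_2d[None, :, :]) ** 2).sum(axis=-1)   # (m, n) squared distances
    w = np.exp(-d2 / (2 * bandwidth ** 2))                              # Gaussian kernel weights
    return (w @ metric) / np.maximum(w.sum(axis=1), 1e-12)              # smoothed metric landscape
```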
Authors:Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
Abstract:
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict for both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
中文: 随着AI模型通过“对齐伪装”等新方法规避检测,识别风险愈发困难,因此我们开发了LitmusValues评估流程,通过分析AI在价值困境中的优先选择来预测其潜在危险行为。
English: As AI models evolve with tactics like Alignment Faking, detecting risks grows more difficult, prompting the development of LitmusValues, an evaluation pipeline that identifies AI value priorities to predict risky behaviors through dilemmas and real-world benchmarks.
Authors:Fnu Mohbat, Mohammed J Zaki
Abstract:
Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generate recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities and retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.
中文摘要:KERL系统创新性地将食物知识图谱与大语言模型结合,通过个性化推荐、菜谱生成和营养分析功能,经实验验证其性能显著优于现有方法,提供了完整的饮食解决方案。
English Summary: KERL is a novel system that integrates food knowledge graphs with large language models to deliver personalized food recommendations, generate detailed recipes, and provide nutritional analysis, outperforming existing methods through comprehensive evaluation.
Authors:Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Abstract:
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
中文摘要:本文揭示了强化学习中验证器错误否定正确模型输出的普遍问题,并提出轻量级验证器TinyV来动态识别潜在误判,从而提升模型训练效果与收敛速度。
English Summary: This paper identifies the problem of false negatives in verifiers used for reinforcement learning of large language models, where correct answers are incorrectly rejected, and proposes TinyV, a lightweight verifier that mitigates this issue to improve training efficiency and accuracy.
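A minimal sketch of the verifier-fallback pattern described above: a strict rule-based check is escalated to a lightweight LLM judge only when it rejects an answer. The `llm_judge` callable and the normalization are illustrative assumptions, deliberately simpler than TinyV's actual pipeline.

```python
# Two-stage reward verification to reduce false negatives, simplified sketch.
def rule_based_verify(pred: str, gold: str) -> bool:
    # Strict normalization + exact match; prone to false negatives (e.g., "1/2" vs "0.5").
    norm = lambda s: s.strip().lower().replace(" ", "")
    return norm(pred) == norm(gold)

def reward(question: str, pred: str, gold: str, llm_judge) -> float:
    """llm_judge(question, pred, gold) -> bool is a small LLM-based equivalence check."""
    if rule_based_verify(pred, gold):
        return 1.0
    # Escalate only rejected answers to the lightweight verifier to keep overhead low.
    return 1.0 if llm_judge(question, pred, gold) else 0.0
```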
Authors:Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken
Abstract:
We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a puzzle using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-based and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. Our error analysis reveals systematic failures such as satisfiability bias, context inconsistency, and condition omission, highlighting limitations of current LLMs in search-based logical reasoning. Our code and data are publicly available at https://github.com/Anjiang-Wei/SATBench
中文: SATBench是一个通过基于布尔可满足性问题生成的逻辑谜题来评估大语言模型逻辑推理能力的基准,揭示了模型在搜索式推理中的显著局限性和系统性错误。
English: SATBench is a benchmark that assesses the logical reasoning of large language models using SAT-derived puzzles, revealing significant limitations in search-based reasoning and systematic errors like satisfiability bias.
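A minimal sketch of SAT-instance generation with difficulty controlled by the clause count, plus a brute-force satisfiability check; SATBench additionally translates each formula into a natural-language puzzle with LLMs and validates instances with LLM- and solver-based consistency checks.

```python
# Random k-SAT generation and a brute-force SAT/UNSAT label, simplified sketch.
import itertools
import random

def random_ksat(n_vars: int, n_clauses: int, k: int = 3, seed: int = 0):
    rng = random.Random(seed)
    clauses = []
    for _ in range(n_clauses):
        vars_ = rng.sample(range(1, n_vars + 1), k)          # k distinct variables
        clauses.append([v if rng.random() < 0.5 else -v for v in vars_])
    return clauses

def brute_force_sat(n_vars: int, clauses) -> bool:
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False                                              # no satisfying assignment

clauses = random_ksat(n_vars=8, n_clauses=40)                 # more clauses -> harder/UNSAT-prone
print("SAT" if brute_force_sat(8, clauses) else "UNSAT")
```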
Authors:Maksim Zhdanov, Vladislav Kurenkov
Abstract:
Neural network interatomic potentials have recently emerged as a promising research direction. However, popular deep learning models often lack auxiliary constraints grounded in physical laws, which could accelerate training and improve fidelity through physics-based regularization. In this work, we introduce $Φ$-Module, a universal plugin module that enforces Poisson's equation within the message-passing framework to learn electrostatic interactions in a self-supervised manner. Specifically, each atom-wise representation is encouraged to satisfy a discretized Poisson's equation, making it possible to acquire a potential $\boldsymbol{\phi}$ and a corresponding charge density $\boldsymbol{\rho}$ linked to the learnable Laplacian eigenbasis coefficients of a given molecular graph. We then derive an electrostatic energy term, crucial for improved total energy predictions. This approach integrates seamlessly into any existing neural potential with insignificant computational overhead. Experiments on the OE62 and MD22 benchmarks confirm that models combined with $Φ$-Module achieve robust improvements over baseline counterparts. For OE62, the error reduction ranges from 4.5\% to 17.8\%, and for MD22, baselines equipped with $Φ$-Module achieve the best results on 5 out of 14 cases. Our results underscore how embedding a first-principles constraint in neural interatomic potentials can significantly improve performance while remaining hyperparameter-friendly, memory-efficient and lightweight in training. Code will be available at \href{https://github.com/dunnolab/phi-module}{dunnolab/phi-module}.
中文:$Φ$-模块是一种通用插件,通过在神经网络原子间势中强制执行泊松方程来自监督学习静电相互作用,以最小的计算开销在基准测试中显著提升了性能。
English: The $Φ$-Module is a universal plugin that enforces Poisson's equation in neural network interatomic potentials to learn electrostatic interactions self-supervised, improving performance on benchmarks with minimal computational overhead.
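A minimal sketch of a discretized Poisson residual on a molecular graph, assuming generic per-atom `phi_head`/`rho_head` linear layers (hypothetical names); the actual $Φ$-Module ties these quantities to learnable Laplacian eigenbasis coefficients and derives an electrostatic energy term from them.

```python
# Self-supervised Poisson residual L*phi ≈ rho on a molecular graph, simplified sketch.
import torch
import torch.nn as nn

def graph_laplacian(edge_index, n_nodes):
    A = torch.zeros(n_nodes, n_nodes)
    A[edge_index[0], edge_index[1]] = 1.0
    A = torch.maximum(A, A.t())                    # symmetrize undirected bonds
    return torch.diag(A.sum(dim=1)) - A            # combinatorial Laplacian L = D - A

def poisson_residual_loss(h, edge_index, phi_head, rho_head):
    # h: (n_atoms, d) atom-wise representations from any message-passing backbone
    L = graph_laplacian(edge_index, h.shape[0])
    phi = phi_head(h).squeeze(-1)                  # predicted per-atom potential
    rho = rho_head(h).squeeze(-1)                  # predicted per-atom charge density
    return ((L @ phi - rho) ** 2).mean()           # penalize deviation from L*phi = rho

# usage with hypothetical linear heads
h = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
loss = poisson_residual_loss(h, edge_index, nn.Linear(16, 1), nn.Linear(16, 1))
```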
Authors:Isabella Degen, Zahraa S Abdallah, Henry W J Reeve, Kate Robson Brown
Abstract:
Time series clustering promises to uncover hidden structural patterns in data with applications across healthcare, finance, industrial systems, and other critical domains. However, without validated ground truth information, researchers cannot objectively assess clustering quality or determine whether poor results stem from absent structures in the data, algorithmic limitations, or inappropriate validation methods, raising the question whether clustering is "more art than science" (Guyon et al., 2009). To address these challenges, we introduce CSTS (Correlation Structures in Time Series), a synthetic benchmark for evaluating the discovery of correlation structures in multivariate time series data. CSTS provides a clean benchmark that enables researchers to isolate and identify specific causes of clustering failures by differentiating between correlation structure deterioration and limitations of clustering algorithms and validation methods. Our contributions are: (1) a comprehensive benchmark for correlation structure discovery with distinct correlation structures, systematically varied data conditions, established performance thresholds, and recommended evaluation protocols; (2) empirical validation of correlation structure preservation showing moderate distortion from downsampling and minimal effects from distribution shifts and sparsification; and (3) an extensible data generation framework enabling structure-first clustering evaluation. A case study demonstrates CSTS's practical utility by identifying an algorithm's previously undocumented sensitivity to non-normal distributions, illustrating how the benchmark enables precise diagnosis of methodological limitations. CSTS advances rigorous evaluation standards for correlation-based time series clustering.
中文:时间序列聚类旨在揭示隐藏模式,但缺乏真实基准难以评估,因此我们开发了CSTS合成基准,它能区分相关结构退化与算法局限,从而精确诊断方法缺陷。
English: Time series clustering aims to reveal hidden patterns but faces evaluation challenges without ground truth, prompting the development of CSTS, a synthetic benchmark that isolates clustering failures and enables precise diagnosis of methodological limitations.
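A minimal sketch of generating a multivariate series with a prescribed correlation structure via a Cholesky factor, in the spirit of CSTS's structure-first synthetic data; the benchmark's actual generation framework, data conditions, and degradation protocols are considerably richer.

```python
# Synthesize a multivariate series whose channels follow a target correlation matrix.
import numpy as np

def correlated_series(corr: np.ndarray, n_steps: int, seed: int = 0) -> np.ndarray:
    L = np.linalg.cholesky(corr)                     # corr must be positive definite
    z = np.random.default_rng(seed).standard_normal((n_steps, corr.shape[0]))
    return z @ L.T                                   # empirical correlation ≈ corr

corr = np.array([[1.0, 0.8, 0.0],
                 [0.8, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
X = correlated_series(corr, n_steps=5000)
print(np.round(np.corrcoef(X.T), 2))                 # recovers the prescribed structure
```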
Authors:Menglin Yang, Yifei Zhang, Jialin Chen, Melanie Weber, Rex Ying
Abstract:
In the era of foundation models and Large Language Models (LLMs), Euclidean space is the de facto geometric setting of our machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. To that end, non-Euclidean learning is quickly gaining traction, particularly in web-related applications where complex relationships and structures are prevalent. Non-Euclidean spaces, such as hyperbolic, spherical, and mixed-curvature spaces, have been shown to provide more efficient and effective representations for data with intrinsic geometric properties, including web-related data like social network topology, query-document relationships, and user-item interactions. Integrating foundation models with non-Euclidean geometries has great potential to enhance their ability to capture and model the underlying structures, leading to better performance in search, recommendations, and content understanding. This workshop focuses on the intersection of Non-Euclidean Foundation Models and Geometric Learning (NEGEL), exploring its potential benefits for advancing web-related technologies, along with open challenges and future directions. Workshop page: [https://hyperboliclearning.github.io/events/www2025workshop](https://hyperboliclearning.github.io/events/www2025workshop)
Authors:Peter Baile Chen, Yi Zhang, Dan Roth, Samuel Madden, Jacob Andreas, Michael Cafarella
Abstract:
While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply it in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time to enhance the model's ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.
Authors:Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
Abstract:
World models, which predict transitions based on historical observations and action sequences, have shown great promise in improving data efficiency for sequential decision making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their applicability in complex environments. In contrast, video diffusion models trained on large, internet-scale datasets have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. Furthermore, it introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model. Extensive experiments in robot manipulation and game simulation domains show that our method offers a scalable and effective approach for repurposing highly capable video diffusion models to interactive world models.
Authors:Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Abstract:
Function approximation is a critical task in various fields. However, existing neural network approaches struggle with locally complex or discontinuous functions due to their reliance on a single global model covering the entire problem space. We propose X-KAN, a novel method that optimizes multiple local Kolmogorov-Arnold Networks (KANs) through an evolutionary rule-based machine learning framework called XCSF. X-KAN combines KAN's high expressiveness with XCSF's adaptive partitioning capability by implementing local KAN models as rule consequents and defining local regions via rule antecedents. Our experimental results on artificial test functions and real-world datasets demonstrate that X-KAN significantly outperforms conventional methods, including XCSF, Multi-Layer Perceptron, and KAN, in terms of approximation accuracy. Notably, X-KAN effectively handles functions with locally complex or discontinuous structures that are challenging for conventional KAN, using a compact set of rules (average 7.2 $\pm$ 2.3 rules). These results validate the effectiveness of using KAN as a local model in XCSF, which evaluates the rule fitness based on both accuracy and generality. Our X-KAN implementation is available at https://github.com/YNU-NakataLab/X-KAN.
中文: X-KAN是一种将科尔莫戈罗夫-阿诺德网络与进化框架相结合的新方法,通过优化局部模型,以紧凑的规则集显著优于传统方法,能有效处理复杂或不连续函数并实现高精度逼近。
English: X-KAN is a novel method that combines Kolmogorov-Arnold Networks with an evolutionary framework to optimize local models, significantly outperforming conventional approaches in handling complex or discontinuous functions with high accuracy using compact rule sets.
Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Abstract:
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA's expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
中文: ABBA提出了一种新的参数高效微调架构,通过两个独立低秩矩阵的哈达玛积实现与预训练权重的完全解耦,在相同参数预算下显著提升表达能力,并在多项推理基准测试中大幅领先现有方法。
English: ABBA introduces a novel parameter-efficient fine-tuning architecture that decouples updates from pre-trained weights using two independent low-rank matrices, achieving superior expressivity and state-of-the-art performance across reasoning benchmarks.
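A minimal sketch of the ABBA-style update as a Hadamard product of two independently learnable low-rank factors added to a frozen base layer; the initialization and scaling details here are simplified assumptions rather than the paper's exact recipe.

```python
# Adapter whose update is (B1 @ A1) ⊙ (B2 @ A2), fully decoupled from the frozen weight.
import torch
import torch.nn as nn

class ABBALinear(nn.Module):
    def __init__(self, base: nn.Linear, r1: int = 8, r2: int = 8, alpha: float = 1.0):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pre-trained weights stay frozen
        self.B1 = nn.Parameter(torch.randn(out_f, r1) * 0.02)
        self.A1 = nn.Parameter(torch.randn(r1, in_f) * 0.02)
        self.B2 = nn.Parameter(torch.randn(out_f, r2) * 0.02)
        self.A2 = nn.Parameter(torch.zeros(r2, in_f))     # zero init => zero update at start
        self.alpha = alpha

    def forward(self, x):
        delta = (self.B1 @ self.A1) * (self.B2 @ self.A2) # Hadamard of two low-rank maps
        return self.base(x) + self.alpha * (x @ delta.T)
```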
Authors:Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
Abstract:
Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
English Summary: Large language models' safety alignment is fragile and can be compromised during fine-tuning, as safety is deeply entangled with general learning components rather than isolated in distinct subspaces, limiting the effectiveness of subspace-based defenses.
Authors:Yihang Du, Jiaying Hu, Suyang Hou, Yueyang Ding, Xiaobo Sun
Abstract:
Spatial labeling assigns labels to specific spatial locations to characterize their spatial properties and relationships, with broad applications in scientific research and practice. Measuring the similarity between two spatial labelings is essential for understanding their differences and the contributing factors, such as changes in location properties or labeling methods. An adequate and unbiased measurement of spatial labeling similarity should consider the number of matched labels (label agreement), the topology of spatial label distribution, and the heterogeneous impacts of mismatched labels. However, existing methods often fail to account for all these aspects. To address this gap, we propose a methodological framework to guide the development of methods that meet these requirements. Given two spatial labelings, the framework transforms them into graphs based on location organization, labels, and attributes (e.g., location significance). The distributions of their graph attributes are then extracted, enabling an efficient computation of distributional discrepancy to reflect the dissimilarity level between the two labelings. We further provide a concrete implementation of this framework, termed Spatial Labeling Analogy Metric (SLAM), along with an analysis of its theoretical foundation, for evaluating spatial labeling results in spatial transcriptomics (ST) \textit{as per} their similarity with ground truth labeling. Through a series of carefully designed experimental cases involving both simulated and real ST data, we demonstrate that SLAM provides a comprehensive and accurate reflection of labeling quality compared to other well-established evaluation metrics. Our code is available at https://github.com/YihDu/SLAM.
中文摘要:本研究提出了一种方法框架及其实现SLAM,通过综合考虑标签一致性、空间拓扑和错配标签影响来精确度量空间标注相似性,实验证明其性能优于现有评估指标。
English Summary: This study introduces a methodological framework and its implementation, SLAM, for accurately measuring spatial labeling similarity by considering label agreement, topology, and mismatched label impacts, demonstrating superior performance over existing metrics through experiments.
Authors:Guangke Chen, Fu Song, Zhe Zhao, Xiaojun Jia, Yang Liu, Yanchen Qiao, Weizhe Zhang
Abstract:
Jailbreak attacks on large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability, particularly because they assume that the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack, featuring (1) asynchrony: the jailbreak audio does not need to align with user prompts in the time axis by crafting suffixal jailbreak audios; (2) universality: a single jailbreak perturbation is effective for different prompts by incorporating multiple prompts into perturbation generation; (3) stealthiness: the malicious intent of jailbreak audios will not raise the awareness of victims by proposing various intent concealment strategies; and (4) over-the-air robustness: the jailbreak audios remain effective when being played over the air by incorporating the reverberation distortion effect with room impulse response into the generation of the perturbations. In contrast, all prior audio jailbreak attacks cannot offer asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak is also applicable to adversaries who cannot fully manipulate user prompts, and thus has a much broader attack scenario. Extensive experiments covering the largest set of LALMs to date demonstrate the high effectiveness of AudioJailbreak. We highlight that our work sheds light on the security implications of audio jailbreak attacks against LALMs and realistically fosters improvements to their security robustness. The implementation and audio samples are available at our website https://audiojailbreak.github.io/AudioJailbreak.
中文: 本研究提出AudioJailbreak这一新型音频越狱攻击方法,通过异步性、通用性、隐蔽性和空中传播鲁棒性克服了现有方法的局限,在多种大型音频语言模型中展现出高效攻击能力。
English: This study introduces AudioJailbreak, a novel audio jailbreak attack that overcomes limitations of prior methods by offering asynchrony, universality, stealthiness, and over-the-air robustness, demonstrating high effectiveness across multiple large audio-language models.
Authors:Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
Abstract:
Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without any additional training, i.e., without any parameter updates. This robustness stems from the model's focus on robust features and its resistance to attacks that exploit non-predictive features. Besides these positive findings, we also identify several limitations. Under certain conditions (though unrealistic), no universally robust single-layer transformers exist. Moreover, robust transformers exhibit an accuracy--robustness trade-off and require a large number of in-context demonstrations. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.
中文:对抗性预训练的Transformer可作为鲁棒基础模型,通过上下文学习无需额外对抗训练即可泛化至未知任务,但也存在准确率-鲁棒性权衡及依赖大量示例等局限性。
English: Adversarially pretrained transformers can serve as robust foundation models, enabling generalization to unseen tasks without further adversarial training through in-context learning, though they face limitations like accuracy-robustness trade-offs and dependency on numerous demonstrations.
Authors:Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
Abstract:
We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).
中文: 本文介绍了${\rm C{\small LEVER}}$,这是一个包含161个问题的Lean验证代码生成基准,要求同时完成规范匹配和可证明实现,并通过Lean类型检查器严格验证正确性,避免了测试用例监督或逻辑泄露。
English: This paper introduces ${\rm C{\small LEVER}}$, a 161-problem benchmark for verified code generation in Lean that requires both specification matching and provable implementation, rigorously verified through Lean's type checker to ensure correctness without test-case supervision or leaked logic.
Authors:Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
Abstract:
World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.
Authors:Yanheng He, Jiahe Jin, Pengfei Liu
Abstract:
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.
中文: PC Agent-E 是一种高效的智能体训练框架,仅通过312条人工标注轨迹并结合合成动作决策,显著降低了对大规模人类示范数据的依赖,在基准测试中实现了141%的性能提升,并展现出强大的跨平台泛化能力。
English: PC Agent-E is an efficient training framework that significantly reduces the need for large-scale human demonstrations by using just 312 annotated trajectories enhanced with synthesized actions, achieving a 141% improvement on benchmarks and demonstrating strong cross-platform generalization.
Authors:Yingwei Zhang, Ke Bu, Zhuoran Zhuang, Tao Xie, Yao Yu, Dong Li, Yang Guo, Detao Lv
Abstract:
The past decades have witnessed significant advancements in time series forecasting (TSF) across various real-world domains, including e-commerce and disease spread prediction. However, TSF is usually constrained by the uncertainty dilemma of predicting future data with limited past observations. To address this issue, we explore the use of Cross-Future Behavior (CFB) in TSF, which occurs before the current time but takes effect in the future. We leverage CFB features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of the time series data to be predicted. Specifically, to address the sparsity and incompleteness of cross-future behavior, CRAFT employs the Koopman Predictor Module to extract the key trend and the Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce the External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply the demand-constrained loss to calibrate the distribution deviation of prediction results. We conduct experiments on a real-world dataset. Experiments on both an offline large-scale dataset and an online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code are available at https://github.com/CRAFTinTSF/CRAFT.
中文: 本文提出CRAFT方法,通过挖掘跨未来行为趋势并利用Koopman预测器等模块补全数据,有效解决了时间序列预测中的不确定性难题,真实场景实验验证了其优越性。
English: This paper introduces CRAFT, a time series forecasting method that leverages cross-future behavior to address prediction uncertainty by extracting and supplementing trends through specialized modules, demonstrating effectiveness in real-world experiments.
Authors:Tian Sun, Yuqi Chen, Baihua Zheng, Weiwei Sun
Abstract:
In real-world applications, GPS trajectories often suffer from low sampling rates, with large and irregular intervals between consecutive GPS points. This sparse characteristic presents challenges for their direct use in GPS-based systems. This paper addresses the task of map-constrained trajectory recovery, aiming to enhance trajectory sampling rates of GPS trajectories. Previous studies commonly adopt a sequence-to-sequence framework, where an encoder captures the trajectory patterns and a decoder reconstructs the target trajectory. Within this framework, effectively representing the road network and extracting relevant trajectory features are crucial for overall performance. Despite advancements in these models, they fail to fully leverage the complex spatio-temporal dynamics present in both the trajectory and the road network.
To overcome these limitations, we categorize the spatio-temporal dynamics of trajectory data into two distinct aspects: spatial-temporal traffic dynamics and trajectory dynamics. Furthermore, we propose TedTrajRec, a novel method for trajectory recovery. To capture spatio-temporal traffic dynamics, we introduce PD-GNN, which models periodic patterns and learns topologically aware dynamics concurrently for each road segment. For spatio-temporal trajectory dynamics, we present TedFormer, a time-aware Transformer that incorporates temporal dynamics for each GPS location by integrating closed-form neural ordinary differential equations into the attention mechanism. This allows TedFormer to effectively handle irregularly sampled data. Extensive experiments on three real-world datasets demonstrate the superior performance of TedTrajRec. The code is publicly available at https://github.com/ysygMhdxw/TEDTrajRec/.
中文: 本文提出TedTrajRec方法,通过PD-GNN建模时空交通动态和TedFormer变换器处理轨迹动态,有效提升了低采样率GPS轨迹的恢复效果,并在实验中展现出优越性能。
English: This paper introduces TedTrajRec, a novel method that enhances low-sampling-rate GPS trajectory recovery by modeling spatio-temporal traffic dynamics with PD-GNN and trajectory dynamics with the time-aware TedFormer Transformer, demonstrating superior performance in experiments.
Authors:Matthew Raffel, Lizhong Chen
Abstract:
The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multi-layer perceptron (MLP) with its increased expressiveness and interpretability. However, the KAN can be orders of magnitude slower due to its increased computational cost and training instability, limiting its applicability to larger-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, which can achieve FLOPs similar to the traditional Transformer with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our characterizations reveal that the KAT is still 123x slower in training speed, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls and, more specifically, to the backward pass of GR-KAN caused by inefficient gradient accumulation. To address this memory bottleneck, we propose FlashKAT, which builds on our restructured kernel that minimizes gradient accumulation with atomic adds and accesses to slow memory. Evaluations demonstrate that FlashKAT can achieve a training speedup of 86.5x compared with the state-of-the-art KAT, while reducing rounding errors in the coefficient gradients. Our code is available at https://github.com/OSU-STARLAB/FlashKAT.
中文: Kolmogorov-Arnold Transformer (KAT) 因梯度累积中的内存瓶颈导致训练速度大幅下降,而 FlashKAT 通过重构内核减少内存延迟,实现了 86.5 倍的加速并提高了梯度精度。
English: The Kolmogorov-Arnold Transformer (KAT) suffers from significant training slowdowns due to memory bottlenecks in gradient accumulation, which FlashKAT addresses by restructuring the kernel to minimize memory stalls, achieving an 86.5x speedup while improving gradient accuracy.
Authors:Chenning Yu, Sicun Gao
Abstract:
We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves relatively lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improved the condition alignment for compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Our code is available at http://rainorangelemon.github.io/complift.
Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Abstract:
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat{C}_n^{\tilde{\alpha}}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde{\alpha}]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tilde{\alpha}$, and (ii) a novel leave-one-out estimator $\hat{\alpha}^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde{\alpha}]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical, such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
中文: 反向共形预测通过基于观测数据自适应控制预测集大小来确保共形覆盖,利用e值有效性和新颖的留一估计器保持实际可计算性,特别适用于医学诊断等大预测集不实用的场景。
English: Backward Conformal Prediction ensures conformal coverage by adaptively controlling prediction set sizes based on observed data, utilizing e-value validity and a novel leave-one-out estimator to maintain practical computability, especially in applications like medical diagnosis where large sets are impractical.
Authors:Fynn Fromme, Hans Harder, Christine Allen-Blanchette, Sebastian Peitz
Abstract:
The use of machine learning for modeling, understanding, and controlling large-scale physics systems is quickly gaining in popularity, with examples ranging from electromagnetism over nuclear fusion reactors and magneto-hydrodynamics to fluid mechanics and climate modeling. These systems - governed by partial differential equations - present unique challenges regarding the large number of degrees of freedom and the complex dynamics over many scales both in space and time, and additional measures to improve accuracy and sample efficiency are highly desirable. We present an end-to-end equivariant surrogate model consisting of an equivariant convolutional autoencoder and an equivariant convolutional LSTM using $G$-steerable kernels. As a case study, we consider the three-dimensional Rayleigh-Bénard convection, which describes the buoyancy-driven fluid flow between a heated bottom and a cooled top plate. While the system is E(2)-equivariant in the horizontal plane, the boundary conditions break the translational equivariance in the vertical direction. Our architecture leverages vertically stacked layers of $D_4$-steerable kernels, with additional partial kernel sharing in the vertical direction for further efficiency improvement. We demonstrate significant gains in sample and parameter efficiency, as well as a better scaling to more complex dynamics. The accompanying code is available under https://github.com/FynnFromme/equivariant-rb-forecasting.
中文: 机器学习正广泛应用于建模和控制由偏微分方程主导的大规模物理系统,我们提出的等变替代模型在三维瑞利-贝纳德对流等复杂动力学中显著提升了准确性和效率。
English: Machine learning is increasingly applied to model and control large-scale physics systems governed by partial differential equations, with our proposed equivariant surrogate model enhancing accuracy and efficiency in complex dynamics like three-dimensional Rayleigh-Bénard convection.
Authors:Yiru Jiao, Simeon C. Calvert, Sander van Cranenburgh, Hans van Lint
Abstract:
Accurately and proactively alerting drivers or automated systems to emerging collisions is crucial for road safety, particularly in highly interactive and complex urban environments. However, existing approaches to identifying potential collisions either require labour-intensive annotation of sparse risk, struggle to consider varying contextual factors, or are only useful in specific scenarios. To address these limits, this study introduces the Generalised Surrogate Safety Measure (GSSM), a new data-driven approach that learns collision risk exclusively from naturalistic driving without the need for crash or risk labels. GSSM captures the patterns of normal driving and estimates the extent to which a traffic interaction deviates from the norm towards an unsafe state. Diverse data from naturalistic driving, including motion kinematics, weather, lighting, etc., are used to train multiple GSSMs, which are tested with 2,591 reconstructed real-world crashes and near-crashes. These test events are also released here as the largest dataset of its kind to date. A basic GSSM using only instantaneous motion kinematics achieves an area under the precision-recall curve of 0.9 and secures a median time advance of 2.6 seconds to prevent potential collisions. Additional interaction patterns and contextual factors provide further performance gains. Across various types of collision risk scenarios (such as rear-end, merging, and turning interactions), the accuracy and timeliness of GSSM consistently outperforms existing baselines. GSSM therefore establishes a scalable, context-aware, and generalisable foundation for proactively quantifying collision risk in traffic interactions. This can support and facilitate autonomous driving systems, traffic safety assessment, and road emergency management. Code and experiment data are openly accessible at https://github.com/Yiru-Jiao/GSSM.
中文: 本研究提出的广义替代安全度量(GSSM)通过无标签自然驾驶数据学习碰撞风险,在多种碰撞场景中均能实现精准及时的预警,其性能全面优于现有方法,为自动驾驶和交通安全评估提供了可扩展的解决方案。
English: This study introduces the Generalised Surrogate Safety Measure (GSSM), a data-driven method that learns collision risk from naturalistic driving data without requiring labeled incidents, achieving high accuracy and timeliness in predicting various collision scenarios while outperforming existing approaches.
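One simple way to instantiate "learn normal driving, score deviation as risk": fit a Gaussian mixture density over kinematic features and use the negative log-likelihood as a risk score. GSSM's actual models and contextual inputs (weather, lighting, interaction patterns) are far richer; the component count here is an arbitrary assumption.

```python
# Density-based deviation scoring for traffic interactions, simplified sketch.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_normal_driving(features: np.ndarray, n_components: int = 8) -> GaussianMixture:
    # features: (n_samples, d) kinematics from naturalistic driving (no crash/risk labels)
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(features)

def collision_risk_score(model: GaussianMixture, features: np.ndarray) -> np.ndarray:
    return -model.score_samples(features)        # larger = further from normal driving
```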
Authors:Pengxin Guo, Yinong Wang, Wei Li, Mengting Liu, Ming Li, Jinkai Zheng, Liangqiong Qu
Abstract:
LLM pruning has emerged as a promising technology for compressing LLMs, enabling their deployment on resource-limited devices. However, current methodologies typically require access to public calibration samples, which can be challenging to obtain in privacy-sensitive domains. To address this issue, we introduce FedPrLLM, a comprehensive federated pruning framework designed for the privacy-preserving compression of LLMs. In FedPrLLM, each client only needs to calculate a pruning mask matrix based on its local calibration data and share it with the server to prune the global model. This approach allows for collaborative pruning of the global model with the knowledge of each client while maintaining local data privacy. Additionally, we conduct extensive experiments to explore various possibilities within the FedPrLLM framework, including different comparison groups, pruning strategies, and the decision to scale weights. Our extensive evaluation reveals that one-shot pruning with layer comparison and no weight scaling is the optimal choice within the FedPrLLM framework. We hope our work will help guide future efforts in pruning LLMs in privacy-sensitive fields. Our code is available at https://github.com/Pengxin-Guo/FedPrLLM.
Chinese: FedPrLLM提出了一种联邦剪枝框架,通过客户端基于本地校准数据计算剪枝掩码并与服务器共享,实现了大型语言模型的隐私保护压缩,无需公开校准样本,有效维护了数据隐私。
English: FedPrLLM introduces a federated pruning framework that enables privacy-preserving compression of large language models by allowing clients to compute pruning masks locally and share them with a server, eliminating the need for public calibration data and ensuring data confidentiality.
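A minimal sketch of federated mask computation and server-side aggregation by majority vote; the per-weight importance score (a Wanda-style |W| scaled by input activation norms) and the voting rule are generic assumptions, not FedPrLLM's specific comparison groups or its one-shot layer-wise strategy.

```python
# Client-side pruning masks from local calibration data, aggregated by the server.
import numpy as np

def local_mask(weight: np.ndarray, act_norm: np.ndarray, sparsity: float) -> np.ndarray:
    # weight: (out, in); act_norm: (in,) input activation norms from local calibration data
    score = np.abs(weight) * act_norm            # per-weight importance score
    k = int(score.size * (1 - sparsity))         # number of weights to keep
    thresh = np.partition(score.ravel(), -k)[-k]
    return (score >= thresh).astype(np.int8)     # 1 = keep, 0 = prune

def server_aggregate(masks: list, sparsity: float) -> np.ndarray:
    votes = np.sum(masks, axis=0)                # how many clients keep each weight
    k = int(votes.size * (1 - sparsity))
    thresh = np.partition(votes.ravel(), -k)[-k]
    return (votes >= thresh).astype(np.int8)     # final mask applied to the global model
```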
Authors:Rodrigo Fritz, Pablo Suárez-Serrato, Victor Mijangos, Anayanzi D. Martinez-Hernandez, Eduardo Ivan Velazquez Richards
Abstract:
We present EuLearn, the first surface datasets that equitably represent a diversity of topological types. We designed our embedded surfaces of uniformly varying genera by relying on random knots, thus allowing our surfaces to knot with themselves. EuLearn contributes new topological datasets of meshes, point clouds, and scalar fields in 3D. We aim to facilitate the training of machine learning systems that can discern topological features. We experimented with specific emblematic 3D neural network architectures, finding that their vanilla implementations perform poorly on genus classification. To enhance performance, we developed a novel, non-Euclidean, statistical sampling method adapted to graph and manifold data. We also introduce adjacency-informed adaptations of PointNet and Transformer architectures that rely on our non-Euclidean sampling strategy. Our results demonstrate that incorporating topological information into deep learning workflows significantly improves performance on these otherwise challenging EuLearn datasets.
中文: EuLearn提供了首个公平代表多种拓扑类型的表面数据集,通过创新的非欧几里得采样方法和改进的PointNet与Transformer架构,显著提升了深度学习模型在拓扑特征识别上的性能。
English: EuLearn introduces the first equitable surface datasets with diverse topological types, enhancing machine learning systems' ability to discern topological features through novel non-Euclidean sampling and adapted architectures like PointNet and Transformer.
Authors:Avinash Patil, Siru Tao, Amardeep Gedhu
Abstract:
Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm, where individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models, including Claude, GPT, Mistral, and LLaMA, in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.
中文摘要:本研究评估了六种大型语言模型使用C-SSRS量表进行自杀风险评估的能力,发现Claude和GPT与人工标注最为接近,同时强调了部署过程中的伦理考量。
English Summary: This study evaluates six large language models for automated suicide risk assessment using the C-SSRS scale, finding that Claude and GPT align best with human ratings while highlighting ethical considerations for deployment.
Authors:Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li
Abstract:
Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.
中文:AdaptThink是一种新颖的强化学习算法,能根据任务难度自适应选择思考与无思考模式,在多种推理任务中显著降低推理成本的同时提升性能表现。
English: AdaptThink is a novel reinforcement learning algorithm that adaptively selects between thinking and no-thinking modes based on task difficulty, significantly reducing inference costs while improving performance across various reasoning tasks.
Authors:David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, Genta Indra Winata
Abstract:
Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3.
中文: 提出的R3框架采用了一种与评分标准无关、可推广的奖励建模方法,通过提供可解释的评分分配来增强语言模型与多样化人类偏好对齐的透明度和灵活性。
English: The proposed R3 framework introduces a rubric-agnostic, generalizable reward modeling approach that provides interpretable score assignments to enhance transparency and flexibility in aligning language models with diverse human preferences.
Authors:Nimrod Berman, Ilan Naiman, Moshe Eliasof, Hedi Zisling, Omri Azencot
Abstract:
Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory, a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. Empirically, KDM achieves state-of-the-art performance across standard offline distillation benchmarks, improving FID scores by up to 40% in a single generation step. All implementation details and code for the experimental setups are provided in our GitHub repository (https://github.com/azencot-group/KDM) and on our project page (https://sites.google.com/view/koopman-distillation-model).
中文总结:Koopman蒸馏模型(KDM)基于Koopman理论提出离线蒸馏框架,通过线性算子实现单步图像生成并保持语义保真度,在基准测试中实现FID指标最高40%的性能提升。
English Summary: The Koopman Distillation Model (KDM) introduces an offline distillation framework using Koopman theory to enable single-step image generation while maintaining semantic fidelity, achieving state-of-the-art performance with up to 40% FID improvement.
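The one-step generation path described above (encode a noisy sample, propagate it with a learned linear Koopman operator, decode a clean sample) can be sketched as below. Module sizes and layer choices are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn

class KoopmanDistillationSketch(nn.Module):
    """Encoder -> learned linear operator -> decoder, enabling one-step sampling (sketch)."""
    def __init__(self, data_dim: int = 784, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 256), nn.SiLU(), nn.Linear(256, latent_dim))
        self.koopman = nn.Linear(latent_dim, latent_dim, bias=False)   # linear dynamics in the embedded space
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.SiLU(), nn.Linear(256, data_dim))

    def forward(self, noise: torch.Tensor) -> torch.Tensor:
        z = self.encoder(noise)          # embed the noisy input
        z_next = self.koopman(z)         # propagate forward with the linear Koopman operator
        return self.decoder(z_next)      # reconstruct a clean sample in a single step

model = KoopmanDistillationSketch()
x0 = model(torch.randn(4, 784))          # one-step generation from pure noise
```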
Authors:Shuqing Luo, Pingzhi Li, Jie Peng, Hanrui Wang, Yang Zhao, Yu Cao, Yu Cheng, Tianlong Chen
Abstract:
Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over $40\%$ of the runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them "collaborated"; this comprises two cases, intra- and inter-collaboration, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. This motivates us to strategically optimize collaborative communication for accelerated MoE training and inference, dubbed Occult. Our designs are capable of either delivering exact results with reduced communication cost or controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult can be faster than popular state-of-the-art inference or training frameworks (more than $1.5\times$ speed-up across multiple tasks and models) with comparable or superior quality compared to standard fine-tuning. Code is available at https://github.com/UNITES-Lab/Occult.
中文: Occult系统通过优化专家协作通信策略,减少混合专家模型中的通信开销,在保持模型质量的同时实现超过1.5倍的加速效果。
English: The proposed Occult system reduces communication overhead in mixture-of-experts models by optimizing collaborative communication through strategic expert placement and collaboration pruning, achieving over 1.5× speedup while maintaining model quality.
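A small sketch of the collaboration bookkeeping defined above: given the experts co-activated by each token and an expert-to-device placement, count intra- versus inter-device collaborations. Variable names and the toy routing are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def collaboration_stats(routes, device_of):
    """routes: per-token tuples of expert ids (top-k routing); device_of: expert id -> device id."""
    intra, inter = Counter(), Counter()
    for experts in routes:
        for a, b in combinations(sorted(experts), 2):   # every co-activated expert pair
            if device_of[a] == device_of[b]:
                intra[(a, b)] += 1                      # same device: no all-to-all traffic for this pair
            else:
                inter[(a, b)] += 1                      # crosses devices: incurs communication cost
    return intra, inter

# toy example: 4 experts placed on 2 devices, top-2 routing for 3 tokens
device_of = {0: 0, 1: 0, 2: 1, 3: 1}
intra, inter = collaboration_stats([(0, 1), (0, 2), (2, 3)], device_of)
print(sum(intra.values()), "intra vs", sum(inter.values()), "inter collaborations")
```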
Authors:Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen, Aysegul Dundar
Abstract:
We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
Authors:Paula Feldman, Martin Sinnona, Claudio Delrieux, Viviana Siless, Emmanuel Iarussi
Abstract:
Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry makes accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous methods' parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code is available at https://github.com/LIA-DiTella/VesselGPT-MICCAI.
中文摘要:本文提出一种结合VQ-VAE与GPT-2的自回归方法,通过离散化表征和B样条参数化实现高保真血管树合成,在保留形态细节的同时首次实现了血管的自回归生成。
English Summary: This paper introduces an autoregressive method using VQ-VAE and GPT-2 to synthesize anatomical trees, achieving high-fidelity reconstruction with compact representations while preserving morphological details through B-spline parameterization.
Authors:Gabriele Spadaro, Alberto Presta, Jhony H. Giraldo, Marco Grangetto, Wei Hu, Giuseppe Valenzise, Attilio Fiandrotti, Enzo Tartaglione
Abstract:
Efficient compression of low-bit-rate point clouds is critical for bandwidth-constrained applications. However, existing techniques mainly focus on high-fidelity reconstruction, requiring many bits for compression. This paper proposes a "Denoising Diffusion Probabilistic Model" (DDPM) architecture for point cloud compression (DDPM-PCC) at low bit-rates. A PointNet encoder produces the condition vector for the generation, which is then quantized via a learnable vector quantizer. This configuration makes it possible to achieve low bit rates while preserving quality. Experiments on ShapeNet and ModelNet40 show improved rate-distortion at low rates compared to standardized and state-of-the-art approaches. We publicly released the code at https://github.com/EIDOSLAB/DDPM-PCC.
中文: 本文提出一种基于DDPM的点云压缩方法,通过PointNet编码器和可学习向量量化器在低比特率下实现高效压缩,在基准数据集上展现出优于现有技术的率失真性能。
English: This paper introduces a DDPM-based point cloud compression method that achieves efficient low-bit-rate performance by using a PointNet encoder and learnable vector quantizer, demonstrating superior rate-distortion results on benchmark datasets.
Authors:Yifu Cai, Xinyu Li, Mononito Goswami, Michał Wiliński, Gus Welter, Artur Dubrawski
Abstract:
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM-based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open-source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
中文: TimeSeriesGym是一个可扩展的基准测试框架,通过整合多领域任务和灵活评估机制,全面评估AI代理在时序机器学习工程挑战中的综合能力,显著提升了评估的实用性和扩展性。
English: TimeSeriesGym is a scalable benchmarking framework designed to comprehensively evaluate AI agents on diverse time series machine learning engineering challenges, incorporating multi-domain tasks and flexible evaluation mechanisms for enhanced practical relevance.
Authors:Anthony Zhou, Amir Barati Farimani
Abstract:
Many architectures for neural PDE surrogates have been proposed in recent years, largely based on neural networks or operator learning. In this work, we derive and propose a new architecture, the Neural Functional, which learns function to scalar mappings. Its implementation leverages insights from operator learning and neural fields, and we show the ability of neural functionals to implicitly learn functional derivatives. For the first time, this allows for an extension of Hamiltonian mechanics to neural PDE surrogates by learning the Hamiltonian functional and optimizing its functional derivatives. We demonstrate that the Hamiltonian Neural Functional can be an effective surrogate model through improved stability and conserving energy-like quantities on 1D and 2D PDEs. Beyond PDEs, functionals are prevalent in physics; functional approximation and learning with its gradients may find other uses, such as in molecular dynamics or design optimization.
Chinese: 哈密顿神经泛函通过神经网络表示哈密顿泛函并优化其泛函导数,将哈密顿力学扩展到神经偏微分方程代理模型中,从而在1D和2D系统中实现了更高的稳定性和能量类守恒量的保持。
English: The Hamiltonian Neural Functional extends Hamiltonian mechanics to neural PDE surrogates by representing the Hamiltonian functional with a neural network and optimizing its functional derivatives, enabling improved stability and conservation of energy-like quantities across 1D and 2D systems.
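One way to read "learning the Hamiltonian functional and optimizing its functional derivatives" is sketched below: a small network maps a discretized field u to a scalar H(u), and autograd supplies the discrete functional derivative used to step the dynamics. The network, the toy gradient-flow step, and the discretization are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HamiltonianFunctional(nn.Module):
    """Maps a discretized 1D field u (shape [N]) to a scalar energy H(u)."""
    def __init__(self, n_points: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_points, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return self.net(u).squeeze(-1)

def functional_derivative(H: nn.Module, u: torch.Tensor) -> torch.Tensor:
    """Discrete delta H / delta u obtained via autograd."""
    u = u.detach().requires_grad_(True)
    energy = H(u).sum()
    return torch.autograd.grad(energy, u)[0]

# toy explicit step u <- u - dt * dH/du (a gradient flow; a symplectic form would replace the minus sign)
H = HamiltonianFunctional()
u = torch.randn(64)
u_next = u - 0.01 * functional_derivative(H, u)
```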
Authors:Jieying Xue, Phuong Minh Nguyen, Minh Le Nguyen, Xin Liu
Abstract:
With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the base method, which maps an input directly to all its corresponding emotion labels, and the pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness. Our code is available at https://github.com/yingjie7/mlingual_multilabel_emo_detection.
中文摘要:本研究通过采用预训练模型和创新分类方法应对多语言多标签情绪检测挑战,在SemEval-2025任务11中于多语言环境下取得领先性能。
English Summary: This study tackles multilingual multi-label emotion detection by employing pre-trained models and innovative classification methods, achieving top performance across multiple languages in SemEval-2025 Task 11.
Authors:Francesco Innocenti, El Mehdi Achour, Christopher L. Buckley
Abstract:
The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024). Here, we show that 100+ layer PCNs can be trained reliably using a Depth-$μ$P parameterisation (Yang et al., 2023; Bordelon et al., 2023) which we call "$μ$PC". Through an extensive analysis of the scaling behaviour of PCNs, we reveal several pathologies that make standard PCNs difficult to train at large depths. We then show that, despite addressing only some of these instabilities, $μ$PC allows stable training of very deep (up to 128-layer) residual networks on simple classification tasks with competitive performance and little tuning compared to current benchmarks. Moreover, $μ$PC enables zero-shot transfer of both weight and activity learning rates across widths and depths. Our results have implications for other local algorithms and could be extended to convolutional and transformer architectures. Code for $μ$PC is made available as part of a JAX library for PCNs at https://github.com/thebuckleylab/jpc (Innocenti et al., 2024).
中文: 本研究提出$μ$PC方法,采用深度-$μ$P参数化技术,成功解决了预测编码网络在深层训练中的不稳定问题,实现了超过100层网络的稳定训练,并在分类任务中取得具有竞争力的性能且无需大量调参。
English: The study introduces $μ$PC, a method using Depth-$μ$P parameterization that enables stable training of over 100-layer predictive coding networks, addressing instabilities and achieving competitive performance on classification tasks with minimal tuning.
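The Depth-$μ$P idea that $μ$PC builds on rescales residual branches so that activations stay well behaved as depth grows. A minimal sketch for a plain residual MLP (not a predictive coding network) follows, with the 1/sqrt(L) branch scaling as the assumed rule:

```python
import torch
import torch.nn as nn

class DepthMuPResidualNet(nn.Module):
    """Residual stack with 1/sqrt(L)-scaled branches (Depth-muP-style sketch, not the paper's PCN)."""
    def __init__(self, width: int = 128, depth: int = 128):
        super().__init__()
        self.depth = depth
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.depth ** -0.5              # keeps the residual stream O(1) as depth grows
        for block in self.blocks:
            x = x + scale * block(x)
        return x

x = torch.randn(4, 128)
print(DepthMuPResidualNet()(x).std())           # activations remain well scaled even at 128 layers
```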
Authors:Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao
Abstract:
To forecast traffic with both spatial and temporal dimensions, we unroll a mixed-graph-based optimization algorithm into a lightweight and interpretable transformer-like neural net. Specifically, we construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We formulate a prediction problem for the future samples of signal $\mathbf{x}$, assuming it is "smooth" with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We construct an iterative algorithm based on the alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$, which are akin to the self-attention mechanism in classical transformers. Experiments show that our unrolled networks achieve traffic forecasting performance competitive with state-of-the-art prediction schemes, while reducing parameter counts drastically. Our code is available at https://github.com/SingularityUndefined/Unrolling-GSP-STForecast.
Chinese: 我们通过展开混合图优化算法,构建了一个轻量级、可解释的类Transformer神经网络,用于捕捉交通预测中的时空依赖关系,在显著减少参数的同时实现了与先进方法相媲美的性能。
English: We develop a lightweight, interpretable transformer-like neural network by unrolling a mixed-graph optimization algorithm that captures spatial and temporal dependencies for traffic forecasting, achieving competitive performance with significantly fewer parameters.
Authors:Hongrui Kou, Jingkai Li, Ziyu Wang, Zhouhang Lv, Yuxin Zhang, Cheng Wang
Abstract:
Accurate prediction of traffic flow parameters and real-time identification of congestion states are essential for the efficient operation of intelligent transportation systems. This paper proposes a Periodic Pattern Transformer Network (PPTNet) for traffic flow prediction, integrating periodic pattern extraction with the Transformer architecture, coupled with a fuzzy inference method for real-time congestion identification. Firstly, a high-precision traffic flow dataset (Traffic Flow Dataset for China's Congested Highways and Expressways, TF4CHE) suitable for congested highway scenarios in China is constructed based on drone aerial imagery data. Subsequently, the proposed PPTNet employs the Fast Fourier Transform to capture multi-scale periodic patterns and utilizes two-dimensional Inception convolutions to efficiently extract intra- and inter-periodic features. A Transformer decoder dynamically models temporal dependencies, enabling accurate predictions of traffic density and speed. Finally, congestion probabilities are calculated in real time using the predicted outcomes via a Mamdani fuzzy inference-based congestion identification module. Experimental results demonstrate that the proposed PPTNet significantly outperforms mainstream traffic prediction methods in prediction accuracy, and the congestion identification module effectively identifies real-time road congestion states, verifying the superiority and practicality of the proposed method in real-world traffic scenarios. Project page: https://github.com/ADSafetyJointLab/PPTNet.
中文摘要:本文提出PPTNet模型,结合周期模式提取与Transformer架构进行交通流预测,并采用模糊推理方法实时识别拥堵状态,在中国高速公路数据集上验证了其优越性能。
English Summary: This paper introduces PPTNet, a model combining periodic pattern analysis with Transformer architecture for accurate traffic flow prediction and real-time congestion identification using fuzzy inference, validated on a specialized Chinese highway dataset.
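The FFT-based periodic pattern extraction step can be sketched as below, following a TimesNet-style dominant-period selection; this is an assumed reading of the abstract, not the released implementation.

```python
import torch

def dominant_periods(x: torch.Tensor, k: int = 3):
    """x: (batch, time, channels). Return the k strongest periods and their amplitudes."""
    spectrum = torch.fft.rfft(x, dim=1).abs().mean(dim=(0, 2))   # average amplitude per frequency
    spectrum[0] = 0.0                                            # drop the DC component
    amps, freqs = torch.topk(spectrum, k)
    periods = x.shape[1] // freqs                                # frequency index -> period length
    return periods, amps

# toy signal with a period of 24 samples
t = torch.arange(96, dtype=torch.float32)
x = torch.sin(2 * torch.pi * t / 24).reshape(1, 96, 1)
print(dominant_periods(x, k=1))                                  # expected to recover a period of ~24
```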
Authors:Federico Del Pup, Andrea Zanola, Louis Fabrice Tshimanga, Alessandra Bertoldo, Livio Finos, Manfredo Atzori
Abstract:
Deep learning is significantly advancing the analysis of electroencephalography (EEG) data by effectively discovering highly nonlinear patterns within the signals. Data partitioning and cross-validation are crucial for assessing model performance and ensuring study comparability, as they can produce varied results and data leakage due to specific signal properties (e.g., biometric). Such variability leads to incomparable studies and, increasingly, overestimated performance claims, which are detrimental to the field. Nevertheless, no comprehensive guidelines for proper data partitioning and cross-validation exist in the domain, nor is there a quantitative evaluation of their impact on model accuracy, reliability, and generalizability. To assist researchers in identifying optimal experimental strategies, this paper thoroughly investigates the role of data partitioning and cross-validation in evaluating EEG deep learning models. Five cross-validation settings are compared across three supervised cross-subject classification tasks (BCI, Parkinson's, and Alzheimer's disease detection) and four established architectures of increasing complexity (ShallowConvNet, EEGNet, DeepConvNet, and Temporal-based ResNet). The comparison of over 100,000 trained models underscores, first, the importance of using subject-based cross-validation strategies for evaluating EEG deep learning models, except when within-subject analyses are acceptable (e.g., BCI). Second, it highlights the greater reliability of nested approaches (N-LNSO) compared to non-nested counterparts, which are prone to data leakage and favor larger models overfitting to validation data. In conclusion, this work provides EEG deep learning researchers with an analysis of data partitioning and cross-validation and offers guidelines to avoid data leakage, currently undermining the domain with potentially overestimated performance claims.
中文: 深度学习通过识别复杂模式提升了脑电图分析能力,但数据划分和交叉验证方法对模型可靠性及泛化性影响显著,建议采用基于受试者的嵌套验证策略以避免数据泄露和性能高估问题。
English: Deep learning enhances EEG analysis by detecting complex patterns, but data partitioning and cross-validation methods significantly impact model reliability and generalizability, with subject-based and nested approaches recommended to prevent data leakage and overestimated performance.
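The recommendation above, namely evaluating cross-subject EEG models with subject-based splits so that no subject appears in both training and test folds, corresponds to grouped cross-validation. A minimal scikit-learn sketch with synthetic data is shown below (the nested N-LNSO procedure is omitted):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))            # 200 EEG epochs, 32 features each (synthetic)
y = rng.integers(0, 2, size=200)          # binary labels (e.g., patient vs. control)
subjects = np.repeat(np.arange(20), 10)   # 20 subjects, 10 epochs per subject

cv = GroupKFold(n_splits=5)               # folds never split a subject across train and test
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, groups=subjects)
print(scores.mean())
```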
Authors:Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
Abstract:
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
中文:Fractured Sampling是一种创新的推理时策略,通过调整推理轨迹、截断深度和解决方案数量,在无需重新训练大语言模型的情况下,优化了推理准确性与计算成本之间的平衡,实现了更高的效率。
English: Fractured Sampling is a new inference-time strategy that optimizes the trade-off between reasoning accuracy and computational cost by adjusting reasoning trajectories, truncation depth, and solution counts, achieving superior efficiency without retraining LLMs.
Authors:Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu
Abstract:
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
中文摘要:本研究针对强化学习中低概率词元梯度干扰问题,提出了优势重加权和Lopti两种方法,有效平衡不同概率词元的参数更新,使大语言模型在推理任务中的性能提升高达46.2%。
English Summary: This research addresses the imbalance in reinforcement learning for large language models by introducing Advantage Reweighting and Lopti methods, which reduce gradient interference from low-probability tokens and enhance learning from high-probability ones, achieving up to 46.2% improvement in reasoning tasks.
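A hedged sketch of the Advantage Reweighting idea described above: scale each token's advantage by a function of its probability so that low-probability tokens contribute smaller updates. The specific weighting function below is an assumption; the paper's exact formulation may differ.

```python
import torch

def reweighted_policy_loss(logprobs: torch.Tensor, advantages: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """logprobs, advantages: (batch, seq_len) for the sampled tokens."""
    probs = logprobs.detach().exp()
    weights = alpha * probs + (1.0 - alpha)       # low-probability tokens get down-weighted advantages (assumed form)
    return -(weights * advantages * logprobs).mean()

# toy usage with random token log-probabilities and advantages
logprobs = torch.log_softmax(torch.randn(2, 5, 10), dim=-1).max(dim=-1).values
loss = reweighted_policy_loss(logprobs, torch.randn(2, 5))
```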
Authors:Shengsheng Lin, Haojun Chen, Haijie Wu, Chunyun Qiu, Weiwei Lin
Abstract:
Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique employs periodically shifted learnable vectors as queries in the attention mechanism to capture global inter-variable patterns, while the keys and values are derived from the raw input data to encode local, sample-level correlations. Building upon the TQ technique, we develop a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP). Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Furthermore, TQNet achieves high efficiency comparable to linear-based methods even on high-dimensional datasets, balancing performance and computational cost. The code is available at: https://github.com/ACAT-SCUT/TQNet.
中文: 本文提出了一种名为时序查询(TQ)的新技术,通过可学习向量的周期性偏移来捕捉全局多变量相关性,并基于此构建了TQNet模型,在多个数据集上实现了顶尖的预测精度和高效性能。
English: The paper introduces Temporal Query (TQ), a technique that uses shifted learnable vectors in attention mechanisms to capture global multivariate correlations, leading to the development of TQNet, which achieves state-of-the-art forecasting accuracy and high efficiency across multiple datasets.
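A compact sketch of the Temporal Query mechanism as described: periodically shifted learnable vectors act as attention queries, while keys and values come from the raw input. Dimensions, the shift rule, and the projections are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalQueryAttention(nn.Module):
    def __init__(self, n_vars: int, d_model: int, period: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(period, d_model))   # learnable global queries
        self.period = period
        self.key = nn.Linear(n_vars, d_model)
        self.value = nn.Linear(n_vars, d_model)

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        """x: (batch, time, n_vars); step selects the periodic shift of the queries."""
        q = torch.roll(self.queries, shifts=step % self.period, dims=0)   # periodically shifted queries
        k, v = self.key(x), self.value(x)                                 # local keys/values from raw input
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                                   # (batch, period, d_model)

out = TemporalQueryAttention(n_vars=7, d_model=32, period=24)(torch.randn(8, 96, 7), step=5)
```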
Authors:Xiao Wang, Yu Jin, Lan Chen, Bo Jiang, Lin Zhu, Yonghong Tian, Jin Tang, Bin Luo
Abstract:
Event-based Vision Sensors (EVS) have demonstrated significant advantages over traditional RGB frame-based cameras in low-light conditions, high-speed motion capture, and low latency. Consequently, object detection based on EVS has attracted increasing attention from researchers. Current event stream object detection algorithms are typically built upon Convolutional Neural Networks (CNNs) or Transformers, which either capture limited local features using convolutional filters or incur high computational costs due to the utilization of self-attention. Recently proposed vision heat conduction backbone networks have shown a good balance between efficiency and accuracy; however, these models are not specifically designed for event stream data. They exhibit weak capability in modeling object contour information and fail to exploit the benefits of multi-scale features. To address these issues, this paper proposes a novel dynamic graph induced contour-aware heat conduction network for event stream based object detection, termed CvHeat-DET. The proposed model effectively leverages the clear contour information inherent in event streams to predict the thermal diffusivity coefficients within the heat conduction model, and integrates hierarchical structural graph features to enhance feature learning across multiple scales. Extensive experiments on three benchmark datasets for event stream-based object detection fully validated the effectiveness of the proposed model. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvDET.
中文: 本文提出CvHeat-DET动态图轮廓感知热传导网络,通过有效利用事件流固有轮廓信息并整合多尺度层次特征,显著提升了事件流目标检测性能,在三个基准数据集上得到充分验证。
English: This paper introduces CvHeat-DET, a novel dynamic graph-based contour-aware heat conduction network that enhances event stream object detection by leveraging inherent contour information and multi-scale hierarchical features, achieving superior performance across three benchmark datasets.
Authors:Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang
Abstract:
System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Humans conduct System 2 reasoning via a language of thoughts that organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural language. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily absorb language biases into LLMs that deviate from the chain of thoughts in human minds. Furthermore, we show that these biases mislead the elicitation of "thoughts" in LLMs to focus only on a biased part of the premise. To this end, we propose a new prompt technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and tokens used for the expressions of all the relevant information. We show that this simple strategy significantly reduces the language modeling biases in LLMs and improves the performance of LLMs across a variety of reasoning tasks.
中文: 本研究揭示了大型语言模型在语言与思维建模之间存在偏差,提出“思维语言”提示技术以减少语言模型偏见,从而提升各类推理任务的性能。
English: This study reveals that large language models (LLMs) exhibit a gap between language and thought modeling, leading to biased reasoning, and proposes a Language-of-Thoughts (LoT) prompting technique to reduce biases and enhance performance across reasoning tasks.
Authors:Zheng Wu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Zhuosheng Zhang
Abstract:
Graphical user interface (GUI) agents have recently emerged as an intriguing paradigm for human-computer interaction, capable of automatically executing user instructions to operate intelligent terminal devices. However, when encountering out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of agents, GUI agents may suffer task breakdowns or even pose security threats. Therefore, effective OOD detection for GUI agents is essential. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. In this work, we observe that the in-distribution input semantic space of GUI agents exhibits a clustering pattern with respect to the distance from the centroid. Based on the finding, we propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent that reflect its capability boundary. Evaluated on eight datasets spanning smartphones, computers, and web browsers, our method achieves an average accuracy improvement of 23.70\% over the best-performing baseline while only increasing training time by 4.9\% and testing time by 6.5\%. We also experimentally demonstrate that GEM can improve the step-wise success rate by 9.40\% by requesting assistance from the cloud model when encountering OOD samples. Analysis verifies the generalization ability of our method through experiments on nine different backbones. The codes are available at https://github.com/Wuzheng02/GEM-OODforGUIagents.
中文: 图形用户界面(GUI)代理在处理超出分布(OOD)指令时易出现任务失败或安全威胁,而提出的GEM方法通过高斯混合模型有效提升检测精度和步骤成功率,适用于多种设备环境。
English: GUI agents face challenges with out-of-distribution (OOD) instructions, leading to task failures or security risks, but the proposed GEM method significantly improves OOD detection accuracy and step-wise success rates across various devices.
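The detection rule described above (fit a Gaussian mixture over distances between in-distribution embeddings and their centroid, then flag low-likelihood inputs as OOD) can be sketched with scikit-learn; the number of components and the threshold quantile are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gem(train_emb: np.ndarray, n_components: int = 3):
    centroid = train_emb.mean(axis=0)
    dists = np.linalg.norm(train_emb - centroid, axis=1, keepdims=True)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(dists)
    threshold = np.quantile(gmm.score_samples(dists), 0.05)   # 5% of ID data falls below (assumed rule)
    return centroid, gmm, threshold

def is_ood(emb: np.ndarray, centroid, gmm, threshold) -> np.ndarray:
    d = np.linalg.norm(emb - centroid, axis=1, keepdims=True)
    return gmm.score_samples(d) < threshold                    # low likelihood -> out of distribution

rng = np.random.default_rng(0)
centroid, gmm, thr = fit_gem(rng.normal(size=(500, 64)))       # stand-in for GUI-agent input embeddings
print(is_ood(rng.normal(loc=3.0, size=(5, 64)), centroid, gmm, thr))
```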
Authors:Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu
Abstract:
The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (e.g., MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (e.g., Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (e.g., LLM-as-a-judge) have improved scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias by democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. In extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing the cost. Our code and data will be publicly released on https://github.com/maitrix-org/de-arena.
Chinese Summary: 去中心化竞技场(dearena)是一个完全自动化的框架,利用所有大语言模型的集体智慧进行相互评估,通过高效的排名算法和问题选择策略,在显著降低成本的同时,实现了与人类判断高达97%的相关性。
English Summary: The Decentralized Arena (dearena) is a fully automated framework that leverages collective intelligence from all large language models (LLMs) to evaluate each other democratically, achieving up to 97% correlation with human judgments while significantly reducing costs through efficient ranking algorithms and question selection strategies.
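The coarse-to-fine, sub-quadratic insertion of new models can be pictured as a binary search over the current ranking using pairwise judgments. The `judge` callback below is a hypothetical stand-in for dearena's democratic pairwise evaluation.

```python
def insert_model(ranking: list[str], new_model: str, judge) -> list[str]:
    """Binary-search the new model's position; judge(a, b) -> True if a should rank above b."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if judge(new_model, ranking[mid]):   # new model beats the incumbent at position mid
            hi = mid
        else:
            lo = mid + 1
    return ranking[:lo] + [new_model] + ranking[lo:]

# toy judge: rank by hypothetical scores, standing in for aggregated pairwise LLM votes
scores = {"model-a": 0.9, "model-b": 0.7, "model-c": 0.4, "model-new": 0.8}
ranking = ["model-a", "model-b", "model-c"]
print(insert_model(ranking, "model-new", lambda a, b: scores[a] > scores[b]))
# ['model-a', 'model-new', 'model-b', 'model-c'] with O(log n) comparisons per insertion
```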
Authors:Yiling Tao, Shuyi Wang, Jiaxi Yang, Guido Zuccon
Abstract:
This paper reports on findings from a comparative study on the effectiveness and efficiency of federated unlearning strategies within Federated Online Learning to Rank (FOLTR), with specific attention to systematically analysing the unlearning capabilities of methods in a verifiable manner.
Federated approaches to ranking of search results have recently garnered attention to address users' privacy concerns. In FOLTR, privacy is safeguarded by collaboratively training ranking models across decentralized data sources, preserving individual user data while optimizing search results based on implicit feedback, such as clicks.
Recent legislation introduced across numerous countries is establishing the so-called "right to be forgotten", according to which services based on machine learning models like those in FOLTR should provide capabilities that allow users to remove their own data from those used to train models. This has sparked the development of unlearning methods, along with evaluation practices to measure whether unlearning of a user's data successfully occurred. Current evaluation practices are however often controversial, necessitating the use of multiple metrics for a more comprehensive assessment -- but previous proposals of unlearning methods only used a single evaluation metric.
This paper addresses this limitation: our study rigorously assesses the effectiveness of unlearning strategies in managing both under-unlearning and over-unlearning scenarios using adapted and newly proposed evaluation metrics. Thanks to our detailed analysis, we uncover the strengths and limitations of five unlearning strategies, offering valuable insights into optimizing federated unlearning to balance data privacy and system performance within FOLTR. We publicly release our code and complete results at https://github.com/Iris1026/Unlearning-for-FOLTR.git.
中文: 本研究通过多种评估指标对联邦在线排序学习中的联邦遗忘策略进行系统评估,揭示了五种方法在平衡数据隐私与系统性能方面的优缺点,弥补了现有评估方法的不足。
English: This study evaluates federated unlearning strategies in Federated Online Learning to Rank using multiple metrics to address limitations in existing evaluation methods, revealing the strengths and weaknesses of five approaches for balancing data privacy and system performance.
Authors:Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu, Linxi Fan
Abstract:
We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories - synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at https://github.com/NVIDIA/GR00T-Dreams.
中文: DreamGen采用四阶段流程,通过视频世界模型生成的神经轨迹训练机器人策略,使人形机器人仅需单一任务数据即可泛化至22种新行为和多类环境。
English: DreamGen is a four-stage pipeline that uses neural trajectories from video world models to train robot policies, enabling a humanoid robot to generalize across 22 new behaviors and environments with minimal initial data.
Authors:Mingyuan Zhou, Yi Gu, Zhendong Wang
Abstract:
Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) -- a data-free, one-step distillation framework -- for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images. Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at https://github.com/mingyuanzhou/SiD-LSG.
中文摘要:SiD提出了一种无需真实数据、支持少步生成的蒸馏框架,通过创新性引导策略解决了高分辨率文生图模型中的对齐与多样性权衡问题,在不依赖真实图像的情况下实现了顶尖性能。
English Summary: SiD introduces a data-free, few-step distillation framework that overcomes the alignment-diversity trade-off in high-resolution text-to-image generation by employing novel guidance strategies, achieving state-of-the-art performance without relying on real images.
Authors:Wanfu Gao, Zengyao Man, Hanlin Pan, Kunpeng Liu
Abstract:
Feature generation involves creating new features from raw data to capture complex relationships among the original features, improving model robustness and machine learning performance. Current methods using reinforcement learning for feature generation have made feature exploration more flexible and efficient. However, several challenges remain: first, during feature expansion, a large number of redundant features are generated. When removing them, current methods only retain the best features each round, neglecting those that perform poorly initially but could improve later. Second, the state representation used by current methods fails to fully capture complex feature relationships. Third, there are significant differences between discrete and continuous features in tabular data, requiring different operations for each type. To address these challenges, we propose a novel dual-agent reinforcement learning method for feature generation. Two agents are designed: the first generates new features, and the second determines whether they should be preserved. A self-attention mechanism enhances state representation, and diverse operations distinguish interactions between discrete and continuous features. The experimental results on multiple datasets demonstrate that the proposed method is effective. The code is available at https://github.com/extess0/DARL.
中文摘要:该摘要提出了一种双智能体强化学习方法,通过自注意力机制增强状态表示,并针对离散和连续特征采用不同操作,有效解决了现有方法中特征冗余和关系捕捉不足的问题。
English Summary: This abstract introduces a dual-agent reinforcement learning method that generates and selects features using a self-attention mechanism and distinct operations for discrete and continuous features, effectively addressing redundancy and relationship capture issues in current approaches.
Authors:Sanggeon Yun, Ryozo Masukawa, Hyunwoo Oh, Nathaniel D. Bastian, Mohsen Imani
Abstract:
Deep neural networks (DNNs) are highly susceptible to adversarial examples--subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, heavy augmentations, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations typically induce large representation shifts in a small subset of layers. Building on this, we propose two complementary strategies--Recovery Testing (RT) and Logit-layer Testing (LT)--to expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead and no compromise to clean accuracy. The code is available here: https://github.com/c0510gy/AFLS-AED.
Chinese: 本文提出了一种轻量级的即插即用检测框架,通过仅使用良性数据测量深度神经网络内部层间不一致性来识别对抗样本,以极低计算成本实现了最先进的检测性能。
English: This paper introduces a lightweight, plug-in detection framework that identifies adversarial examples by measuring internal layer-wise inconsistencies in deep neural networks using only benign data, achieving state-of-the-art performance with minimal computational cost.
Authors:Florent Chiaroni, Ali Ayub, Ola Ahmad
Abstract:
In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotation of even a small number of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: https://github.com/ThalesGroup/promi.
Chinese: 本文提出了一种名为ProMi的新型小样本二元分割方法,它采用边界框标注而非像素级标签,在多个数据集上取得了最佳效果,并展示了在机器人实际应用中的有效性。
English: This paper introduces ProMi, a novel few-shot binary segmentation method that uses bounding-box annotations instead of pixel-level labels, achieving superior results across datasets and demonstrating real-world applicability in robotics.
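A hedged sketch of the prototype-mixture idea: the foreground prototype is the mean feature inside the support bounding box, the background is modeled as a mixture of prototypes (here via k-means on features outside the box), and query pixels are labeled by nearest-prototype cosine similarity. The feature maps and the number of background prototypes are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def promi_segment(support_feats, support_box_mask, query_feats, n_bg: int = 3):
    """support_feats, query_feats: (H, W, C) feature maps; support_box_mask: (H, W) bool from the bbox."""
    fg_proto = support_feats[support_box_mask].mean(axis=0, keepdims=True)      # single foreground prototype
    bg_protos = KMeans(n_clusters=n_bg, n_init=10, random_state=0).fit(
        support_feats[~support_box_mask]).cluster_centers_                      # background as a mixture
    protos = np.concatenate([fg_proto, bg_protos], axis=0)
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)

    q = query_feats.reshape(-1, query_feats.shape[-1])
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    nearest = (q @ protos.T).argmax(axis=1)                                     # cosine-nearest prototype
    return (nearest == 0).reshape(query_feats.shape[:2])                        # True where foreground wins

rng = np.random.default_rng(0)
mask = np.zeros((16, 16), dtype=bool); mask[4:12, 4:12] = True
pred = promi_segment(rng.normal(size=(16, 16, 8)), mask, rng.normal(size=(16, 16, 8)))
```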
Authors:Zongkai Liu, Fanqing Meng, Lingxiao Du, Zhixiang Zhou, Chao Yu, Wenqi Shao, Qiaosheng Zhang
Abstract:
Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
Chinese: 本文提出CPGD算法,通过基于KL散度的策略漂移约束和比率对数裁剪机制,有效稳定语言模型的强化学习训练过程,在保持训练稳定性的同时显著提升性能表现。
English: This paper introduces CPGD, a novel reinforcement learning algorithm that stabilizes policy training in language models by dynamically constraining policy drift with KL divergence and preventing excessive updates through a clipping mechanism, outperforming prior methods in both stability and performance.
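A hedged sketch of the CPGD objective as described: clip the logarithm of the probability ratio rather than the ratio itself, and add a KL-based policy-drift penalty. The coefficients, the pessimistic min, and the KL estimator are assumptions, not the released implementation.

```python
import torch

def cpgd_loss(logp_new, logp_old, advantages, clip: float = 0.2, drift_coef: float = 0.1):
    """All tensors: (batch, seq_len) for sampled tokens; logp_old comes from the rollout policy."""
    log_ratio = logp_new - logp_old.detach()
    clipped = torch.clamp(log_ratio, -clip, clip)                              # clip the *log* of the ratio
    surrogate = torch.minimum(log_ratio * advantages, clipped * advantages)    # PPO-style pessimistic choice (assumed)
    drift = (logp_old.detach() - logp_new).mean()                              # sample estimate of KL(old || new)
    return -surrogate.mean() + drift_coef * drift

# toy usage with random token log-probabilities and advantages
loss = cpgd_loss(torch.log_softmax(torch.randn(2, 8, 5), -1).amax(-1),
                 torch.log_softmax(torch.randn(2, 8, 5), -1).amax(-1),
                 torch.randn(2, 8))
```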
Authors:Zirun Guo, Minjie Hong, Tao Jin
Abstract:
Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspirations from human learning progression--from simple to complex and easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.
中文: 本文提出Observe-R1框架,通过渐进式学习和多模态约束增强大语言模型的多模态推理能力,在多项基准测试中实现更优性能。
English: This paper introduces Observe-R1, a framework that enhances multimodal reasoning in large language models through progressive learning and multimodal constraints, achieving superior performance on reasoning benchmarks.
Authors:Emanuele La Malfa, Jon Vadillo, Marco Molinari, Michael Wooldridge
Abstract:
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.
Chinese Summary: 本文提出固定点解释的形式化概念,用以评估模型与解释器之间相互作用的稳定性,为多种解释器类型建立了收敛条件并展示了实证结果。
English Summary: This paper proposes a formal concept of fixed point explanations to evaluate the interaction stability between models and explainers, establishing convergence criteria for various explainer types and presenting empirical findings.
Authors:Yang Hu, Xingyu Zhang, Xueji Fang, Zhiyang Chen, Xiao Wang, Huatian Zhang, Guojun Qi
Abstract:
We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to more accurately respond to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performances on those not well represented among general samples. To address this, SLOT conducts few optimization steps at test-time to update a light-weight sample-specific parameter vector. It is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better aligned with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.
中文: SLOT是一种参数高效的测试时优化方法,通过更新轻量级参数向量来提升语言模型对单个提示的响应准确性,在多个基准测试中实现了显著的性能提升。
English: SLOT is a parameter-efficient test-time optimization method that enhances language models' accuracy on individual prompts by updating a lightweight parameter vector, achieving significant performance gains across multiple benchmarks.
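A minimal sketch of SLOT's test-time step as described: a sample-specific vector delta is added to the cached final hidden states before the output head and optimized for a few steps to minimize cross-entropy on the prompt itself. The linear head and random features below are stand-ins for a real LLM.

```python
import torch
import torch.nn.functional as F

def slot_adapt(hidden, lm_head, prompt_ids, steps: int = 3, lr: float = 0.01):
    """hidden: cached last-layer features (1, seq_len, d); lm_head: nn.Linear(d, vocab); prompt_ids: (1, seq_len)."""
    delta = torch.zeros(1, 1, hidden.shape[-1], requires_grad=True)     # sample-specific parameter vector
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):
        logits = lm_head(hidden + delta)                                # add delta before the output head
        loss = F.cross_entropy(logits[:, :-1].flatten(0, 1), prompt_ids[:, 1:].flatten())
        opt.zero_grad(); loss.backward(); opt.step()                    # a few optimization steps on the prompt only
    return delta.detach()                                               # reuse delta when decoding this prompt

# toy usage with random "features" standing in for a cached transformer layer
d, vocab = 32, 100
delta = slot_adapt(torch.randn(1, 10, d), torch.nn.Linear(d, vocab), torch.randint(0, vocab, (1, 10)))
```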
Authors:Quanjiang Guo, Jinchuan Zhang, Sijie Wang, Ling Tian, Zhao Kang, Bin Yan, Weidong Xiao
Abstract:
Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released at https://github.com/UESTC-GQJ/TKRE.
中文: 提出的TKRE框架通过解释驱动的知识生成和两阶段预训练策略,将大语言模型与传统关系抽取技术相结合,有效解决数据稀缺问题并提升泛化能力,在小样本关系抽取任务中实现了最先进的性能。
English: The proposed TKRE framework integrates large language models with traditional relation extraction techniques through explanation-driven knowledge generation and a two-stage pre-training strategy, achieving state-of-the-art performance in Few-Shot Relation Extraction by effectively addressing data scarcity and enhancing generalization.
Authors:Riad Hossain, Muhammad Ashad Kabir, Arat Ibne Golam Mowla, Animesh Chandra Roy, Ranjit Kumar Ghosh
Abstract:
Parkinson's disease (PD) poses a growing global health challenge, with Bangladesh experiencing a notable rise in PD-related mortality. Early detection of PD remains particularly challenging in resource-constrained settings, where voice-based analysis has emerged as a promising non-invasive and cost-effective alternative. However, existing studies predominantly focus on English or other major languages; notably, no voice dataset for PD exists for Bengali - posing a significant barrier to culturally inclusive and accessible healthcare solutions. Moreover, most prior studies employed only a narrow set of acoustic features, with limited or no hyperparameter tuning and feature selection strategies, and little attention to model explainability. This restricts the development of a robust and generalizable machine learning model. To address this gap, we present BenSparX, the first Bengali conversational speech dataset for PD detection, along with a robust and explainable machine learning framework tailored for early diagnosis. The proposed framework incorporates diverse acoustic feature categories, systematic feature selection methods, and state-of-the-art machine learning algorithms with extensive hyperparameter optimization. Furthermore, to enhance interpretability and trust in model predictions, the framework incorporates SHAP (SHapley Additive exPlanations) analysis to quantify the contribution of individual acoustic features toward PD detection. Our framework achieves state-of-the-art performance, yielding an accuracy of 95.77%, F1 score of 95.57%, and AUC-ROC of 0.982. We further externally validated our approach by applying the framework to existing PD datasets in other languages, where it consistently outperforms state-of-the-art approaches. To facilitate further research and reproducibility, the dataset has been made publicly available at https://github.com/Riad071/BenSParX.
中文: 本研究推出了首个用于帕金森病检测的孟加拉语语音数据集BenSparX,并提出了一种鲁棒的机器学习框架,通过全面的特征工程和可解释AI技术实现了最优性能。
English: This study introduces BenSparX, the first Bengali speech dataset for Parkinson's disease detection, and presents a robust machine learning framework that achieves state-of-the-art performance through comprehensive feature engineering and explainable AI techniques.
Authors:Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero
Abstract:
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
Chinese: 本研究提出了一种自监督框架,通过采用基于去噪器训练的课程学习和教师引导正则化,增强了噪声鲁棒性表征学习,在无需推理阶段去噪器的情况下显著提升了模型在噪声数据上的性能。
English: This study introduces a self-supervised framework that enhances noise-robust representation learning by employing a denoiser-trained curriculum and teacher-guided regularization, eliminating the need for denoisers during inference while significantly improving model performance on noisy data.
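The teacher-guided regularization described above can be sketched as an extra anchoring term on top of the SSL objective. The code below is a simplified reading of the abstract rather than the released training loop; the weight `lam` and the plain MSE anchor are assumptions.

```python
import torch
import torch.nn.functional as F

def noise_anchored_loss(student, teacher, noisy_batch, denoised_batch, ssl_loss, lam=1.0):
    """Sketch: anchor embeddings of noisy samples to the (frozen) teacher's
    embeddings of their denoised counterparts, on top of the usual SSL loss."""
    z_noisy = student(noisy_batch)
    with torch.no_grad():
        z_denoised = teacher(denoised_batch)     # teacher sees the denoised view
    anchor = F.mse_loss(z_noisy, z_denoised)     # pull noisy embeddings toward denoised ones
    return ssl_loss + lam * anchor
```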
Authors:Tiannuo Yang, Zebin Yao, Bowen Jin, Lixiao Cui, Yusen Li, Gang Wang, Xiaoguang Liu
Abstract:
Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4$\times$ higher throughput and 5$\times$ lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.
中文: 基于大语言模型的搜索代理因检索方法和系统设计存在效率瓶颈,而SearchAgent-X通过高召回近似检索、优先级调度和无中断检索技术,在不影响生成质量的前提下大幅提升了吞吐量并降低了延迟。
English: LLM-based search agents face efficiency issues from retrieval methods and system design, which SearchAgent-X addresses using high-recall retrieval, priority-aware scheduling, and non-stall retrieval to significantly boost throughput and reduce latency without sacrificing quality.
Authors:Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, Bo Han
Abstract:
Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, the exact functionalities of these reweighting strategies remain unclear and the optimal strategy is still an open question, thus impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely, Saturation and Importance -- the former indicates that those insufficiently optimized data should be emphasized, while the latter stresses some critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research. Our code is available at https://github.com/tmlr-group/SatImp.
中文: 损失重加权在大型语言模型的机器遗忘中通过饱和度和重要性两个目标得以阐明,其中饱和度更有效,两者结合的SatImp方法展现出更优性能。
English: Loss reweighting in machine unlearning for LLMs is clarified through two goals—Saturation and Importance—with Saturation proving more effective, and their combination in the SatImp method showing improved performance.
Authors:Zhiheng Chen, Ruofan Wu, Guanhua Fang
Abstract:
The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the understanding of pre-trained large language models. However, most recent works have been focusing on studying supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem, through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, while exhibiting reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.
中文: 本文提出TGMM这一基于Transformer的框架,能够有效学习解决多个高斯混合模型任务,在超越传统方法性能的同时,理论上证明了其可近似EM算法和谱方法的核心组件。
English: This paper introduces TGMM, a transformer-based framework that effectively learns to solve multiple Gaussian Mixture Models, demonstrating superior performance over classical methods while providing theoretical guarantees of approximating both EM algorithms and spectral methods.
Authors:Shiming Chen, Dingjie Fu, Salman Khan, Fahad Shahbaz Khan
Abstract:
Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than $60\times$ faster training speed on AWA2. Codes are available at https://github.com/shiming-chen/GenZSL.
中文摘要:提出的GenZSL模型通过使用弱语义向量从相似已见类别中归纳新类样本,推进零样本学习,并采用类别多样性促进和目标类别引导优化策略,在效能和效率上实现显著提升。
English Summary: The proposed GenZSL model advances zero-shot learning by generating new class samples from similar seen classes using weak semantic vectors, enhancing diversity and optimizing performance with significant gains in efficacy and efficiency.
Authors:Chicago Y. Park, Shirin Shoushtari, Hongyu An, Ulugbek S. Kamilov
Abstract:
Diffusion models are widely used in applications ranging from image generation to inverse problems. However, training diffusion models typically requires clean ground-truth images, which are unavailable in many applications. We introduce the Measurement Score-based diffusion Model (MSM), a novel framework that learns partial measurement scores using only noisy and subsampled measurements. MSM models the distribution of full measurements as an expectation over partial scores induced by randomized subsampling. To make the MSM representation computationally efficient, we also develop a stochastic sampling algorithm that generates full images by using a randomly selected subset of partial scores at each step. We additionally propose a new posterior sampling method for solving inverse problems that reconstructs images using these partial scores. We provide a theoretical analysis that bounds the Kullback-Leibler divergence between the distributions induced by full and stochastic sampling, establishing the accuracy of the proposed algorithm. We demonstrate the effectiveness of MSM on natural images and multi-coil MRI, showing that it can generate high-quality images and solve inverse problems -- all without access to clean training data. Code is available at https://github.com/wustl-cig/MSM.
Chinese: 测量分数扩散模型(MSM)通过创新的部分分数学习和高效随机采样算法,仅使用含噪声的欠采样测量数据即可实现高质量图像生成和逆问题求解,无需依赖干净训练数据。
English: The Measurement Score-based diffusion Model (MSM) enables high-quality image generation and inverse problem solving using only noisy, subsampled measurements, eliminating the need for clean training data through its innovative partial score learning and efficient stochastic sampling algorithm.
Authors:Yiting Wang, Guoheng Sun, Wanghao Ye, Gang Qu, Ang Li
Abstract:
Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements: achieving 83.1% functional correctness on the VerilogEval Machine benchmark, substantially outperforming both comparable-sized models and much larger commercial systems like GPT-4 Turbo. Additionally, our approach demonstrates up to a 2.8X increase in first-attempt functional correctness compared to baseline methods and exhibits robust generalization to unseen designs. To our knowledge, VeriReason represents the first system to successfully integrate explicit reasoning capabilities with reinforcement learning for Verilog generation, establishing a new state-of-the-art for automated RTL synthesis. The models and datasets are available at https://huggingface.co/collections/AI4EDA-CASE and the code is available at https://github.com/NellyW8/VeriReason
中文:VeriReason提出了一种融合监督微调与强化学习的新框架,通过集成测试评估与自检机制,在VerilogEval基准测试中实现了83.1%功能正确率的突破性表现,显著优于现有大型商业模型。
English: VeriReason introduces a novel framework combining supervised fine-tuning with reinforcement learning to automate RTL code generation, achieving state-of-the-art 83.1% functional correctness on benchmarks while embedding self-checking mechanisms for error correction.
Authors:Ziyao Cui, Minxing Zhang, Jian Pei
Abstract:
Nowadays, Large Language Models (LLMs) are trained on huge datasets, some including sensitive information. This poses a serious privacy concern because privacy attacks such as Membership Inference Attacks (MIAs) may detect this sensitive information. While knowledge distillation compresses LLMs into efficient, smaller student models, its impact on privacy remains underexplored. In this paper, we investigate how knowledge distillation affects model robustness against MIA. We focus on two questions. First, how is private data protected in teacher and student models? Second, how can we strengthen privacy preservation against MIAs in knowledge distillation? Through comprehensive experiments, we show that while teacher and student models achieve similar overall MIA accuracy, teacher models better protect member data, the primary target of MIA, whereas student models better protect non-member data. To address this vulnerability in student models, we propose 5 privacy-preserving distillation methods and demonstrate that they successfully reduce student models' vulnerability to MIA, with ensembling further stabilizing the robustness, offering a reliable approach for distilling more secure and efficient student models. Our implementation source code is available at https://github.com/richardcui18/MIA_in_KD.
中文摘要:本研究探讨了知识蒸馏对大型语言模型隐私保护的影响,发现学生模型在成员数据上对成员推理攻击更为脆弱,并提出了五种有效方法以增强其安全性。
English Summary: This study explores how knowledge distillation affects privacy in large language models, revealing that student models are more vulnerable to membership inference attacks on member data, and proposes five effective methods to enhance their security.
Authors:Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero
Abstract:
Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework, we discover that SAEs generalise "$k$-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "$k$-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
中文: 本研究通过将稀疏自编码器置于样条理论框架中,深化了对其的理论理解,揭示了其在可解释性与准确性之间的权衡,并提出了新颖的PAM-SGD算法,在实验中展现出更优的样本效率和稀疏性。
English: This study enhances the theoretical understanding of sparse autoencoders (SAEs) by framing them within spline theory, revealing their trade-off between interpretability and accuracy compared to optimal piecewise affine autoencoders, and introduces a novel PAM-SGD algorithm that demonstrates improved sample efficiency and sparsity in experiments.
Authors:Yang Tan, Wenrui Gou, Bozitao Zhong, Liang Hong, Huiqun Yu, Bingxin Zhou
Abstract:
Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at https://github.com/ai4protein/VenusX.
中文: VenusX作为首个在残基、片段和结构域层面进行精细蛋白质功能注释和配对的大规模基准,通过涵盖多种任务和场景,为模型评估提供了全面支持。
English: VenusX is introduced as the first large-scale benchmark for fine-grained protein functional annotation and pairing at residue, fragment, and domain levels, enabling comprehensive evaluation of models across diverse tasks and scenarios.
Authors:Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, Jian Zhang
Abstract:
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.
中文摘要:EgoDex数据集通过苹果Vision Pro设备采集了829小时包含实时3D手部追踪的自我中心视角视频,覆盖194种日常家居操作任务,有效解决了操作模仿学习中的数据稀缺问题,为机器人和计算机视觉领域提供了重要资源。
English Summary: The EgoDex dataset addresses data scarcity in imitation learning for manipulation by providing 829 hours of egocentric video with real-time 3D hand tracking, collected using Apple Vision Pro across 194 household tasks, enabling advancements in robotics and computer vision.
Authors:Keenan Eikenberry, Lizuo Liu, Yoonsang Lee
Abstract:
This work investigates the use of Wasserstein correlation -- a normalized measure of statistical dependence based on the Wasserstein distance between a joint distribution and the product of its marginals -- for unsupervised representation learning. Unlike, for example, contrastive methods, which naturally cluster classes in the latent space, we find that an (auto)encoder trained to maximize Wasserstein correlation between the input and encoded distributions instead acts as a compressor, reducing dimensionality while approximately preserving the topological and geometric properties of the input distribution. More strikingly, we show that Wasserstein correlation maximization can be used to arrive at an (auto)encoder -- either trained from scratch, or else one that extends a frozen, pretrained model -- that is approximately invariant to a chosen augmentation, or collection of augmentations, and that still approximately preserves the structural properties of the non-augmented input distribution. To do this, we first define the notion of an augmented encoder using the machinery of Markov-Wasserstein kernels. When the maximization objective is then applied to the augmented encoder, as opposed to the underlying, deterministic encoder, the resulting model exhibits the desired invariance properties. Finally, besides our experimental results, which show that even simple feedforward networks can be imbued with invariants or can, alternatively, be used to impart invariants to pretrained models under this training process, we additionally establish various theoretical results for optimal transport-based dependence measures. Code is available at https://github.com/keenan-eikenberry/wasserstein_correlation_maximization .
中文: 本研究提出将Wasserstein相关性作为无监督表征学习的工具,证明其能使(自)编码器在压缩数据的同时保持结构特征,并对特定数据增强具备不变性。
English: This study introduces Wasserstein correlation as a tool for unsupervised representation learning, demonstrating that it enables (auto)encoders to compress data while preserving its structure and achieve invariance to specified augmentations.
Authors:Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
Abstract:
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
中文: 本研究通过利用Blackwell GPU的FP4张量核心将注意力推理速度提升5倍,并开创性地将8位注意力应用于训练任务,在微调中实现无损性能,但预训练收敛较慢。
English: This study enhances attention efficiency by utilizing FP4 Tensor Cores in Blackwell GPUs for a 5x inference speedup and pioneers 8-bit attention for training, achieving lossless fine-tuning results despite slower pretraining convergence.
Authors:Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann
Abstract:
Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.
中文摘要:FlashIPA是对不变点注意力算法的优化重构,实现了与序列长度的线性计算扩展,可在保持或提升性能的同时支持更长序列的训练。
English Summary: FlashIPA is a computationally efficient reformulation of Invariant Point Attention that achieves linear scaling with sequence length, enabling training on longer sequences while maintaining or improving performance.
Authors:Stylianos Stasinos, Martino Mensio, Elena Lazovik, Athanasios Trantas
Abstract:
Biodiversity research requires complete and detailed information to study ecosystem dynamics at different scales. Data-driven methods like Machine Learning are gaining traction in ecology and, more specifically, in biodiversity research, offering alternative modelling pathways. For these methods to deliver accurate results, there is a need for large, curated and multimodal datasets that offer granular spatial and temporal resolutions. In this work, we introduce BioCube, a multimodal, fine-grained global dataset for ecology and biodiversity research. BioCube incorporates species observations through images, audio recordings and descriptions, environmental DNA, vegetation indices, agricultural, forest, and land indicators, and high-resolution climate variables. All observations are geospatially aligned under the WGS84 geodetic system, spanning from 2000 to 2020. The dataset will become available at https://huggingface.co/datasets/BioDT/BioCube while the acquisition and processing code base is available at https://github.com/BioDT/bfm-data.
中文: BioCube是一个面向生态与生物多样性研究的多模态全球数据集,整合了2000至2020年间具有精细时空分辨率的多源数据,为机器学习在生态学中的应用提供支持。
English: BioCube is a comprehensive global dataset for biodiversity research, integrating multimodal data from 2000 to 2020 with fine-grained spatial and temporal resolution to support machine learning applications in ecology.
Authors:Tianyi Shi, Zhu Meng, Yue Chen, Siyang Zheng, Fei Su, Jin Huang, Changrui Ren, Zhicheng Zhao
Abstract:
Time series forecasting faces two important but often overlooked challenges. Firstly, the inherent random noise in the time series labels sets a theoretical lower bound for the forecasting error, which is positively correlated with the entropy of the labels. Secondly, neural networks exhibit a frequency bias when modeling the state-space of time series, that is, the model performs well in learning certain frequency bands but poorly in others, thus restricting the overall forecasting performance. To address the first challenge, we prove a theorem that there exists a unitary transformation that can reduce the marginal entropy of multiple correlated Gaussian processes, thereby providing guidance for reducing the lower bound of forecasting error. Furthermore, experiments confirm that Discrete Fourier Transform (DFT) can reduce the entropy in the majority of scenarios. Correspondingly, to alleviate the frequency bias, we jointly introduce supervision in the frequency domain along the temporal dimension through DFT and Discrete Wavelet Transform (DWT). This supervision-side strategy is highly general and can be seamlessly integrated into any supervised learning method. Moreover, we propose a novel loss function named OLMA, which utilizes the frequency domain transformation across both channel and temporal dimensions to enhance forecasting. Finally, the experimental results on multiple datasets demonstrate the effectiveness of OLMA in addressing the above two challenges and the resulting improvement in forecasting accuracy. The results also indicate that the perspectives of entropy and frequency bias provide a new and feasible research direction for time series forecasting. The code is available at: https://github.com/Yuyun1011/OLMA-One-Loss-for-More-Accurate-Time-Series-Forecasting.
中文: 本研究针对时间序列预测中的两个关键挑战——随机噪声设定的理论误差下界和神经网络频率偏差,提出通过酉变换降低标签熵并引入频域监督与新型OLMA损失函数,在多个数据集上显著提升了预测精度。
English: This study addresses two key challenges in time series forecasting—random noise setting a theoretical error bound and neural network frequency bias—by proposing a unitary transformation to reduce label entropy and introducing frequency domain supervision with a novel OLMA loss function, which significantly improves forecasting accuracy across multiple datasets.
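As a rough illustration of the supervision-side idea, the sketch below adds a DFT-domain term along the temporal dimension to an ordinary forecasting loss. The abstract does not give OLMA's exact weighting, its channel-dimension transform, or the DWT branch, so `alpha` and the plain MSE terms here are assumptions.

```python
import torch

def frequency_supervised_loss(pred, target, alpha=0.5):
    """Sketch: combine a time-domain loss with a DFT-domain loss along the
    temporal dimension. pred, target: (batch, time, channels)."""
    time_loss = torch.mean((pred - target) ** 2)
    pred_f = torch.fft.rfft(pred, dim=1)          # DFT along the temporal axis
    target_f = torch.fft.rfft(target, dim=1)
    freq_loss = torch.mean(torch.abs(pred_f - target_f) ** 2)
    return (1 - alpha) * time_loss + alpha * freq_loss
```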
Authors:Adrian Robert Minut, Tommaso Mencattini, Andrea Santilli, Donato Crisostomi, Emanuele Rodolà
Abstract:
Model merging allows combining the capabilities of existing models into a new one - post hoc, without additional training. This has made it increasingly popular thanks to its low cost and the availability of libraries that support merging on consumer GPUs. Recent work shows that pairing merging with evolutionary algorithms can boost performance, but no framework currently supports flexible experimentation with such strategies in language models. We introduce Mergenetic, an open-source library for evolutionary model merging. Mergenetic enables easy composition of merging methods and evolutionary algorithms while incorporating lightweight fitness estimators to reduce evaluation costs. We describe its design and demonstrate that Mergenetic produces competitive results across tasks and languages using modest hardware.
中文: Mergenetic 是一个开源库,它通过结合多种模型融合方法和进化算法,并采用轻量级适应度评估器,在普通硬件上实现了跨任务和语言的优异性能。
English: Mergenetic is an open-source library that facilitates evolutionary model merging by combining various merging methods and algorithms with lightweight fitness estimators, delivering competitive performance across tasks and languages on modest hardware.
Authors:Petr Kasalický, Martin Spišák, Vojtěch Vančura, Daniel Bohuněk, Rodrigo Alves, Pavel Kordík
Abstract:
Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at https://github.com/recombee/CompresSAE.
中文摘要:本文提出一种可学习的嵌入压缩技术,通过将稠密嵌入投影到高维稀疏空间,在保持检索性能的同时显著降低内存需求,适用于资源受限的大规模推荐系统部署。
English Summary: The paper introduces a learnable embedding compression method that projects dense embeddings into a high-dimensional sparse space, reducing memory usage while maintaining retrieval performance for large-scale recommender systems.
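One plausible reading of "projects dense embeddings into a high-dimensional, sparsely activated space" is a learned linear projection followed by top-k sparsification, as sketched below. The dimensions, the ReLU, and the top-k rule are illustrative assumptions rather than the released CompresSAE design.

```python
import torch
import torch.nn as nn

class SparseProjection(nn.Module):
    """Sketch: map a dense embedding into a wider, sparsely activated code by
    keeping only the top-k activations of a learned linear projection."""
    def __init__(self, dense_dim=256, sparse_dim=4096, k=32):
        super().__init__()
        self.proj = nn.Linear(dense_dim, sparse_dim)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.proj(x))
        topk = torch.topk(z, self.k, dim=-1)
        # Zero everything except the k largest activations per row.
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
```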
Authors:Raja Gond, Nipun Kwatra, Ramachandran Ramjee
Abstract:
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead.
We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains.
Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
Chinese: TokenWeave通过令牌分割技术和优化内核,在分布式大语言模型推理中重叠通信与计算以减少开销,实现了高达1.29倍的延迟加速和1.26倍的吞吐量提升。
English: TokenWeave introduces a token-splitting technique and optimized kernels to reduce overheads in distributed LLM inference by overlapping communication with computation, achieving up to 1.29x speedup in latency and 1.26x higher throughput.
Authors:Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K. R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro
Abstract:
Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction however, there have been numerous undocumented changes which inhibit a fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release a new open-source version of Meta-World (https://github.com/Farama-Foundation/Metaworld/) that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.
中文: 本研究澄清了Meta-World中阻碍算法公平比较的未记录变更,并发布了新的开源版本,确保结果可复现、提升使用便利性,并支持任务集自定义。
English: This work clarifies undocumented changes in Meta-World that hindered fair algorithm comparisons and releases a new open-source version ensuring reproducibility, improved usability, and customizable task sets.
Authors:Wilson Wongso, Hao Xue, Flora D. Salim
Abstract:
Understanding human mobility through Point-of-Interest (POI) recommendation is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 12 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI recommendation models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI recommendation. The dataset and benchmarking code are available at: https://github.com/cruiseresearchgroup/Massive-STEPS
中文: Massive-STEPS数据集通过提供近期、大规模且地理多样性的基准数据,解决了兴趣点推荐研究中的局限性,以促进人类移动性研究的可复现性和公平性。
English: The Massive-STEPS dataset addresses limitations in POI recommendation research by providing a recent, large-scale, and geographically diverse benchmark to enhance the study of human mobility.
Authors:Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Abstract:
Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.
中文: 本文提出CDGLT这一训练高效的多模态隐喻识别框架,通过概念漂移和提示构建策略弥合字面与比喻特征间的鸿沟,在降低计算成本的同时于MET-Meme基准测试中实现最优性能。
English: This paper introduces CDGLT, a training-efficient framework for multimodal metaphor identification that uses Concept Drift and prompt construction to bridge literal-figurative gaps while reducing computational costs, achieving state-of-the-art performance on the MET-Meme benchmark.
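The Concept Drift step hinges on SLERP between cross-modal CLIP embeddings. A standard SLERP implementation is shown below; the interpolation weight `t` is an illustrative choice, and how CDGLT selects the two endpoints and the weight is specified in the paper, not here.

```python
import torch

def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two (batched) embeddings,
    sketching the cross-modal 'concept drift' step."""
    a_n = a / (a.norm(dim=-1, keepdim=True) + eps)
    b_n = b / (b.norm(dim=-1, keepdim=True) + eps)
    cos = (a_n * b_n).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    omega = torch.acos(cos)                      # angle between the embeddings
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
```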
Authors:Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan
Abstract:
From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.
Chinese: 本文提出了一种采用随机正交变换的模型合并技术,通过解耦任务特定参数调整来减少干扰,在无需为新增模型分配额外内存的情况下,提升了视觉和语言任务的性能。
English: This paper introduces a model merging technique using random orthogonal transformations to decorrelate task-specific parameter adjustments, reducing interference and improving performance across vision and language tasks without requiring extra memory for new models.
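A highly simplified sketch of the compress-and-retrieve view: each task delta is rotated by a seed-defined random orthogonal matrix before summation, and retrieval undoes only the target task's rotation, so the remaining deltas stay decorrelated and approximately self-cancel. The placement of the transform and the absence of any normalization here are assumptions, not the paper's exact construction.

```python
import torch

def random_orthogonal(dim, seed):
    """Seed-defined random orthogonal matrix via QR of a Gaussian matrix."""
    gen = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=gen))
    return q

def merge_deltas(deltas, seeds):
    """Store a single tensor: the sum of seed-rotated task deltas."""
    return sum(random_orthogonal(d.shape[0], s) @ d for d, s in zip(deltas, seeds))

def retrieve_delta(merged, seed):
    """Undo the target task's rotation; the other tasks' contributions remain
    randomly rotated and approximately self-cancel."""
    return random_orthogonal(merged.shape[0], seed).T @ merged
```

Because each rotation is fully determined by its seed, adding a new model stores no extra matrices, which mirrors the memory argument in the abstract.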
Authors:Chenhong Zhou, Jie Chen, Zaifeng Yang, Ching Eng Png
Abstract:
Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by enforcing the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) into the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between PDE residual loss and condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty based on the loss records, intra-balancing can allocate the aggregated weight proportionally to each condition loss according to its fitting difficulty level. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN achieves significantly superior performance than those popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at https://github.com/chenhong-zhou/DualBalanced-PINNs.
中文: 本文提出了一种双平衡物理信息神经网络(DB-PINN),通过集成间平衡和内部平衡机制动态调整损失权重,有效解决了PINNs中的梯度失衡和条件拟合难度不均问题,在收敛速度和预测精度上显著优于主流方法。
English: This paper introduces a Dual-Balanced PINN (DB-PINN) that dynamically adjusts loss weights through inter-balancing and intra-balancing mechanisms to address gradient and fitting imbalances in PINNs, achieving superior convergence speed and prediction accuracy compared to existing methods.
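The two balancing steps can be sketched as follows, assuming inter-balancing uses a gradient-norm ratio and intra-balancing uses recent loss values as a difficulty proxy; neither assumption reproduces the paper's exact formulas or its robust weight-update rule.

```python
import torch

def dual_balance_weights(pde_loss, cond_losses, params, recent_cond_losses):
    """Sketch of dual balancing for PINN loss weights (assumed form).
    pde_loss: PDE residual loss; cond_losses: list of BC/IC losses;
    recent_cond_losses: per-condition recent loss values (difficulty proxy)."""
    def grad_norm(loss):
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        return torch.cat([g.flatten() for g in grads if g is not None]).norm()

    # Inter-balancing: one aggregated weight offsetting the gradient gap
    # between the PDE residual and the condition-fitting losses.
    cond_norms = torch.stack([grad_norm(l) for l in cond_losses])
    aggregated = grad_norm(pde_loss) / (cond_norms.mean() + 1e-12)

    # Intra-balancing: split the aggregated weight by fitting difficulty.
    difficulty = torch.tensor(recent_cond_losses)
    shares = difficulty / difficulty.sum()
    return aggregated * shares          # one weight per condition loss
```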
Authors:Lin Zhu, Yijun Bian, Lei You
Abstract:
Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel pre-processing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk - an individual fairness metric - while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias. Our code is available at https://github.com/youlei202/FairSHAP.
中文: FairSHAP是一种新颖的预处理框架,利用Shapley值识别并修改训练数据中影响公平性的关键实例,在保持数据完整性和模型准确性的同时,显著提升了个体与群体公平性。
English: FairSHAP is a novel pre-processing framework that uses Shapley values to identify and modify fairness-critical instances in training data, enhancing both individual and group fairness while preserving data integrity and model accuracy.
Authors:Guangqiang Li, M. Amine Atoui, Xiangshun Li
Abstract:
Deep learning methods have shown promising performance in fault diagnosis for multimode process. Most existing studies assume that the collected health state categories from different operating modes are identical. However, in real industrial scenarios, these categories typically exhibit only partial overlap. The incompleteness of the available data and the large distributional differences between the operating modes pose a significant challenge to existing fault diagnosis methods. To address this problem, a novel fault diagnosis model named self-adaptive temporal-spatial attention network (TSA-SAN) is proposed. First, inter-mode mappings are constructed using healthy category data to generate multimode samples. To enrich the diversity of the fault data, interpolation is performed between healthy and fault samples. Subsequently, the fault diagnosis model is trained using real and generated data. The self-adaptive instance normalization is established to suppress irrelevant information while retaining essential statistical features for diagnosis. In addition, a temporal-spatial attention mechanism is constructed to focus on the key features, thus enhancing the generalization ability of the model. The extensive experiments demonstrate that the proposed model significantly outperforms the state-of-the-art methods. The code will be available on Github at https://github.com/GuangqiangLi/TSA-SAN.
中文摘要:提出的自适应时空注意力网络(TSA-SAN)通过生成合成数据并采用自适应归一化和注意力机制,解决了多模态故障诊断中类别部分重叠和分布差异的问题,实验证明其性能显著优于现有方法。
English Summary: The proposed self-adaptive temporal-spatial attention network (TSA-SAN) addresses partial category overlap and distribution differences in multimode fault diagnosis by generating synthetic data and employing adaptive normalization with attention mechanisms, demonstrating superior performance over existing methods.
Authors:Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
Abstract:
Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
Chinese: 研究表明,简单的字符串匹配指标如BLEU可有效替代成本高昂的奖励模型来对齐语言模型与人类偏好,提出的BLEUBERI方法直接使用BLEU作为奖励函数,在指令遵循任务中实现与奖励模型指导的强化学习相竞争的性能,同时增强生成内容的事实依据性。
English: The study reveals that simple string-matching metrics like BLEU can effectively replace costly reward models for aligning language models with human preferences, introducing BLEUBERI, a method that leverages BLEU as a reward function to achieve competitive performance in instruction-following tasks while enhancing factual grounding.
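The reward itself is straightforward to sketch with sacrebleu; the exact BLEU configuration, reference handling, and GRPO integration used by BLEUBERI live in the released code, so treat the normalization below as an assumption.

```python
import sacrebleu

def bleu_rewards(completions, references):
    """Sketch: score each sampled completion against its reference output with
    sentence-level BLEU and use the (rescaled) score as the RL reward."""
    return [
        sacrebleu.sentence_bleu(completion, [reference]).score / 100.0
        for completion, reference in zip(completions, references)
    ]
```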
Authors:Vladimír Boža, Vladimír Macko
Abstract:
Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria.
Code available at: https://github.com/usamec/double_binary
中文摘要:本文提出双二元分解(DBF)方法,通过将稠密权重矩阵分解为两个带缩放向量的二元矩阵,在保持计算效率的同时,实现了优于或媲美现有先进方法的压缩率,并具备对压缩比的精细调控能力。
English Summary: The paper introduces Double Binary Factorization (DBF), a method that decomposes weight matrices into two binary matrices with scaling vectors, maintaining computational efficiency while achieving superior or competitive compression rates compared to state-of-the-art techniques, with the added benefit of fine-grained control over compression ratios.
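Structurally, DBF approximates a dense weight matrix with two sign matrices and scaling vectors. One plausible arrangement of the scales is sketched below; the paper's exact parameterization may place them differently.

```python
import numpy as np

def dbf_reconstruct(s1, B1, s2, B2):
    """Sketch of the Double Binary Factorization structure.
    B1: (m, r) and B2: (r, n) with entries in {-1, +1};
    s1: (m, 1) and s2: (r, 1) are scaling vectors.
    The intermediate dimension r controls the compression ratio."""
    return (s1 * B1) @ (s2 * B2)   # approximates the original (m, n) weight matrix
```

Shrinking or growing r trades reconstruction accuracy against the effective bits per weight, which is the knob the abstract refers to for fine-grained compression control.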
Authors:Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, Tieniu Tan
Abstract:
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways, derived from our theoretical analysis, to significantly improve the scaling performance. We hope that our research encourages a re-examination of the role of complicated prompting, unleashes the potential of simple prompting strategies, and provides new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
Chinese: 本研究发现在测试时计算扩展过程中,随着资源增加,复杂提示策略会逐渐被简单的思维链方法超越,并提出无需大量推理即可预测最优策略并提升扩展性能的高效方法。
English: This study reveals that as computational resources increase during test-time scaling, complex prompting strategies are outperformed by simple Chain-of-Thought, and proposes efficient methods to predict optimal strategies and enhance scaling performance without extensive inference.
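For reference, the scaling setting studied here is plain majority voting: sample N reasoning traces per question, extract the final answers, and return the most frequent one. A minimal sketch:

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the most frequent answer among N sampled generations."""
    return Counter(sampled_answers).most_common(1)[0][0]

# e.g. majority_vote(["42", "42", "41", "42", "40"]) -> "42"
```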
Authors:Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Abstract:
Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
中文:关系图变换器(RelGT)是一种专为多表关系数据设计的新型架构,通过多元素标记化策略结合局部与全局注意力机制,有效解决了传统图神经网络在异质性和时序数据处理上的局限,在基准测试中性能显著优于现有方法。
English: The Relational Graph Transformer (RelGT) is a novel architecture designed to overcome the limitations of traditional Graph Neural Networks in handling heterogeneous and temporal relational data by employing a multi-element tokenization strategy and combining local and global attention mechanisms, achieving superior performance on benchmark tasks.
Authors:Congcong Zhu, Xiaoyan Xu, Jiayue Han, Jingrun Chen
Abstract:
Auto-regressive partial differential equation (PDE) foundation models have shown great potential in handling time-dependent data. However, these models suffer from the shortcut problem deeply rooted in auto-regressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as the pretraining performance may approach random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization ability to the out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data. The code is available at https://github.com/SCAILab-USTC/PITA.
Chinese: 自回归偏微分方程基础模型存在误差累积问题,尤其对分布外数据表现不佳,而提出的物理信息时间对齐(PITA)框架通过自监督学习有效提升了其准确性和鲁棒性。
English: Auto-regressive PDE foundation models face error accumulation issues, especially with out-of-distribution data, but the proposed physics-informed temporal alignment (PITA) framework enhances their accuracy and robustness through self-supervised learning.
Authors:Filippo Leveni, Luca Magri, Cesare Alippi, Giacomo Boracchi
Abstract:
We focus on the problem of identifying samples in a set that do not conform to structured patterns represented by low-dimensional manifolds. An effective way to solve this problem is to embed data in a high dimensional space, called Preference Space, where anomalies can be identified as the most isolated points. In this work, we employ Locality Sensitive Hashing to avoid explicit computation of distances in high dimensions and thus improve Anomaly Detection efficiency. Specifically, we present an isolation-based anomaly detection technique designed to work in the Preference Space which achieves state-of-the-art performance at a lower computational cost. Code is publicly available at https://github.com/ineveLoppiliF/Hashing-for-Structure-based-Anomaly-Detection.
中文摘要:本研究提出一种高效异常检测方法,通过将数据嵌入高维偏好空间并采用局部敏感哈希技术,在保持最优性能的同时显著降低了计算成本。
English Summary: This study introduces an efficient anomaly detection method that identifies outliers by embedding data into a high-dimensional Preference Space and using Locality Sensitive Hashing to reduce computational costs while maintaining state-of-the-art performance.
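The isolation-by-hashing idea above can be illustrated with a toy sketch: points that conform to a low-dimensional structure collide frequently under random-hyperplane LSH, whereas isolated points rarely share a bucket. Everything below (the synthetic data, the sign-based hash, the bucket-occupancy score) is an illustrative stand-in, not the paper's preference embedding or scoring rule.
```python
import numpy as np

# Toy illustration of isolation by hashing: structured points collide often,
# isolated points rarely do. Not the paper's actual algorithm.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 32))
inliers = rng.normal(size=(200, 2)) @ basis      # points near a 2-D structure
outliers = rng.normal(size=(5, 32))              # unstructured, isolated points
X = np.vstack([inliers, outliers])

n_tables, n_bits = 10, 8
occupancy = np.zeros(len(X))
for _ in range(n_tables):
    H = rng.normal(size=(32, n_bits))                 # random hyperplanes through the origin
    codes = (X @ H > 0) @ (1 << np.arange(n_bits))    # one integer bucket id per point
    _, inv, counts = np.unique(codes, return_inverse=True, return_counts=True)
    occupancy += counts[inv]                          # how crowded each point's bucket is

ranking = np.argsort(occupancy)                       # lowest occupancy = most isolated
print(ranking[:5])                                    # the injected outliers should rank here
```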
Authors:Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang
Abstract:
This paper presents AutoRAN, the first automated, weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model's high-level reasoning structures, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model's intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs including GPT-o3/o4-mini and Gemini-2.5-Flash across multiple benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves remarkable success rates (approaching 100%) within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work reveals that leveraging weak reasoning models can effectively exploit the critical vulnerabilities of much more capable reasoning models, highlighting the need for improved safety measures specifically designed for reasoning-based models. The code for replicating AutoRAN and running records are available at: (https://github.com/JACKPURCELL/AutoRAN-public). (warning: this paper contains potentially harmful content generated by LRMs.)
中文: AutoRAN是首个自动化越狱攻击框架,通过利用弱对齐推理模型模拟目标模型的高级推理结构,成功揭示了大型推理模型的关键安全漏洞,在多种先进模型上实现了接近100%的攻击成功率。
English: AutoRAN is the first automated jailbreak framework that uses a weakly-aligned reasoning model to exploit vulnerabilities in advanced large reasoning models, achieving near-perfect success rates and revealing critical safety gaps in reasoning-based AI systems.
Authors:Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao
Abstract:
Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at \href{https://yifei-he.github.io/mergebench/}{https://yifei-he.github.io/mergebench/}.
Authors:Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis
Abstract:
Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.
Authors:Ian Holmes, Min Chi
Abstract:
Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
Chinese: ARES是一种基于注意力机制的新型离线强化学习算法,通过转换器的注意力机制从稀疏或延迟奖励中生成密集奖励函数,能在各种环境中显著提升学习效率且无需在线交互。
English: ARES is a novel offline reinforcement learning algorithm that uses transformer attention mechanisms to create dense reward functions from sparse or delayed rewards, significantly improving learning efficiency across diverse environments without requiring online interaction.
Authors:Sayed Mehedi Azim, Brian Corbett, Iman Dehzangi
Abstract:
The hippocampus, a critical brain structure involved in memory processing and various neurodegenerative and psychiatric disorders, comprises three key subregions: the dentate gyrus (DG), Cornu Ammonis 1 (CA1), and Cornu Ammonis 3 (CA3). Accurate segmentation of these subregions from histological tissue images is essential for advancing our understanding of disease mechanisms, developmental dynamics, and therapeutic interventions. However, no existing methods address the automated segmentation of hippocampal subregions from tissue images, particularly from immunohistochemistry (IHC) images. To bridge this gap, we introduce a novel set of four comprehensive murine hippocampal IHC datasets featuring distinct staining modalities: cFos, NeuN, and multiplexed stains combining cFos, NeuN, and either ΔFosB or GAD67, capturing structural, neuronal activity, and plasticity-associated information. Additionally, we propose ROIsGAN, a region-guided U-Net-based generative adversarial network tailored for hippocampal subregion segmentation. By leveraging adversarial learning, ROIsGAN enhances boundary delineation and structural detail refinement through a novel region-guided discriminator loss combining Dice and binary cross-entropy loss. Evaluated across DG, CA1, and CA3 subregions, ROIsGAN consistently outperforms conventional segmentation models, achieving performance gains ranging from 1-10% in Dice score and up to 11% in Intersection over Union (IoU), particularly under challenging staining conditions. Our work establishes foundational datasets and methods for automated hippocampal segmentation, enabling scalable, high-precision analysis of tissue images in neuroscience research. Our generated datasets, proposed model as a standalone tool, and its corresponding source code are publicly available at: https://github.com/MehediAzim/ROIsGAN
中文: 本研究提出了ROIsGAN这一新型生成对抗网络及四个海马体免疫组化数据集,实现了海马体亚区的自动分割,相比现有方法性能显著提升,为神经科学研究提供了基础资源。
English: This study introduces ROIsGAN, a novel generative adversarial network, along with four comprehensive hippocampal immunohistochemistry datasets to automate the segmentation of hippocampal subregions, achieving significant performance improvements over existing methods and providing foundational resources for neuroscience research.
Authors:Patara Trirat, Jae-Gil Lee
Abstract:
The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.
中文:MONAQ是一种创新框架,它利用大型语言模型将神经架构搜索重构为多目标查询任务,通过处理多模态时序数据和硬件约束,自动生成适用于边缘设备的高效模型,在实验中展现出优越性能。
English: MONAQ is a novel framework that transforms neural architecture search into multi-objective querying tasks using large language models, enabling automated discovery of efficient time-series analysis models for edge deployment while outperforming existing methods.
Authors:Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
Abstract:
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
中文: MuToR是一种创新的多令牌预测方法,通过在输入序列中插入可学习的寄存器令牌来预测未来目标,具有参数增量极少、无需改变模型架构即可兼容现有预训练模型的特点,在语言和视觉任务的微调与预训练中均表现出优越性能。
English: MuToR is a novel multi-token prediction method that integrates learnable register tokens into input sequences to predict future targets, offering minimal parameter overhead, architectural compatibility with existing models, and enhanced performance across fine-tuning and pretraining scenarios in both language and vision tasks.
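To make the register-token idea concrete, the toy sketch below interleaves a special register token after every ordinary token and assigns it a future target d steps ahead, while ordinary positions keep the usual next-token target. The register id, the offset d, and the target layout are illustrative assumptions rather than MuToR's exact recipe.
```python
import torch

# Illustrative interleaving of learnable register tokens into a token sequence.
REG_ID = 50257   # hypothetical id appended to the base vocabulary (assumption)
d = 2            # hypothetical prediction offset for register tokens (assumption)
IGNORE = -100    # standard ignore index for cross-entropy losses

def interleave_registers(input_ids: torch.Tensor):
    """Return the augmented sequence and a per-position target sequence."""
    aug, tgt = [], []
    T = input_ids.size(0)
    for t in range(T):
        aug.append(int(input_ids[t]))
        tgt.append(int(input_ids[t + 1]) if t + 1 < T else IGNORE)  # next-token target
        aug.append(REG_ID)
        tgt.append(int(input_ids[t + d]) if t + d < T else IGNORE)  # future target for register
    return torch.tensor(aug), torch.tensor(tgt)

ids = torch.arange(8)  # toy "token ids"
aug_ids, targets = interleave_registers(ids)
print(aug_ids.tolist())
print(targets.tolist())
```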
Authors:Wenhao Ding, Choon Hwai Yap, Kangjun Ji, Simão Castro
Abstract:
A generative model for the mesh geometry of intracranial aneurysms (IA) is crucial for training networks to predict blood flow forces in real time, which is a key factor affecting disease progression. This need arises from the absence of large IA image datasets. Existing shape generation methods struggle to capture realistic IA features and ignore the relationship between IA pouches and parent vessels, limiting physiological realism; moreover, their outputs cannot be controlled to match specific morphological measurements. We propose AneuG, a two-stage Variational Autoencoder (VAE)-based IA mesh generator. In the first stage, AneuG generates low-dimensional Graph Harmonic Deformation (GHD) tokens to encode and reconstruct aneurysm pouch shapes, constrained by ground-truth morphing energy statistics. GHD enables more accurate shape encoding than alternatives. In the second stage, AneuG generates parent vessels conditioned on GHD tokens, by generating the vascular centreline and propagating the cross-section. AneuG's IA shape generation can further be conditioned on specific clinically relevant morphological measurements. This is useful for studies of shape variations represented by clinical measurements, and for flow simulation studies of how specific clinical shape parameters affect fluid dynamics. Source code and implementation details are available at https://github.com/anonymousaneug/AneuG.
中文摘要:AneuG是一种基于变分自编码器的两阶段生成器,它通过图谐波变形标记编码动脉瘤囊形状并生成母血管,能生成具有生理真实性的颅内动脉瘤网格,并可控制临床相关的形态学测量参数。
English Summary: AneuG is a two-stage VAE-based generator that creates realistic intracranial aneurysm meshes by first encoding pouch shapes with Graph Harmonic Deformation tokens and then generating parent vessels, while enabling control over clinically relevant morphological measurements.
Authors:Kaivalya Rawal, Zihao Fu, Eoin Delaney, Chris Russell
Abstract:
There can be many competing and contradictory explanations for a single model prediction, making it difficult to select which one to use. Current explanation evaluation frameworks measure quality by comparing against ideal "ground-truth" explanations, or by verifying model sensitivity to important inputs. We outline the limitations of these approaches, and propose three desirable principles to ground the future development of explanation evaluation strategies for local feature importance explanations. We propose a ground-truth Agnostic eXplanation Evaluation framework (AXE) for evaluating and comparing model explanations that satisfies these principles. Unlike prior approaches, AXE does not require access to ideal ground-truth explanations for comparison, or rely on model sensitivity - providing an independent measure of explanation quality. We verify AXE by comparing with baselines, and show how it can be used to detect explanation fairwashing. Our code is available at https://github.com/KaiRawal/Evaluating-Model-Explanations-without-Ground-Truth.
Chinese Summary: 本文提出了AXE框架,无需依赖理想基准或模型敏感度即可评估模型解释,克服了现有方法的局限,并能检测解释中的公平性粉饰问题。
English Summary: The paper introduces AXE, a ground-truth agnostic framework for evaluating model explanations without relying on ideal references or model sensitivity, addressing limitations in current methods and enabling detection of explanation fairwashing.
Authors:Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso
Abstract:
Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision. Code is available at https://github.com/rufimelo99/SAE-Java-Bug-Detection
中文摘要:本研究表明稀疏自编码器无需微调即可利用预训练大语言模型的内部表征有效检测Java代码中的软件漏洞,最高达89%的F1分数,性能优于传统基线方法。
English Summary: This study demonstrates that Sparse Autoencoders can effectively detect software bugs in Java code using pre-trained LLMs' internal representations without fine-tuning, achieving up to 89% F1 score and outperforming traditional baselines.
Authors:Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong
Abstract:
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
中文: SpikeVideoFormer是一种高效的脉冲驱动视频Transformer,具有线性时间复杂度,在视频任务中实现最先进性能,同时相比现有方法显著提升了能效。
English: SpikeVideoFormer is an efficient spike-driven video Transformer with linear temporal complexity that achieves state-of-the-art performance in video tasks while significantly improving energy efficiency over existing methods.
Authors:Gabriel S. Gama, Valdir Grassi
Abstract:
Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task Learning by addressing issues like conflicting gradients and differing gradient norms, which hinder equal-weighted task training. However, recent critiques suggest that equally weighted tasks can achieve competitive results compared to SMTOs, arguing that previous SMTO results were influenced by poor hyperparameter optimization and lack of regularization. In this work, we evaluate these claims through an extensive empirical evaluation of SMTOs, including some of the latest methods, on more complex multi-task problems to clarify this behavior. Our findings indicate that SMTOs perform well compared to uniform loss and that fixed weights can achieve competitive performance compared to SMTOs. Furthermore, we demonstrate why uniform loss performs similarly to SMTOs in some instances. The source code is available at https://github.com/Gabriel-SGama/UnitScal_vs_SMTOs.
中文: 专业多任务优化器(SMTOs)能有效平衡任务学习,与统一损失加权表现相当,同时固定权重在多任务学习中也展现出竞争力。
English: Specialized Multi-Task Optimizers (SMTOs) effectively balance task learning and can perform comparably to uniform loss weighting, while fixed weights also achieve competitive results in multi-task learning scenarios.
Authors:Alan Jeffares, Liyuan Liu
Abstract:
Variational Autoencoders (VAEs) are well-established as a principled approach to probabilistic unsupervised learning with neural networks. Typically, an encoder network defines the parameters of a Gaussian distributed latent space from which we can sample and pass realizations to a decoder network. This model is trained to reconstruct its inputs and is optimized through the evidence lower bound. In recent years, discrete latent spaces have grown in popularity, suggesting that they may be a natural choice for many data modalities (e.g. text). In this tutorial, we provide a rigorous, yet practical, introduction to discrete variational autoencoders -- specifically, VAEs in which the latent space is made up of latent variables that follow a categorical distribution. We assume only a basic mathematical background with which we carefully derive each step from first principles. From there, we develop a concrete training recipe and provide an example implementation, hosted at https://github.com/alanjeffares/discreteVAE.
中文: 本教程提供了关于离散变分自编码器的严谨而实用的介绍,它使用分类分布的潜在变量,并包含从基本原理出发的详细推导、训练指南及示例实现。
English: This tutorial offers a practical and rigorous introduction to discrete variational autoencoders, which utilize categorical latent variables, and includes a detailed derivation from basic principles along with a training guide and sample implementation.
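As a companion to the tutorial's scope, here is a minimal categorical-latent VAE in PyTorch. It uses the common Gumbel-Softmax relaxation with a uniform prior; the tutorial's own training recipe may handle the discrete sampling step differently, so treat this as a generic sketch rather than the reference implementation.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic categorical-latent VAE sketch: the encoder outputs logits for
# n_latents categorical variables with n_cats categories each; sampling uses
# the Gumbel-Softmax relaxation, and the KL term is against a uniform prior.

class DiscreteVAE(nn.Module):
    def __init__(self, x_dim=784, n_latents=20, n_cats=10, tau=1.0):
        super().__init__()
        self.n_latents, self.n_cats, self.tau = n_latents, n_cats, tau
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_latents * n_cats))
        self.dec = nn.Sequential(nn.Linear(n_latents * n_cats, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        logits = self.enc(x).view(-1, self.n_latents, self.n_cats)
        z = F.gumbel_softmax(logits, tau=self.tau, hard=True)   # straight-through discrete sample
        x_logits = self.dec(z.view(x.size(0), -1))
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='none').sum(-1)
        q = F.softmax(logits, dim=-1)                           # posterior over categories
        kl = (q * (q.clamp_min(1e-10).log() + math.log(self.n_cats))).sum(dim=(-1, -2))
        return (recon + kl).mean()                              # negative ELBO

loss = DiscreteVAE()(torch.rand(8, 784))                        # toy batch of "images"
print(loss.item())
```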
Authors:Pavel Korotaev, Petr Surovtsev, Alexander Kapitanov, Karina Kvanchiani, Aleksandr Nagaev
Abstract:
Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE) operating on keypoints as tensors. Such keypoint composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoint coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder that benefits from both RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.
中文: 本文提出HandReader,一套包含三种架构的手指拼写识别系统,通过新颖时序模块处理RGB和关键点数据,在现有及新推出的俄语数据集上均取得最优性能。
English: This paper introduces HandReader, a set of three architectures for fingerspelling recognition that achieve state-of-the-art results by processing RGB and keypoint data through novel temporal modules, validated on both existing and a new Russian dataset.
Authors:Xiangwen Zhuge, Xu Shen, Zeyu Wang, Fan Dang, Xuan Ding, Danyang Li, Yahui Han, Tianxiang Hao, Zheng Yang
Abstract:
Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead between the CPU and GPU. This leads to two major inefficiencies: (1) GPU cores are underutilized, often remaining idle while waiting for data to be loaded; and (2) GPU memory has low impact on performance, as reducing its capacity has minimal effect on overall throughput. In this paper, we propose SpecOffload, a high-throughput inference engine that embeds speculative decoding into offloading. Our key idea is to unlock latent GPU resources for storing and executing a draft model used for speculative decoding, thus accelerating inference at near-zero additional cost. To support this, we carefully orchestrate the interleaved execution of target and draft models in speculative decoding within the offloading pipeline, and propose a planner to manage tensor placement and select optimal parameters. Compared to the best baseline, SpecOffload improves GPU core utilization by 4.49x and boosts inference throughput by 2.54x. Our code is available at https://github.com/MobiSense/SpecOffload-public.
中文: SpecOffload通过在卸载机制中嵌入推测性解码,有效利用GPU潜在资源,在几乎零额外成本下大幅提升了推理吞吐量和GPU核心利用率。
English: SpecOffload enhances LLM inference efficiency on resource-limited devices by integrating speculative decoding with offloading, boosting GPU utilization and throughput significantly without extra cost.
Authors:Saikat Barua, Mostafizur Rahman, Shehenaz Khaled, Md Jafor Sadek, Rafiul Islam, Shahnewaz Siddique
Abstract:
The emergence of hybrid quantum-classical machine learning (HQML) models opens new horizons of computational intelligence, but their fundamental complexity frequently leads to black-box behavior that undermines transparency and reliability in their application. Although XAI for quantum systems is still in its infancy, a major research gap is evident in robust global and local explainability approaches designed for HQML architectures that employ quantized feature encoding followed by classical learning. This gap is the focus of this work, which introduces QuXAI, a framework built upon Q-MEDLEY, an explainer for feature importance in these hybrid systems. Our approach entails creating HQML models that incorporate quantum feature maps, applying Q-MEDLEY, which combines feature-based inferences while preserving the quantum transformation stage, and visualizing the resulting attributions. Our results show that Q-MEDLEY delineates influential classical aspects in HQML models, separates their noise, and competes well against established XAI techniques in classical validation settings. Ablation studies further expose the virtues of the composite structure used in Q-MEDLEY. The implications of this work are critically important, as it provides a route to improving the interpretability and reliability of HQML models, thus promoting greater confidence and enabling safer, more responsible use of quantum-enhanced AI technology.
Our code and experiments are open-sourced at: https://github.com/GitsSaikat/QuXAI
中文: 本文提出的QuXAI框架利用Q-MEDLEY增强混合量子-经典机器学习模型的可解释性,通过识别关键特征和分离噪声来提高模型透明度与可靠性。
English: This paper introduces QuXAI, a framework using Q-MEDLEY to enhance explainability in hybrid quantum-classical machine learning models by identifying influential features and separating noise, thereby improving their transparency and reliability.
Authors:Jing-Cheng Pang, Kaiyuan Li, Yidi Wang, Si-Hang Yang, Shengyi Jiang, Yang Yu
Abstract:
A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that leverage both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts; (2) diverse domains of environments covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions with varying complexity levels to facilitate language-conditioned policy learning. Through systematic evaluation of state-of-the-art offline RL algorithms, we observe that simply applying existing offline RL algorithms leads to suboptimal performance on unseen tasks, achieving a 35.44% success rate on hard tasks, in contrast to 64.37% for methods trained on real rollouts. This result highlights the need for algorithm advancements to better leverage LLM-imaginary rollouts. Additionally, we identify key opportunities for future research, including better utilization of imaginary rollouts, fast online adaptation and continual learning, and extension to multi-modal tasks. Our code is publicly available at https://github.com/LAMDA-RL/ImagineBench.
中文:ImagineBench作为首个综合性基准被提出,旨在解决利用真实和语言模型生成虚拟经验的离线强化学习算法缺乏标准化评估的问题,揭示了现有方法在未见任务上表现欠佳,并强调需改进算法以更好地利用合成数据。
English: ImagineBench is introduced as the first comprehensive benchmark to address the lack of standardized evaluation for offline reinforcement learning algorithms that utilize both real and LLM-generated imaginary rollouts, revealing suboptimal performance of existing methods and highlighting the need for advancements to better leverage synthetic data.
Authors:Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
Abstract:
As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
中文: 本文提出了一种针对大型语言模型的高效越狱技术,采用指数梯度下降与Bregman投影方法,相比现有技术具有更高的成功率和更强的效率优势。
English: This paper introduces an efficient jailbreaking technique for Large Language Models using exponentiated gradient descent with Bregman projection, which demonstrates higher success rates and greater efficiency compared to existing methods.
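The core mechanism referenced in the abstract, exponentiated gradient descent whose multiplicative update plus renormalization acts as the Bregman (KL) projection onto the probability simplex, can be sketched generically as follows; this shows only the standard update rule, not the paper's attack pipeline.
```python
import numpy as np

# One exponentiated-gradient (mirror descent with KL Bregman divergence) step
# on the probability simplex: the multiplicative update followed by
# normalization keeps the iterate nonnegative and summing to one.

def eg_step(p: np.ndarray, grad: np.ndarray, eta: float = 0.1) -> np.ndarray:
    p_new = p * np.exp(-eta * grad)
    return p_new / p_new.sum()

p = np.ones(5) / 5                                  # start at the uniform distribution
grad = np.array([0.3, -0.1, 0.5, 0.0, -0.4])        # toy gradient of some loss
for _ in range(3):
    p = eg_step(p, grad)
print(p, p.sum())                                   # stays on the simplex
```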
Authors:Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian
Abstract:
Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models, including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct, to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
中文: 本研究提出了一种可扩展的方法,使大语言模型能够适应噪声大、精度低的模拟硬件运行,让Phi-3和Llama-3.2等模型在保持与数字基准相当性能的同时,弥合了大模型与高效能模拟计算之间的鸿沟。
English: This work introduces a scalable method to adapt large language models for execution on noisy, low-precision analog hardware, enabling models like Phi-3 and Llama-3.2 to maintain performance comparable to digital baselines while bridging the gap between high-capacity LLMs and energy-efficient analog computing.
Authors:Long Chen, Xiaotian Song, Yanan Sun
Abstract:
Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers develop different ANN-to-SNN conversion methods by leveraging pre-trained ANN parameters while inheriting the energy efficiency of SNNs. However, existing conversion methods struggle with extreme activation outliers and incompatible nonlinear operations of ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel neurons to convert the activation outliers and nonlinear operations of ANN-based LLMs. Moreover, LAS tailors the spike-equivalent Transformer components for spiking LLMs, which can ensure full spiking conversion without any loss of performance. Experimental results on six language models and two vision-language models demonstrate that LAS achieves loss-less conversion. Notably, on OPT-66B, LAS even improves accuracy by 2\% on the WSC task. In addition, parameter and ablation studies further verify the effectiveness of LAS. The source code is available at https://github.com/lc783/LAS
Chinese: LAS通过解决激活异常值和非线性操作问题,实现了尖峰大语言模型的无损转换,在保持全脉冲驱动的同时不损失性能。
English: LAS introduces a loss-less conversion method for spiking large language models by addressing activation outliers and nonlinear operations, achieving full spiking performance without accuracy compromise.
Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Abstract:
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
中文: 本文提出多样性感知奖励调整(DRA)方法,通过子模互信息将语义多样性融入奖励计算,在数学推理基准测试中以极低资源实现了最优性能。
English: This paper introduces Diversity-aware Reward Adjustment (DRA), a method that enhances reinforcement learning for language models by incorporating semantic diversity into rewards using Submodular Mutual Information, leading to state-of-the-art performance on mathematical reasoning benchmarks with minimal resources.
Authors:Xixuan Hao, Yutian Jiang, Xingchen Zou, Jiabo Liu, Yifang Yin, Yuxuan Liang
Abstract:
Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in LI. An up-to-date paper list can be found at https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will undergo continuous updates.
中文总结:位置智能正通过深度学习和大型语言模型的地理空间表示学习技术发生变革,实现了自动化特征提取与跨模态推理,本综述系统回顾了该领域进展并提出了未来研究方向。
English Summary: Location Intelligence is being transformed by geospatial representation learning through deep learning and large language models, which enable automated feature extraction and cross-modal reasoning, with this survey providing a comprehensive review, taxonomy, and future research directions.
Authors:Nick Sunday
Abstract:
The proliferation of Text-to-Music (TTM) platforms has democratized music creation, enabling users to effortlessly generate high-quality compositions. However, this innovation also presents new challenges to musicians and the broader music industry. This study investigates the detection of AI-generated songs using the FakeMusicCaps dataset by classifying audio as either deepfake or human. To simulate real-world adversarial conditions, tempo stretching and pitch shifting were applied to the dataset. Mel spectrograms were generated from the modified audio, then used to train and evaluate a convolutional neural network. In addition to presenting technical results, this work explores the ethical and societal implications of TTM platforms, arguing that carefully designed detection systems are essential to both protecting artists and unlocking the positive potential of generative AI in music.
中文: 本研究通过改进的音频频谱图开发了一种基于卷积神经网络的AI生成歌曲检测方法,强调此类检测系统对于平衡伦理问题与生成式AI在音乐中的积极潜力至关重要。
English: This study develops a CNN-based method using modified audio spectrograms to detect AI-generated songs, highlighting the need for such detection systems to balance ethical concerns with the benefits of generative AI in music.
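The preprocessing described (tempo stretching, pitch shifting, mel spectrograms) can be sketched with librosa. The synthetic tone and the parameter values below are placeholders for illustration, not the study's actual data or settings.
```python
import numpy as np
import librosa

# Perturb tempo and pitch, then compute a log-mel spectrogram as CNN input.
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(2 * sr) / sr)   # 2 s tone stands in for a song clip

y_aug = librosa.effects.time_stretch(y, rate=1.1)              # tempo stretching
y_aug = librosa.effects.pitch_shift(y_aug, sr=sr, n_steps=2)   # pitch shifting

mel = librosa.feature.melspectrogram(y=y_aug, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)                 # shape: (n_mels, frames)
print(log_mel.shape)
```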
Authors:Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
Abstract:
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement over Qwen2.5-Omni in objective accuracy, from 53.4$\%$ to 91.5$\%$. In subjective A/B testing, WavReward also leads by a margin of 83$\%$. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be made publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.
中文: 该摘要提出了WavReward,一种基于音频的奖励模型,通过利用音频语言模型和新数据集ChatReward-30K来评估口语对话系统的智商和情商,在准确性和主观测试中显著超越了以往模型。
English: The abstract introduces WavReward, a novel audio-based reward model designed to evaluate both the intelligence and emotional quotient of spoken dialogue systems by leveraging audio language models and a new dataset, ChatReward-30K, significantly outperforming previous models in accuracy and subjective tests.
Authors:Yujin Kim, Nathaniel Chin, Arnav Vasudev, Sanjiban Choudhury
Abstract:
We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is information asymmetry: the student cannot directly access the teacher's state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher -- querying only when necessary and resetting from recovery states -- to stay on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines when the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects where to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance. The project website is available at: https://portal-cornell.github.io/CritiQ_ReTRy/
Authors:Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji
Abstract:
Transformer-based tabular foundation models have recently demonstrated promising in-context learning (ICL) performance on structured data, emerging as competitive alternatives to gradient-boosted trees. However, the fairness implications of this new paradigm remain largely unexplored. We present the first investigation of fairness in tabular ICL, evaluating three recently proposed foundation models -- TabPFNv2, TabICL, and TabDPT -- on multiple benchmark datasets. To mitigate biases, we explore three pre-processing fairness-enhancing methods: correlation removal (decorrelating input features from the sensitive attribute), group-balanced sample selection (ensuring equal representation of protected groups in context examples), and uncertainty-based sample selection (prioritizing context examples with high sensitive-attribute prediction uncertainty). Our experiments show that the uncertainty-based strategy consistently improves group fairness metrics (e.g., demographic parity, equalized odds, and equal opportunity) with minimal impact on predictive accuracy. We release our code to facilitate reproducibility (https://github.com/patrikken/Fair-TabICL)
中文摘要:本研究开创性地探讨了表格上下文学习中的公平性问题,发现基于不确定性的样本选择策略能在保持预测精度的同时,显著提升三个基础模型的群体公平性指标。
English Summary: This study pioneers the examination of fairness in tabular in-context learning, revealing that uncertainty-based sample selection effectively enhances group fairness metrics while preserving predictive accuracy across three foundation models.
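Of the three pre-processing strategies, the uncertainty-based one lends itself to a short sketch: fit an auxiliary classifier for the sensitive attribute and keep the candidate context rows it is least certain about. The auxiliary model and the entropy-based uncertainty measure below are illustrative assumptions, not necessarily the paper's exact procedure.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Uncertainty-based context selection sketch for tabular in-context learning.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                                   # candidate context features
s = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)      # toy sensitive attribute

aux = LogisticRegression().fit(X, s)                            # auxiliary sensitive-attribute model
p = aux.predict_proba(X)[:, 1]
uncertainty = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))  # binary entropy

k = 64                                                          # context budget for ICL
selected = np.argsort(-uncertainty)[:k]                         # keep the most uncertain rows
print(selected[:10])
```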
Authors:Fan Xu, Wuyang Chen, Wei Gao
Abstract:
Decision trees and forests have achieved success in various real applications, most of which assume that all testing classes are known at training time. In this work, we focus on learning with an augmented class via forests, where an augmented class may appear in testing data yet not in training data. We incorporate information about the augmented class into the trees' splitting: augmented Gini impurity, a new splitting criterion, is introduced to exploit some unlabeled data from the testing distribution. We then develop the Learning with Augmented Class via Forests (LACForest) approach, which constructs shallow forests according to the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests via an optimization objective based on our augmented Gini impurity, which essentially utilizes the representation power of neural networks for forests. Theoretically, we present a convergence analysis for our augmented Gini impurity, and we finally conduct experiments to evaluate our approaches. The code is available at https://github.com/nju-xuf/LACForest.
中文摘要:本文提出LACForest方法,通过引入增强基尼不纯度这一新分裂准则,将训练数据中未出现的增强类信息融入决策森林,结合浅层森林和深度神经森林提升模型性能。
English Summary: This paper introduces LACForest, a method that enhances decision forests by incorporating an augmented class not present in training data through a novel splitting criterion called augmented Gini impurity, utilizing both shallow and deep neural forests for improved performance.
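One plausible way to read "augmented Gini impurity" is a Gini impurity computed over the known classes plus an extra augmented-class mass estimated from unlabeled test-distribution data; the sketch below implements that reading, and the paper's exact criterion may differ.
```python
import numpy as np

# Illustrative Gini impurity with an appended augmented class of assumed mass theta.
def gini_with_augmented(labels: np.ndarray, n_known: int, theta: float) -> float:
    """labels: known-class indices (0..n_known-1) in a node;
    theta: assumed probability mass of the unseen augmented class in this node."""
    counts = np.bincount(labels, minlength=n_known).astype(float)
    probs = (1.0 - theta) * counts / counts.sum()      # rescale known-class mass
    probs = np.append(probs, theta)                    # append augmented-class mass
    return 1.0 - np.sum(probs ** 2)

labels = np.array([0, 0, 1, 1, 1, 2])
print(gini_with_augmented(labels, n_known=3, theta=0.2))
```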
Authors:Faruk Alpay
Abstract:
The Information Bottleneck (IB) method frequently suffers from unstable optimization, characterized by abrupt representation shifts near critical points of the IB trade-off parameter, beta. In this paper, I introduce a novel approach to achieve stable and convex IB optimization through symbolic continuation and entropy-regularized trajectories. I analytically prove convexity and uniqueness of the IB solution path when an entropy regularization term is included, and demonstrate how this stabilizes representation learning across a wide range of beta values. Additionally, I provide extensive sensitivity analyses around critical points (beta) with statistically robust uncertainty quantification (95% confidence intervals). The open-source implementation, experimental results, and reproducibility framework included in this work offer a clear path for practical deployment and future extension of my proposed method.
中文: 本文通过符号延拓和熵正则化方法,提出了一种稳定且凸的信息瓶颈优化方法,在β参数临界点附近保证了解路径的唯一性,并提供了具有统计鲁棒性的不确定性量化。
English: This paper introduces a stable and convex optimization method for the Information Bottleneck problem using symbolic continuation and entropy regularization, ensuring unique solution paths and robust uncertainty quantification across beta values.
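For orientation, the kind of objective at stake can be written as an entropy-regularized IB Lagrangian. The form below is a plausible sketch only: the choice of H(T|X) as the regularizer and its sign are assumptions, not necessarily the paper's exact formulation.
```latex
% Plausible entropy-regularized IB objective (sketch; the exact form may differ)
\mathcal{L}_{\lambda}\bigl[p(t \mid x)\bigr]
  \;=\; I(X;T) \;-\; \beta\, I(T;Y) \;+\; \lambda\, H(T \mid X),
  \qquad \lambda > 0 .
```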
Authors:Owen Kwon, Abraham George, Alison Bartsch, Amir Barati Farimani
Abstract:
Real robots are expected to repeat the same behavior in new environments with very little new data, yet modern controllers either incur heavy per-step inference or require deployment-time fine-tuning. We propose RT-Cache, a training-free retrieval-as-control pipeline that caches diverse image-action trajectories in a unified vector memory and, at test time, embeds the current frame to retrieve and replay multi-step snippets, replacing per-step model calls. A hierarchical search keeps lookups sub-second at million scale, shifting cost from compute to storage and enabling real-time control on modest GPUs. Across real-robot tasks and large open logs, RT-Cache achieves higher success and lower completion time than strong retrieval baselines (approximately x2 higher success and ~30% faster in our settings), and a single-episode anchoring study shows immediate adaptation to a more complex, contact-rich task without fine-tuning. RT-Cache turns experience into an append-only memory, offering a simple, scalable path to few-shot deployment today and a foundation for multimodal keys and optional integration with high-level policies. Project page: https://rt-cache.github.io/.
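The retrieval-as-control loop can be reduced to a toy sketch: embed the current frame, look up the nearest cached trajectory, and replay its multi-step action snippet instead of calling a model per step. The embedding dimension, snippet length, and flat cosine search below abstract away RT-Cache's hierarchical index and are illustrative only.
```python
import numpy as np

# Toy retrieval-as-control cache: nearest-neighbor lookup over frame embeddings,
# returning a multi-step action snippet to replay.
class TrajectoryCache:
    def __init__(self):
        self.keys, self.snippets = [], []                 # frame embeddings, action snippets

    def add(self, key: np.ndarray, actions: np.ndarray):
        self.keys.append(key / np.linalg.norm(key))
        self.snippets.append(actions)

    def retrieve(self, query: np.ndarray) -> np.ndarray:
        q = query / np.linalg.norm(query)
        sims = np.array([k @ q for k in self.keys])       # cosine similarity to cached keys
        return self.snippets[int(np.argmax(sims))]        # replay instead of per-step inference

cache = TrajectoryCache()
cache.add(np.random.randn(128), np.zeros((8, 7)))         # 8-step snippet of 7-DoF actions
cache.add(np.random.randn(128), np.ones((8, 7)))
snippet = cache.retrieve(np.random.randn(128))
print(snippet.shape)
```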
Authors:Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji
Abstract:
In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLM training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.
Chinese: PRIOR是一种新颖的视觉语言预训练方法,通过基于纯文本参考模型的重要性分数对图像相关标记进行差异化损失加权,有效减少大型视觉语言模型的幻觉现象,相比标准的下一个标记预测方法实现了显著的性能提升。
English: PRIOR is a novel vision-language pre-training method that reduces hallucination in large vision-language models by prioritizing image-related tokens through differential loss weighting based on importance scores from a text-only reference model, achieving significant performance improvements over standard next-token prediction.
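The reweighting idea admits a compact sketch: weight each token's next-token loss by how unlikely the token is under the text-only reference model. The specific mapping from reference probability to weight (1 - p_ref, renormalized) is an illustrative assumption, not PRIOR's published formula.
```python
import torch
import torch.nn.functional as F

# Reference-model-based token reweighting of a next-token-prediction loss (sketch).
def reweighted_ntp_loss(lvlm_logits, ref_logits, targets):
    """lvlm_logits, ref_logits: (B, T, V); targets: (B, T) token ids."""
    ce = F.cross_entropy(lvlm_logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    with torch.no_grad():
        p_ref = F.softmax(ref_logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        w = 1.0 - p_ref                       # hard to predict without the image => larger weight
        w = w / (w.mean() + 1e-8)             # keep the loss scale comparable to plain NTP
    return (w * ce).mean()

B, T, V = 2, 5, 100
loss = reweighted_ntp_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                           torch.randint(0, V, (B, T)))
print(loss.item())
```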
Authors:Yancheng Wang, Nebojsa Jojic, Yingzhen Yang
Abstract:
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at https://github.com/Statistical-Deep-Learning/DCS-Attention.
中文: 本文提出的DCS-Attention模块通过可微分通道选择机制,基于信息瓶颈原理优化特征提取,在行人重识别任务中实现了最优性能。
English: The paper introduces the DCS-Attention module, which differentially selects informative channels based on the Information Bottleneck principle to enhance feature extraction and achieve state-of-the-art performance in person Re-ID tasks.
Authors:Ippokratis Koukoulis, Ilias Syrigos, Thanasis Korakis
Abstract:
As the digital landscape becomes more interconnected, the frequency and severity of zero-day attacks have significantly increased, leading to an urgent need for innovative Intrusion Detection Systems (IDS). Machine Learning-based IDS that learn from network traffic characteristics and can discern attack patterns from benign traffic offer an advanced solution to traditional signature-based IDS. However, they heavily rely on labeled datasets, and their ability to generalize when encountering unseen traffic patterns remains a challenge. This paper proposes a novel self-supervised contrastive learning approach based on transformer encoders, specifically tailored for generalizable intrusion detection on raw packet sequences. Our proposed learning scheme employs a packet-level data augmentation strategy combined with a transformer-based architecture to extract and generate meaningful representations of traffic flows. Unlike traditional methods reliant on handcrafted statistical features (NetFlow), our approach automatically learns comprehensive packet sequence representations, significantly enhancing performance in anomaly identification tasks and supervised learning for intrusion detection. Our transformer-based framework exhibits better performance in comparison to existing NetFlow self-supervised methods. Specifically, we achieve up to a 3% higher AUC in anomaly detection for intra-dataset evaluation and up to 20% higher AUC scores in inter-dataset evaluation. Moreover, our model provides a strong baseline for supervised intrusion detection with limited labeled data, exhibiting an improvement over self-supervised NetFlow models of up to 1.5% AUC when pretrained and evaluated on the same dataset. Additionally, we show the adaptability of our pretrained model when fine-tuned across different datasets, demonstrating strong performance even when lacking benign data from the target domain.
中文: 本文提出了一种基于Transformer编码器的自监督对比学习方法,通过自动学习数据包序列表示来提升入侵检测性能,在异常检测和监督学习任务中均优于传统方法。
English: This paper introduces a self-supervised contrastive learning method using transformer encoders to enhance intrusion detection by automatically learning packet sequence representations, achieving superior performance in both anomaly detection and supervised learning tasks compared to traditional approaches.
Authors:Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
Abstract:
Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers require expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling -- all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM-generated solvers, examining their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.
中文摘要:CodePDE首次提出通过大语言模型生成代码来求解偏微分方程的推理框架,无需特定任务调优即实现超越人类的表现,同时具备自主推理与调试能力。
English Summary: CodePDE introduces a novel framework that uses large language models to generate PDE solvers through code generation, achieving superior performance without task-specific training while enabling reasoning and self-improvement capabilities.
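To make the inference-time loop concrete, here is a hedged sketch of the generate-run-refine pattern the abstract describes; llm(prompt) and evaluate(code) are assumed placeholder interfaces, not CodePDE's actual API.

    def generate_solver(llm, problem_spec, evaluate, max_rounds=4):
        # Ask an LLM for solver code, run it, and feed the numerical error back
        # for self-refinement. Illustrative sketch only.
        prompt = f"Write a Python solver for this PDE problem:\n{problem_spec}"
        best_code, best_err = None, float("inf")
        for _ in range(max_rounds):
            code = llm(prompt)
            err, log = evaluate(code)          # run the candidate solver, measure error
            if err < best_err:
                best_code, best_err = code, err
            prompt += f"\n\nYour previous solver had error {err:.3e}.\nLog:\n{log}\nPlease improve it."
        return best_code, best_err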
Authors:Fanyu Meng, Ziwen Kan, Shahbaz Rezaei, Zhaodan Kong, Xin Chen, Xin Liu
Abstract:
Explainability in time series models is crucial for fostering trust, facilitating debugging, and ensuring interpretability in real-world applications. In this work, we introduce Implet, a novel post-hoc explainer that generates accurate and concise subsequence-level explanations for time series models. Our approach identifies critical temporal segments that significantly contribute to the model's predictions, providing enhanced interpretability beyond traditional feature-attribution methods. Based on it, we propose a cohort-based (group-level) explanation framework designed to further improve the conciseness and interpretability of our explanations. We evaluate Implet on several standard time-series classification benchmarks, demonstrating its effectiveness in improving interpretability. The code is available at https://github.com/LbzSteven/implet
中文: 本文提出Implet,一种后置解释器,通过识别关键时间片段为时间序列模型生成精确的子序列级解释,并引入基于群体的解释框架以增强可解释性,在标准基准测试中验证了其有效性。
English: This paper presents Implet, a post-hoc explainer that generates precise subsequence-level explanations for time series models by identifying key temporal segments and offers a cohort-based framework to enhance interpretability, validated on standard benchmarks.
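As a point of reference, the occlusion-style baseline below scores temporal segments by how much masking them changes the model output; it conveys the idea of subsequence-level attribution but is not Implet's algorithm.

    import numpy as np

    def segment_attribution(model_fn, x, window=16, stride=8, fill=0.0):
        # Score segments of a 1-D series x by the output change when the segment
        # is replaced with a constant. model_fn maps a 1-D array to a scalar.
        base = model_fn(x)
        scores = []
        for start in range(0, len(x) - window + 1, stride):
            x_masked = x.copy()
            x_masked[start:start + window] = fill
            scores.append((start, abs(base - model_fn(x_masked))))
        return scores  # larger score = segment matters more to the prediction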
Authors:Abdolmehdi Behroozi, Chaopeng Shen, and Daniel Kifer
Abstract:
Parametric differential equations of the form du/dt = f(u, x, t, p) are fundamental in science and engineering. While deep learning frameworks such as the Fourier Neural Operator (FNO) can efficiently approximate solutions, they struggle with inverse problems, sensitivity estimation (du/dp), and concept drift. We address these limitations by introducing a sensitivity-based regularization strategy, called Sensitivity-Constrained Fourier Neural Operators (SC-FNO). SC-FNO achieves high accuracy in predicting solution paths and consistently outperforms standard FNO and FNO with physics-informed regularization. It improves performance in parameter inversion tasks, scales to high-dimensional parameter spaces (tested with up to 82 parameters), and reduces both data and training requirements. These gains are achieved with a modest increase in training time (30% to 130% per epoch) and generalize across various types of differential equations and neural operators. Code and selected experiments are available at: https://github.com/AMBehroozi/SC_Neural_Operators
Chinese: SC-FNO 通过引入基于敏感性的正则化策略,解决了 FNO 在逆问题和敏感性估计方面的不足,实现了更高的精度、可扩展性及更低的数据需求,同时训练时间仅适度增加。
English: SC-FNO introduces sensitivity-based regularization to overcome FNO's limitations in inverse problems and sensitivity estimation, achieving higher accuracy, scalability, and reduced data needs with a moderate training time increase.
Authors:Shan Zhao, Zhitong Xiong, Jie Zhao, Xiao Xiang Zhu
Abstract:
Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over extreme values. To explore the reliability of FM in the context of extreme events, we introduce \textbf{ExE}Bench (\textbf{Ex}treme \textbf{E}arth Benchmark), a collection of seven extreme event categories across floods, wildfires, storms, tropical cyclones, extreme precipitation, heatwaves, and cold waves. The dataset features global coverage, varying data volumes, and diverse data sources with different spatial, temporal, and spectral characteristics. To broaden the real-world impact of FMs, we include multiple challenging ML tasks that are closely aligned with operational needs in extreme events detection, monitoring, and forecasting. ExEBench aims to (1) assess FM generalizability across diverse, high-impact tasks and domains, (2) promote the development of novel ML methods that benefit disaster management, and (3) offer a platform for analyzing the interactions and cascading effects of extreme events to advance our understanding of Earth system, especially under the climate change expected in the decades to come. The dataset and code are public https://github.com/zhaoshan2/EarthExtreme-Bench.
中文: 摘要介绍了ExEBench基准数据集,用于评估基础模型在极端事件管理中的表现,旨在提升灾害应对能力并深化对气候变化下地球系统的理解。
English: The abstract introduces ExEBench, a benchmark dataset for evaluating foundation models' performance in managing extreme events, aiming to enhance disaster management and understanding of Earth systems under climate change.
Authors:Huiyun Jiang, Zhuang Yang
Abstract:
Recent studies have shown the great potential of diffusion models in improving reinforcement learning (RL) by modeling complex policies, expressing a high degree of multi-modality, and efficiently handling high-dimensional continuous control tasks. However, there is currently limited research on how to optimize diffusion-based policies (e.g., Diffusion Policy) quickly and stably. In this paper, we propose Adam-based Diffusion Policy Optimization (ADPO), a fast algorithmic framework containing best practices for fine-tuning diffusion-based policies in robotic control tasks using adaptive gradient descent methods in RL. Adaptive gradient methods are less studied in RL training, let alone for diffusion-based policies. We confirm that ADPO outperforms other diffusion-based RL methods in terms of overall effectiveness for fine-tuning on standard robotic tasks. Concretely, we conduct extensive experiments on standard robotic control tasks to test ADPO, in which six popular diffusion-based RL methods serve as benchmark methods. Experimental results show that ADPO achieves better or comparable performance than the baseline methods. Finally, we systematically analyze the sensitivity of multiple hyperparameters in standard robotics tasks, providing guidance for subsequent practical applications. Our video demonstrations are released at https://github.com/Timeless-lab/ADPO.git.
中文摘要:本文提出ADPO框架,利用自适应梯度方法优化扩散策略,在机器人控制任务中展现出优于现有方法的性能,并通过系统实验验证了其有效性。
English Summary: This paper introduces ADPO, an Adam-based framework that efficiently fine-tunes diffusion policies for reinforcement learning in robotics, demonstrating superior performance over existing methods through extensive experiments.
Authors:Haodong Zhao, Peng Peng, Chiyu Chen, Linqing Huang, Gongshen Liu
Abstract:
Remote sensing (RS) images are usually produced at an unprecedented scale, yet they are geographically and institutionally distributed, making centralized model training challenging due to data-sharing restrictions and privacy concerns. Federated learning (FL) offers a solution by enabling collaborative model training across decentralized RS data sources without exposing raw data. However, realistic federated datasets and benchmarks are lacking in RS. Prior works typically rely on manually partitioned single datasets, which fail to capture the heterogeneity and scale of real-world RS data, and often use inconsistent experimental setups, hindering fair comparison. To address this gap, we propose a realistic federated RS dataset, termed FedRS. FedRS consists of eight datasets that cover various sensors and resolutions and builds 135 clients, which is representative of realistic operational scenarios. Data for each client come from the same source, exhibiting authentic federated properties such as skewed label distributions, imbalanced client data volumes, and domain heterogeneity across clients. These characteristics reflect practical challenges in federated RS and support evaluation of FL methods at scale. Based on FedRS, we implement 10 baseline FL algorithms and evaluation metrics to construct the comprehensive FedRS-Bench. The experimental results demonstrate that FL can consistently improve model performance over training on isolated data silos, while revealing performance trade-offs of different methods under varying client heterogeneity and availability conditions. We hope FedRS-Bench will accelerate research on large-scale, realistic FL in RS by providing a standardized, rich testbed and facilitating fair comparisons across future works. The source codes and dataset are available at https://fedrs-bench.github.io/.
Authors:Wangxuan Fan, Siqi Li, Doudou Zhou, Yohei Okada, Chuan Hong, Molei Liu, Nan Liu
Abstract:
Explainable artificial intelligence (XAI) is essential for trustworthy machine learning (ML), particularly in high-stakes domains such as healthcare and finance. Shapley value (SV) methods provide a principled framework for feature attribution in complex models but incur high computational costs, limiting their scalability in high-dimensional settings. We propose Stochastic Iterative Momentum for Shapley Value Approximation (SIM-Shapley), a stable and efficient SV approximation method inspired by stochastic optimization. We analyze variance theoretically, prove linear $Q$-convergence, and demonstrate improved empirical stability and low bias in practice on real-world datasets. In our numerical experiments, SIM-Shapley reduces computation time by up to 85% relative to state-of-the-art baselines while maintaining comparable feature attribution quality. Beyond feature attribution, our stochastic mini-batch iterative framework extends naturally to a broader class of sample average approximation problems, offering a new avenue for improving computational efficiency with stability guarantees. Code is publicly available at https://github.com/nliulab/SIM-Shapley.
中文摘要:SIM-Shapley是一种高效的沙普利值近似方法,能在保持特征归因质量的同时将计算时间减少高达85%,其随机优化框架可扩展至更广泛的样本平均近似问题,并具有理论收敛性和稳定性保证。
English Summary: SIM-Shapley is an efficient Shapley value approximation method that reduces computation time by up to 85% while maintaining attribution quality, extending to broader sample average approximation problems with proven convergence and stability.
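A rough sketch of the general recipe follows: permutation-sampling Shapley estimates computed in mini-batches and smoothed with a momentum term. This reflects the stochastic-optimization framing of the abstract but is not the exact SIM-Shapley update.

    import numpy as np

    def shapley_momentum(value_fn, n_features, n_iters=200, batch=8, beta=0.9, seed=0):
        # value_fn(coalition) returns the model value for a list of feature indices.
        # Each iteration averages marginal contributions over a few random
        # permutations, then folds them into a momentum-smoothed estimate.
        rng = np.random.default_rng(seed)
        phi = np.zeros(n_features)
        for _ in range(n_iters):
            est = np.zeros(n_features)
            for _ in range(batch):
                perm = rng.permutation(n_features)
                prev, coalition = value_fn([]), []
                for j in perm:
                    coalition.append(int(j))
                    cur = value_fn(coalition)
                    est[j] += cur - prev       # marginal contribution of feature j
                    prev = cur
            phi = beta * phi + (1.0 - beta) * est / batch
        return phi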
Authors:Xiannan Huang, Shuhan Qiu
Abstract:
Time series forecasting is critical for many applications, where deep learning-based point prediction models have demonstrated strong performance. However, in practical scenarios, there is also a need to quantify predictive uncertainty through online confidence intervals. Existing confidence interval modeling approaches building upon these deep point prediction models suffer from key limitations: they either require costly retraining, fail to fully leverage the representational strengths of deep models, or lack theoretical guarantees. To address these gaps, we propose a lightweight conformal prediction method that provides valid coverage and shorter interval lengths without retraining. Our approach leverages features extracted from pre-trained point prediction models to fit a residual predictor and construct confidence intervals, further enhanced by an adaptive coverage control mechanism. Theoretically, we prove that our method achieves asymptotic coverage convergence, with error bounds dependent on the feature quality of the underlying point prediction model. Experiments on 12 datasets demonstrate that our method delivers tighter confidence intervals while maintaining desired coverage rates. Code, model, and dataset are available at https://github.com/xiannanhuang/FFDCI.
中文摘要:本文提出一种轻量级共形预测方法,无需重新训练即可利用预训练模型特征生成有效且更短的置信区间,在12个数据集上验证了其优越性能。
English Summary: This paper introduces a lightweight conformal prediction method that generates valid confidence intervals with shorter lengths without retraining, leveraging features from pre-trained models and demonstrating superior performance across 12 datasets.
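For background, the snippet below shows plain split conformal prediction on top of a frozen point forecaster; the paper's method additionally fits a residual predictor on pre-trained features and adapts coverage online, both of which are omitted here.

    import numpy as np

    def split_conformal_interval(point_pred_cal, y_cal, point_pred_test, alpha=0.1):
        # Calibrate on held-out absolute residuals, then widen test predictions by
        # the finite-sample-corrected (1 - alpha) empirical quantile.
        residuals = np.abs(np.asarray(y_cal) - np.asarray(point_pred_cal))
        n = len(residuals)
        q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        q = np.quantile(residuals, q_level, method="higher")
        return point_pred_test - q, point_pred_test + q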
Authors:Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
Abstract:
Large Language Models (LLMs) have seen rapid advancements in recent years, with models like ChatGPT and DeepSeek showcasing remarkable capabilities across diverse domains. While substantial research has been conducted on LLMs in various fields, a comprehensive review focusing on their integration with Computer-Aided Design (CAD) remains notably absent. CAD is the industry standard for 3D modeling and plays a vital role in the design and development of products across different industries. As the complexity of modern designs increases, the potential for LLMs to enhance and streamline CAD workflows presents an exciting frontier. This article presents the first systematic survey exploring the intersection of LLMs and CAD. We begin by outlining the industrial significance of CAD, highlighting the need for AI-driven innovation. Next, we provide a detailed overview of the foundations of LLMs. We also examine both closed-source and publicly available models. The core of this review focuses on the various applications of LLMs in CAD, providing a taxonomy of six key areas where these models are making considerable impact. Finally, we propose several promising future directions for further advancements, which offer vast opportunities for innovation and are poised to shape the future of CAD technology. Github: https://github.com/lichengzhanguom/LLMs-CAD-Survey-Taxonomy
中文摘要:本文首次系统综述了大语言模型与计算机辅助设计的融合应用,梳理了六大关键应用领域并提出了未来研究方向。
English Summary: This article presents the first systematic survey exploring how Large Language Models can enhance Computer-Aided Design workflows, examining six key application areas and proposing future research directions.
Authors:Luu Tung Hai, Thinh D. Le, Zhicheng Ding, Qing Tian, Truong-Son Hy
Abstract:
Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results on the NuScenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16x reduction in model size and a nearly 1.9x decrease in inference time compared to its teacher model. Notably, on NuScenes, our method achieves state-of-the-art performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is publicly available at: https://github.com/HySonLab/PointDistill
中文: 本研究提出了一种新颖的蒸馏框架,利用拓扑感知表示和梯度引导对齐将知识从高容量教师模型迁移至轻量级学生模型,在多个数据集上实现了具有竞争力的性能,同时显著减小了模型规模并缩短了推理时间。
English: This study presents a novel distillation framework using topology-aware representations and gradient-guided alignment to transfer knowledge from a high-capacity teacher to a lightweight student model, achieving competitive performance with significantly reduced model size and inference time across multiple datasets.
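As context, a generic distillation objective (temperature-scaled KL on the logits plus an L2 feature-alignment term) is sketched below; it omits the topology-aware representation and gradient-guided components that are specific to this work.

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                     labels, T=4.0, alpha=0.5, beta=0.1):
        # Cross-entropy on labels + KL to the softened teacher + feature alignment.
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)
        feat = F.mse_loss(student_feat, teacher_feat)
        return ce + alpha * kd + beta * feat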
Authors:Alexandre Cotorobai, Jorge Miguel Silva, Jose Luis Oliveira
Abstract:
Privacy and regulatory barriers often hinder centralized machine learning solutions, particularly in sectors like healthcare where data cannot be freely shared. Federated learning has emerged as a powerful paradigm to address these concerns; however, existing frameworks primarily support gradient-based models, leaving a gap for more interpretable, tree-based approaches. This paper introduces a federated learning framework for Random Forest classifiers that preserves data privacy and provides robust performance in distributed settings. By leveraging PySyft for secure, privacy-aware computation, our method enables multiple institutions to collaboratively train Random Forest models on locally stored data without exposing sensitive information. The framework supports weighted model averaging to account for varying data distributions, incremental learning to progressively refine models, and local evaluation to assess performance across heterogeneous datasets. Experiments on two real-world healthcare benchmarks demonstrate that the federated approach maintains competitive predictive accuracy - within a maximum 9% margin of centralized methods - while satisfying stringent privacy requirements. These findings underscore the viability of tree-based federated learning for scenarios where data cannot be centralized due to regulatory, competitive, or technical constraints. The proposed solution addresses a notable gap in existing federated learning libraries, offering an adaptable tool for secure distributed machine learning tasks that demand both transparency and reliable performance. The tool is available at https://github.com/ieeta-pt/fed_rf.
English Summary: This paper presents a federated learning framework for Random Forest classifiers that enables secure collaborative training across distributed healthcare data, maintaining competitive accuracy within 9% of centralized methods while addressing the existing gap in tree-based federated approaches.
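A minimal sketch of the weighted-averaging step is given below, assuming each client forest exposes a scikit-learn-style predict_proba; the actual framework exchanges models via PySyft and additionally supports incremental learning and local evaluation.

    import numpy as np

    def federated_predict(client_forests, client_weights, X):
        # Weighted soft-vote over locally trained forests; weights typically
        # reflect client data volumes. Illustrative aggregation only.
        w = np.asarray(client_weights, dtype=float)
        w = w / w.sum()
        proba = sum(wi * f.predict_proba(X) for wi, f in zip(w, client_forests))
        return proba.argmax(axis=1)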
Authors:Héber H. Arcolezi, Mina Alishahi, Adda-Akram Bendoukha, Nesrine Kaaniche
Abstract:
Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to four orders of magnitude. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
Chinese: 本研究系统评估了匿名化技术对机器学习公平性的影响,发现群体公平性可能显著下降,而个体公平性因数据同质性增强常得到改善,揭示了隐私、公平与实用性之间的关键权衡。
English: This study systematically examines how anonymization techniques impact machine learning fairness, revealing that while group fairness can degrade significantly, individual fairness often improves due to increased data homogeneity, highlighting critical trade-offs between privacy, fairness, and utility.
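For concreteness, one group-level quantity such an audit typically tracks is the statistical (demographic) parity difference, sketched below; the paper evaluates several group and individual fairness metrics beyond this one.

    import numpy as np

    def statistical_parity_difference(y_pred, group):
        # Difference in positive-prediction rates between two groups (0/1 arrays).
        y_pred, group = np.asarray(y_pred), np.asarray(group)
        return y_pred[group == 0].mean() - y_pred[group == 1].mean()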
Authors:Qian Xu, Lei Zhang, Yixiao Liu
Abstract:
Connected Autonomous Vehicles (CAVs) operate in dynamic, open, and multi-domain networks, rendering them vulnerable to various threats. Trust Management Systems (TMS) systematically organize the essential steps of the trust mechanism, identifying malicious nodes to counter internal and external threats and ensuring reliable decision-making for cooperative tasks. Recent advances in machine learning (ML) offer significant potential to enhance TMS, especially under the strict requirements of CAVs, such as nodes moving at varying speeds and opportunistic, intermittent network behavior. These features distinguish ML-based TMS for CAVs from those for social networks, static IoT, and Social IoT. This survey proposes a novel three-layer ML-based TMS framework for CAVs in the vehicle-road-cloud integration system, i.e., the trust data layer, the trust calculation layer, and the trust incentive layer. A six-dimensional taxonomy of objectives is proposed. Furthermore, the principles of the ML methods for each module in each layer are analyzed. Recent studies are then categorized by traffic scenario and mapped against the proposed objectives. Finally, future directions are suggested, addressing open issues and research trends. We maintain an active repository containing up-to-date literature and open-source projects at https://github.com/octoberzzzzz/ML-based-TMS-CAV-Survey.
中文: 本综述为互联自动驾驶车辆提出了一种新颖的三层机器学习信任管理系统框架,分析了目标、原理及近期研究,并指出了未来方向和维护了活跃资源库。
English: This survey introduces a novel three-layer machine learning-based Trust Management System framework for Connected Autonomous Vehicles, analyzing objectives, principles, and recent studies while providing future directions and an active repository.
Authors:HsiaoYuan Hsu, Yuxin Peng
Abstract:
In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on a given image. With limited training data, existing work has focused on image-centric enhancement. However, this neglects the diversity of layouts and fails to cope with shape-variant elements or diverse design intents in generalized settings. To this end, we propose a layout-centric approach that leverages the layout knowledge implicit in large language models (LLMs) to create posters for omnifarious purposes, hence the name PosterO. Specifically, it structures layouts from datasets as trees in the SVG language through universal shape, design intent vectorization, and hierarchical node representation. It then applies LLMs during inference to predict new layout trees by in-context learning with intent-aligned example selection. After layout trees are generated, we can seamlessly realize them as poster designs by editing the chat with LLMs. Extensive experimental results demonstrate that PosterO can generate visually appealing layouts for given images, achieving new state-of-the-art performance across various benchmarks. To further explore PosterO's abilities under generalized settings, we built PStylish7, the first dataset with multi-purpose posters and various-shaped elements, offering a challenging testbed for advanced research.
Authors:Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, Lior Wolf, James Glass, Leonid Karlinsky, Raja Giryes
Abstract:
A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, they still underutilize long contexts. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and be effective for many long-context tasks: on LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results on the challenging LongBench v2 benchmark, showing competitive performance with equivalent-size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance even in tasks that presumably require cross-context relations.
Chinese: 研究发现,尽管经过扩展训练,循环大语言模型仍未充分利用长上下文,但采用基于分块的推理方法仅处理相关输入部分,可显著提升长上下文任务性能并达到最先进水平。
English: The study finds that recurrent large language models underutilize long contexts despite extended training, but a chunk-based inference method that processes only relevant input portions significantly boosts performance on long-context tasks and achieves state-of-the-art results.
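A hedged sketch of the single-chunk strategy is shown below; embed is an assumed text-embedding function returning 1-D vectors, and the paper's actual chunk-scoring procedure may differ.

    import numpy as np

    def select_relevant_chunk(chunks, query, embed):
        # Keep only the chunk most similar to the query and process just that
        # portion of the long input.
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        q = embed(query)
        sims = [cos(embed(c), q) for c in chunks]
        return chunks[int(np.argmax(sims))]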
Authors:Paul Primus, Florian Schmid, Gerhard Widmer
Abstract:
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models - particularly, if they are expected to produce frame-level embeddings - can benefit from a stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.
中文: 本研究通过引入带有时序分段描述的数据集和逐帧对比训练方法,增强了音频与文本的时序对齐能力,相比仅使用全局描述的模型,在时序任务上表现更优。
English: The study enhances audio-text alignment by introducing a dataset with temporally segmented descriptions and a frame-wise contrastive training method, improving performance in temporal tasks compared to global caption models.
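For reference, a bare-bones symmetric InfoNCE objective over matched (frame, text) embedding pairs is shown below; it is a generic CLAP-style loss rather than the paper's exact frame-wise training recipe.

    import torch
    import torch.nn.functional as F

    def framewise_contrastive_loss(frame_emb, text_emb, temperature=0.07):
        # frame_emb, text_emb: (N, D) embeddings of N matched frame/text pairs.
        frame_emb = F.normalize(frame_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = frame_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))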
Authors:LLM-Core Xiaomi, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, Kai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
Abstract:
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
中文: MiMo-7B 是一款专为推理任务优化的语言模型,通过改进的预训练和强化学习方法,在数学、编程及通用推理任务上表现卓越,性能超越更大规模模型及 OpenAI o1-mini。
English: MiMo-7B is a reasoning-optimized language model that demonstrates exceptional performance across mathematics, coding, and general reasoning tasks, surpassing larger models and OpenAI o1-mini through enhanced pre-training and reinforcement learning techniques.
Authors:Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub
Abstract:
The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) was proposed to compute the advantage for each output by subtracting the mean reward of all outputs in the group as the baseline. However, this can lead to high variance when the reward advantage is inaccurately predicted. In this work, we propose the Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) model, which uses lightweight Kalman filtering to dynamically estimate the latent reward baseline and its uncertainty. This filtering technique replaces the naive group mean, enabling more adaptive advantage normalization. Our method requires no additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate the multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through accuracy and reward results on math question answering and reasoning tasks, we show that the more adaptive advantage estimation of KRPO improves the stability and performance of GRPO. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
中文:提出的KRPO模型通过轻量级卡尔曼滤波器动态估计奖励基线和不确定性,改进了GRPO的优势归一化和策略优化稳定性,且无需额外参数。
English: The proposed KRPO model enhances GRPO by using a lightweight Kalman filter to dynamically estimate the reward baseline and uncertainty, improving advantage normalization and policy optimization stability without extra parameters.
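A minimal sketch of the idea follows: a scalar Kalman filter tracks the latent reward baseline and its uncertainty, and advantages are computed against it instead of the raw group mean. The random-walk state model and noise parameters here are assumptions, not the paper's exact formulation.

    import numpy as np

    class ScalarKalmanBaseline:
        # Track a latent reward baseline b and its variance p with a scalar
        # Kalman filter (random-walk state, noisy reward observations).
        def __init__(self, q=1e-3, r=1e-1):
            self.b, self.p = 0.0, 1.0
            self.q, self.r = q, r   # process and observation noise variances

        def update(self, reward):
            p_pred = self.p + self.q                  # predict
            k = p_pred / (p_pred + self.r)            # Kalman gain
            self.b = self.b + k * (reward - self.b)   # correct with the new reward
            self.p = (1.0 - k) * p_pred
            return self.b, self.p

    def advantages(rewards, kf):
        adv = []
        for r in rewards:
            b, p = kf.update(r)
            adv.append((r - b) / np.sqrt(p + 1e-8))   # uncertainty-scaled advantage
        return np.array(adv)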
Authors:Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, Dimosthenis Karatzas
Abstract:
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are \textit{contextually sufficient} while remaining \textit{representation-efficient}. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
中文: DocVXQA是一种新型视觉自解释文档问答框架,不仅能生成准确答案,还能通过视觉热图突出关键区域,提供上下文充分且表征高效的解释,从而在保证性能的同时增强用户信任。
English: DocVXQA is a visually self-explainable document question answering framework that generates accurate answers while learning visual heatmaps to highlight critical regions, ensuring contextually sufficient and representation-efficient explanations for enhanced user trust and balanced performance.
Authors:Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
Abstract:
Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by $\sim 90\%$ while maintaining superior performance. Code is available at https://github.com/deng-ai-lab/SDO.
中文: 本文提出了一种名为“快捷扩散优化”(SDO)的高效计算方法,通过在生成过程中仅保留单步计算图来优化扩散模型,在保持性能的同时将计算成本降低约90%。
English: This paper introduces Shortcut Diffusion Optimization (SDO), a computationally efficient method that optimizes diffusion models by retaining only one step's computational graph during generation, reducing costs by approximately 90% while maintaining performance.
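The snippet below sketches one way to realize the one-step-gradient shortcut for latent optimization: sample without building a graph, keep the graph of only the final denoising step, and apply the resulting gradient to the initial latent. denoise_step(z, t) is an assumed single-step denoiser, and the actual SDO update may differ in detail.

    import torch

    def shortcut_optimize(denoise_step, latent, metric_fn, n_steps=50, lr=0.05, iters=10):
        # Optimize a differentiable metric of the generated sample while
        # backpropagating through only the last denoising step.
        latent = latent.detach().clone()
        for _ in range(iters):
            with torch.no_grad():                     # no graph for the early steps
                z = latent
                for t in reversed(range(1, n_steps)):
                    z = denoise_step(z, t)
            z = z.detach().requires_grad_(True)
            sample = denoise_step(z, 0)               # graph kept for this step only
            loss = metric_fn(sample)
            grad, = torch.autograd.grad(loss, z)
            latent = latent - lr * grad               # shortcut: apply the gradient to the latent
        return latent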
Authors:Peng Sun, Yi Jiang, Tao Lin
Abstract:
Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.
中文: 本文提出了一个统一框架来训练、采样和分析连续生成模型,在减少采样步骤的同时,为多步和少步方法均实现了最先进的性能。
English: This paper introduces a unified framework for training, sampling, and analyzing continuous generative models, achieving state-of-the-art performance with fewer steps across both multi-step and few-step approaches.
Authors:Wenhao Hu, Paul Henderson, José Cano
Abstract:
Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning.
In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at https://github.com/gicLAB/ICE-Pruning
Chinese: ICE-Pruning是一种创新的迭代剪枝流程,通过优化微调过程在保持模型精度的同时显著加速深度神经网络压缩,最高可实现9.61倍的加速效果。
English: ICE-Pruning is an innovative iterative pipeline that accelerates deep neural network compression by optimizing fine-tuning processes while preserving model accuracy, achieving up to 9.61x speedup.
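For context, a bare-bones iterative magnitude-pruning loop is sketched below; ICE-Pruning's automatic fine-tuning trigger, freezing strategy, and pruning-aware learning-rate scheduler would sit on top of a loop like this.

    import torch
    import torch.nn.utils.prune as prune

    def iterative_prune(model, train_one_epoch, amounts=(0.2, 0.4, 0.6), finetune_epochs=1):
        # Globally prune Conv2d/Linear weights by L1 magnitude in increasing
        # amounts, fine-tuning after each pruning step.
        params = [(m, "weight") for m in model.modules()
                  if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
        for amount in amounts:
            prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                                      amount=amount)
            for _ in range(finetune_epochs):
                train_one_epoch(model)   # user-supplied fine-tuning step
        return model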
Authors:Zhiye Xie, Enmei Tu, Xianping Fu, Guoliang Yuan, Yi Han
Abstract:
With the increasing demands for safety, efficiency, and sustainability in global shipping, Automatic Identification System (AIS) data plays an increasingly important role in maritime monitoring. AIS data contains spatial-temporal variation patterns of vessels that hold significant research value in the marine domain. However, due to its massive scale, the full potential of AIS data has long remained untapped. With its powerful sequence modeling capabilities, particularly its ability to capture long-range dependencies and complex temporal dynamics, the Transformer model has emerged as an effective tool for processing AIS data. Therefore, this paper reviews the research on Transformer-based AIS data-driven maritime monitoring, providing a comprehensive overview of the current applications of Transformer models in the marine field. The focus is on Transformer-based trajectory prediction methods, behavior detection, and prediction techniques. Additionally, this paper collects and organizes publicly available AIS datasets from the reviewed papers, performing data filtering, cleaning, and statistical analysis. The statistical results reveal the operational characteristics of different vessel types, providing data support for further research on maritime monitoring tasks. Finally, we offer valuable suggestions for future research, identifying two promising research directions. Datasets are available at https://github.com/eyesofworld/Maritime-Monitoring.
中文: 本文综述了基于Transformer模型的AIS数据在海事监测中的应用,重点关注轨迹预测和行为检测,并整理了公开数据集以支持后续研究。
English: This paper reviews Transformer-based AIS data applications in maritime monitoring, focusing on trajectory prediction and behavior detection while organizing public datasets to support further research.
Authors:Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi
Abstract:
Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe's training leads to a consistent estimator. Unlike most prior methods, that employ non-robust gradient descent based search during inference, GenRe simply performs a forward sampling over the generative model to produce minimum cost recourse, leading to superior performance across multiple metrics. We also demonstrate GenRe provides the best trade-off between cost, plausibility and validity, compared to state-of-art baselines. Our code is available at: https://github.com/prateekgargx/genre.
中文摘要:GenRe是一种生成式救济模型,通过联合训练邻近性、合理性和有效性目标,利用前向采样生成最低成本救济方案,在多项指标上优于现有方法。
English Summary: GenRe is a generative recourse model that jointly trains for proximity, plausibility, and validity to provide superior recourse recommendations through forward sampling, outperforming existing methods across multiple metrics.
Authors:Keyue Qiu, Yuxuan Song, Zhehuan Fan, Peidong Liu, Zhe Zhang, Mingyue Zheng, Hao Zhou, Wei-Ying Ma
Abstract:
Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models are faced with challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multi-modalities -- continuous 3D positions and discrete 2D topologies -- which jointly determine molecular geometries. By establishing the fact that noise schedules decide the Variational Lower Bound (VLB) for the twisted probability path, we propose VLB-Optimal Scheduling (VOS) strategy in this under-explored area, which optimizes VLB as a path integral for SBDD. Our model effectively enhances molecular geometries and interaction modeling, achieving state-of-the-art PoseBusters passing rate of 95.9% on CrossDock, more than 10% improvement upon strong baselines, while maintaining high affinities and robust intramolecular validity evaluated on held-out test set. Code is available at https://github.com/AlgoMole/MolCRAFT.
中文摘要:本研究提出的VLB最优调度策略通过优化噪声调度方案,解决了基于结构的药物设计中几何建模的多模态概率路径扭曲问题,在CrossDock数据集上实现了95.9%的最优姿态通过率,性能提升超过10%。
English Summary: The proposed VLB-Optimal Scheduling (VOS) strategy addresses geometric modeling challenges in Structure-Based Drug Design by optimizing noise schedules, achieving a state-of-the-art 95.9% PoseBusters passing rate with significant performance improvements.
Authors:Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
Abstract:
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions of the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at https://github.com/javiersc1/ALPCAH.
中文: 本文提出了ALPCAH方法,通过估计样本噪声方差来改进异构数据的子空间基估计,无需分布假设或已知噪声方差,并开发了更快的矩阵分解版本LR-ALPCAH。
English: This paper introduces ALPCAH, a subspace learning method that estimates sample-wise noise variances to enhance subspace basis estimation for heterogeneous data without requiring distributional assumptions or known noise variances, with its faster matrix-factorized version LR-ALPCAH also presented.
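The alternating scheme below conveys the heteroscedastic idea (fit a subspace, re-estimate per-sample noise variances from residuals, re-weight, repeat); it is a crude illustration rather than ALPCAH's soft-rank algorithm.

    import numpy as np

    def heteroscedastic_pca(Y, k, n_iters=20, eps=1e-6):
        # Y: (d, n) data matrix with per-column (per-sample) noise levels.
        d, n = Y.shape
        var = np.ones(n)
        for _ in range(n_iters):
            W = Y / np.sqrt(var + eps)                     # down-weight noisy samples
            U, _, _ = np.linalg.svd(W, full_matrices=False)
            U = U[:, :k]                                   # current subspace basis
            resid = Y - U @ (U.T @ Y)
            var = (resid ** 2).sum(axis=0) / max(d - k, 1) # per-sample noise variance
        return U, var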
Authors:Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
Abstract:
The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of the distributional robustness of RMs on unseen data. First, we show that the excessive dispersion of hidden-state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce a zero-centered reward sum per batch, constraining rewards with extreme magnitudes. We assess the impact of BSR on RM robustness in four over-optimization scenarios, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy with the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. With RLOO training using 8B RMs, the generation length on AlpacaEval 2.0 drops by 40% while the win rate increases by 7%, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
中文: Bradley-Terry模型在人类反馈强化学习中易因隐藏状态范数过度分散导致过优化,而提出的批处理零和正则化通过约束奖励极值增强了奖励模型的分布鲁棒性,显著提升了策略与黄金偏好模型的对齐效果。
English: The Bradley-Terry model in RLHF suffers from over-optimization due to excessive dispersion of hidden states, but the proposed batch-wise sum-to-zero regularization enhances reward model robustness and improves policy alignment in unseen data scenarios.
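In code, the batch-wise sum-to-zero idea amounts to adding a penalty on the batch reward sum to the standard Bradley-Terry loss, as sketched below; the exact regularization form and weight used in the paper may differ.

    import torch
    import torch.nn.functional as F

    def bt_loss_with_bsr(r_chosen, r_rejected, lam=1e-2):
        # Bradley-Terry reward-model loss plus a penalty that pulls the batch
        # reward sum toward zero, discouraging extreme reward magnitudes.
        bt = -F.logsigmoid(r_chosen - r_rejected).mean()
        batch_sum = torch.cat([r_chosen, r_rejected]).sum()
        return bt + lam * batch_sum.pow(2)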
Authors:Prithwish Dan, Kushal Kedia, Angela Chao, Edward Weiyi Duan, Maximus Adrian Pace, Wei-Chiu Ma, Sanjiban Choudhury
Abstract:
Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.
Authors:Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song
Abstract:
Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.
Chinese: GuidedQuant是一种新颖的量化方法,通过整合末端损失的梯度信息并保持权重间的依赖关系,在各种量化类型中均优于现有技术,提升了模型性能。
English: GuidedQuant is a novel quantization method that enhances model performance by incorporating gradient information from the end loss and maintaining cross-weight dependencies, outperforming existing techniques across various quantization types.
Authors:Bidur Khanal, Sandesh Pokhrel, Sanjay Bhandari, Ramesh Rana, Nikesh Shrestha, Ram Bahadur Gurung, Cristian Linte, Angus Watson, Yash Raj Shrestha, Binod Bhattarai
Abstract:
Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports, and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLM is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: https://github.com/bhattarailab/Hallucination-Aware-VLM.
中文: 视觉语言模型在医学影像应用中潜力巨大,但存在幻觉生成问题;本研究通过构建胃肠道数据集并采用幻觉感知微调方法,有效提升了诊断报告的准确性。
English: Vision-Language Models (VLMs) show promise in medical imaging but suffer from hallucination issues, which are addressed in this study through a curated gastrointestinal dataset and a novel hallucination-aware fine-tuning approach that improves diagnostic report accuracy.
Authors:Lishan Yang, Wei Emma Zhang, Quan Z. Sheng, Lina Yao, Weitong Chen, Ali Shakeri
Abstract:
In the era of big data, data mining has become indispensable for uncovering hidden patterns and insights from vast and complex datasets. The integration of multimodal data sources further enhances its potential. Multimodal Federated Learning (MFL) is a distributed approach that enhances the efficiency and quality of multimodal learning, ensuring collaborative work and privacy protection. However, missing modalities pose a significant challenge in MFL, often due to data quality issues or privacy policies across the clients. In this work, we present MMiC, a framework for Mitigating Modality incompleteness in MFL within the Clusters. MMiC replaces partial parameters within client models inside clusters to mitigate the impact of missing modalities. Furthermore, it leverages the Banzhaf Power Index to optimize client selection under these conditions. Finally, MMiC employs an innovative approach to dynamically control global aggregation by utilizing Markovitz Portfolio Optimization. Extensive experiments demonstrate that MMiC consistently outperforms existing federated learning architectures in both global and personalized performance on multimodal datasets with missing modalities, confirming the effectiveness of our proposed solution. Our code is available at https://github.com/gotobcn8/MMiC.
中文摘要:MMiC框架通过参数替换、优化客户端选择和动态全局聚合控制,有效解决了多模态联邦学习中的模态缺失问题,其性能显著优于现有方法。
English Summary: The MMiC framework effectively addresses missing modality challenges in Multimodal Federated Learning by implementing parameter replacement, optimized client selection, and dynamic global aggregation control, demonstrating superior performance over existing methods.
Authors:Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
Abstract:
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter.
中文摘要:研究发现,即使在良性数据集上微调大型语言模型也会显著增加其输出的危害性,而一种利用异常样本的新攻击方法严重破坏了多种模型的安全对齐,且现有防御措施大多无效。
English Summary: Fine-tuning large language models on even benign datasets can dangerously increase their harmfulness, and a new attack method using outlier samples severely compromises safety alignment across various models, with most existing defenses proving ineffective.
Authors:Ammar Daskin
Abstract:
In this paper, we discuss how quantum recurrent neural networks (RNNs) and their enhanced version, long short-term memory (LSTM) networks, can be modeled using the core ideas presented in Ref.[1], where the entangling and disentangling power of unitary transformations is investigated. In particular, we interpret entangling and disentangling power as information retention and forgetting mechanisms in LSTMs. Therefore, entanglement becomes a key component of the optimization (training) process. We believe that, by leveraging prior knowledge of the entangling power of unitaries, the proposed quantum-classical framework can guide and help to design better-parameterized quantum circuits for various real-world applications.
中文: 本文通过将幺正变换的纠缠与解纠缠能力解释为信息保留与遗忘机制,将量子循环神经网络及其长短期记忆网络建模,使纠缠成为优化训练过程的核心,以指导设计更优参数化量子电路。
English: This paper models quantum RNNs and LSTMs by interpreting the entangling and disentangling power of unitaries as information retention and forgetting mechanisms, making entanglement central to the training process for designing optimized quantum circuits.
Authors:Shalin Anand Jain, Jiazhen Liu, Siva Kailas, Harish Ravichandar
Abstract:
Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot RL (MRRL) policies with realistic robot dynamics and safety constraints, supporting parallelization and hardware acceleration. Our generalizable learning interface integrates easily with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation. Our code is available at https://github.com/GT-STAR-Lab/JaxRobotarium.
中文: JaxRobotarium 是一个基于 Jax 的平台,支持多机器人强化学习的并行化训练与部署,具备真实动力学和安全约束,实现了显著加速,并提供标准化场景和仿真到现实的评估流程。
English: JaxRobotarium is a Jax-powered platform that enables fast, parallelized training and deployment of multi-robot reinforcement learning policies with realistic dynamics and safety constraints, achieving significant speedups and providing standardized scenarios and sim-to-real evaluation.
Authors:Youcef Djenouri, Nassim Belmecheri, Tomasz Michalak, Jan DubiÅski, Ahmed Nabil Belbachir, Anis Yazidi
Abstract:
Diffusion-based generative models have significantly advanced text-to-image synthesis, demonstrating impressive text comprehension and zero-shot generalization. These models refine images from random noise based on textual prompts, with initial reliance on text input shifting towards enhanced visual fidelity over time. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD (Learning Graph Representation of Agent Diffusers), a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks. LGR-AD models the generation process as a distributed system of interacting agents, each representing an expert sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach employs a coordination mechanism based on top-$k$ maximum spanning trees, optimizing the generation process. Each agent's decision-making is guided by a meta-model that minimizes a novel loss function, balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR-AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: https://github.com/YousIA/LGR_AD
中文摘要:LGR-AD提出了一种基于图神经网络的多智能体系统,通过动态协调专家子模型优化扩散模型的图像生成过程,在保持准确性与多样性的同时显著超越传统方法。
English Summary: LGR-AD introduces a multi-agent system using graph neural networks to dynamically optimize diffusion-based image generation, outperforming traditional models by balancing accuracy and diversity through collaborative agent coordination.
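As an illustration of the top-k maximum-spanning-tree coordination described in the abstract above, the following hedged sketch builds a maximum spanning tree over the strongest agents with networkx. The pairwise affinity matrix, the top-k criterion, and the variable names are assumptions for demonstration only.

```python
import networkx as nx
import numpy as np

def coordination_tree(agent_scores: np.ndarray, k: int) -> nx.Graph:
    """Maximum spanning tree over the k best-scoring agents.

    agent_scores: shape (A, A); symmetric pairwise affinity/performance scores.
    """
    totals = agent_scores.sum(axis=1)           # illustrative per-agent strength
    keep = np.argsort(totals)[-k:]              # keep the k strongest agents

    G = nx.Graph()
    for i in keep:
        for j in keep:
            if i < j:
                G.add_edge(int(i), int(j), weight=float(agent_scores[i, j]))
    return nx.maximum_spanning_tree(G, weight="weight")

# Example: 5 agents with random symmetric affinities, keep the 3 strongest.
rng = np.random.default_rng(0)
S = rng.random((5, 5)); S = (S + S.T) / 2
tree = coordination_tree(S, k=3)
print(sorted(tree.edges(data="weight")))
```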
Authors:Maxim Vashkevich, Egor Krivalcevich
Abstract:
The paper presents a learned two-dimensional separable transform (LST) that can be considered as a new type of computational layer for constructing neural network (NN) architectures for image recognition tasks. The LST is based on the idea of sharing the weights of one fully connected (FC) layer to process all rows of an image. After that, a second shared FC layer is used to process all columns of the image representation obtained from the first layer. The use of LST layers in an NN architecture significantly reduces the number of model parameters compared to models that use stacked FC layers. We show that an NN classifier based on a single LST layer followed by an FC layer achieves 98.02\% accuracy on the MNIST dataset, while having only 9.5k parameters. We also implemented an LST-based classifier for handwritten digit recognition on the FPGA platform to demonstrate the efficiency of the suggested approach for designing compact and high-performance implementations of NN models. Git repository with supplementary materials: https://github.com/Mak-Sim/LST-2d
中文: 该论文提出一种学习型二维可分离变换(LST)作为神经网络层,通过共享全连接层权重处理图像行列,在MNIST数据集上以仅9.5k参数实现98.02%准确率,并展示了在FPGA平台的高效部署。
English: This paper introduces a learned two-dimensional separable transform (LST) as a neural network layer that reduces parameters while maintaining high accuracy, achieving 98.02% on MNIST with only 9.5k parameters and demonstrating efficient FPGA implementation.
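The shared row-then-column FC idea translates almost directly into code. Below is a minimal PyTorch sketch of such a layer followed by an FC head; the hidden width, activation, and exact classifier configuration are assumptions (the paper's 9.5k-parameter model may differ).

```python
import torch
import torch.nn as nn

class LST2d(nn.Module):
    """Learned 2D separable transform: one FC shared over rows, a second shared over columns."""
    def __init__(self, height: int, width: int, hidden: int = 32):
        super().__init__()
        self.row_fc = nn.Linear(width, hidden)    # shared across all rows
        self.col_fc = nn.Linear(height, hidden)   # shared across all columns

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width)
        y = torch.tanh(self.row_fc(x))                   # (batch, height, hidden)
        y = torch.tanh(self.col_fc(y.transpose(1, 2)))   # (batch, hidden, hidden)
        return y.flatten(1)

# Illustrative MNIST-style classifier: one LST layer followed by an FC head.
model = nn.Sequential(LST2d(28, 28, hidden=32), nn.Linear(32 * 32, 10))
logits = model(torch.randn(4, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```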
Authors:Woosang Lim, Zekun Li, Gyuwan Kim, Sungyoung Ji, HyeonJung Kim, Kyuri Choi, Jin Hyuk Lim, Kyungpyo Park, William Yang Wang
Abstract:
Long-context large language models (LC LLMs) combined with retrieval-augmented generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained windows, and fragmented information from suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical RAG framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through real-time chunk- and document-level expansions. By initiating with finest-level retrieval and progressively incorporating broader, higher-level context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm MacRAG consistently surpasses baseline RAG pipelines in single- and multi-step generation using Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.
中文: MacRAG提出了一种分层检索增强生成框架,通过自适应融合从粗到细的文档粒度来优化精度和覆盖范围,在多种大语言模型的多跳推理任务中持续超越基线模型。
English: MacRAG introduces a hierarchical RAG framework that adaptively merges coarse-to-fine document granularities to optimize precision and coverage, consistently outperforming baseline models in multi-hop reasoning tasks across various LLMs.
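To convey the finest-level-retrieve-then-expand idea from the abstract above, here is a hedged pure-Python sketch. The chunk hierarchy, the similarity function, and the token-budget logic are illustrative assumptions, not MacRAG's actual pipeline.

```python
def build_context(query_emb, fine_chunks, embed_sim, parent_of, token_budget):
    """Retrieve finest-level chunks, then expand to coarser parents until the budget is hit.

    fine_chunks: list of dicts like {"id": ..., "text": ..., "emb": ..., "tokens": int}
    parent_of:   maps a fine chunk id to its coarser parent chunk (same dict format)
    embed_sim:   similarity function between two embeddings (e.g. cosine)
    """
    ranked = sorted(fine_chunks, key=lambda c: embed_sim(query_emb, c["emb"]), reverse=True)

    context, used, seen_parents = [], 0, set()
    for chunk in ranked:
        if used + chunk["tokens"] > token_budget:
            break
        context.append(chunk["text"])
        used += chunk["tokens"]

        # Chunk-level expansion: pull in the coarser parent for broader coverage.
        parent = parent_of.get(chunk["id"])
        if parent and parent["id"] not in seen_parents and used + parent["tokens"] <= token_budget:
            context.append(parent["text"])
            used += parent["tokens"]
            seen_parents.add(parent["id"])
    return "\n\n".join(context)
```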
Authors:Ummay Maria Muna, Fahim Hafiz, Shanta Biswas, Riasat Azim
Abstract:
Small nucleolar RNAs (snoRNAs) are increasingly recognized for their critical role in the pathogenesis and characterization of various human diseases. Consequently, the precise identification of snoRNA-disease associations (SDAs) is essential for understanding disease progression and advancing treatment strategies. However, conventional biological experimental approaches are costly, time-consuming, and resource-intensive; therefore, machine learning-based computational methods offer a promising solution to mitigate these limitations. This paper proposes a model called 'GBDTSVM', a novel and efficient machine learning approach for predicting snoRNA-disease associations by leveraging a Gradient Boosting Decision Tree (GBDT) and a Support Vector Machine (SVM). GBDTSVM extracts integrated snoRNA-disease feature representations using GBDT, and an SVM is subsequently used to classify and identify potential associations. Furthermore, the method enhances the accuracy of these predictions by incorporating Gaussian kernel profile similarity for both snoRNAs and diseases. Experimental evaluation of the GBDTSVM model demonstrated superior performance compared to state-of-the-art methods in the field, achieving an area under the receiver operating characteristic curve (AUROC) of 0.96 and an area under the precision-recall curve (AUPRC) of 0.95 on the MDRF dataset. Moreover, our model shows superior performance on two additional datasets, LSGT and PsnoD. Additionally, a case study on the predicted snoRNA-disease associations verified the top 10 predicted snoRNAs across nine prevalent diseases, further validating the efficacy of the GBDTSVM approach. These results underscore the model's potential as a robust tool for advancing snoRNA-related disease research. Source code and datasets for our proposed framework can be obtained from: https://github.com/mariamuna04/gbdtsvm
中文: 本文提出GBDTSVM模型,通过结合梯度提升决策树与支持向量机来精准预测snoRNA与疾病关联,在多个数据集上展现出卓越性能(AUROC达0.96,AUPRC达0.95),显著优于现有方法。
English: This paper introduces GBDTSVM, a novel machine learning model that combines Gradient Boosting Decision Tree and Support Vector Machine to accurately predict snoRNA-disease associations, demonstrating superior performance with AUROC of 0.96 and AUPRC of 0.95 compared to existing methods.
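One plausible reading of the GBDT-then-SVM pipeline above is to use GBDT leaf indices as learned features feeding an RBF-kernel SVM. The sketch below follows that reading with scikit-learn; the feature construction, hyperparameters, and binary association labels are assumptions rather than the paper's exact design.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

def fit_gbdt_svm(X_train, y_train):
    """GBDT -> leaf-index features -> RBF-kernel SVM (illustrative pipeline)."""
    gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X_train, y_train)
    leaves = gbdt.apply(X_train)[:, :, 0]          # (n_samples, n_trees) leaf ids (binary task)
    enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
    svm = SVC(kernel="rbf", probability=True).fit(enc.transform(leaves), y_train)
    return gbdt, enc, svm

def predict_association_scores(gbdt, enc, svm, X):
    leaves = gbdt.apply(X)[:, :, 0]
    return svm.predict_proba(enc.transform(leaves))[:, 1]   # probability of association
```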
Authors:Haoyang Xie, Feng Ju
Abstract:
Computer-aided design (CAD) is fundamental to modern engineering and manufacturing, but creating CAD models still requires expert knowledge and specialized software. Recent advances in large language models (LLMs) open up the possibility of generative CAD, where natural language is directly translated into parametric 3D models. However, most existing methods generate task-specific command sequences that pretrained models cannot directly handle. These sequences must be converted into CAD representations such as CAD vectors before a 3D model can be produced, which requires training models from scratch and adds unnecessary complexity. To tackle this issue, we propose generating CadQuery code directly from text, leveraging the strengths of pretrained LLMs to produce 3D models without intermediate representations, using this Python-based scripting language. Since LLMs already excel at Python generation and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly effective. Given that these capabilities typically improve with scale, we hypothesize that larger models will perform better after fine-tuning. To enable this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We fine-tune six open-source LLMs of varying sizes and observe consistent improvements. Our best model achieves a top-1 exact match of 69.3%, up from 58.8%, and reduces Chamfer Distance by 48.6%. Project page: https://github.com/Text-to-CadQuery/Text-to-CadQuery.
Chinese: 本研究提出了一种利用预训练大语言模型直接从文本生成CadQuery代码的方法,省去中间表示步骤,在创建参数化3D模型方面实现了准确性和效率的显著提升。
English: This research introduces a method to generate CadQuery code directly from text using pretrained large language models, eliminating intermediate representations and achieving significant improvements in accuracy and efficiency for creating parametric 3D models.
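For readers unfamiliar with CadQuery, the target output of such a fine-tuned model is ordinary Python built on the CadQuery API. The prompt and dimensions below are made up, but the calls themselves are standard CadQuery usage.

```python
# Hypothetical prompt: "A 40 x 40 x 10 mm plate with a 5 mm through-hole in the center."
import cadquery as cq

result = (
    cq.Workplane("XY")
    .box(40, 40, 10)      # base plate
    .faces(">Z")          # select the top face
    .workplane()
    .hole(5)              # 5 mm diameter through-hole
)
cq.exporters.export(result, "plate.step")  # write a STEP file for downstream CAD tools
```

Because the output is executable code rather than a bespoke command sequence, exact-match and geometric metrics such as Chamfer Distance can be computed directly on the produced solids.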
Authors:Md Rakibul Hasan, Pouria Behnoudfar, Dan MacKinlay, Thomas Poulet
Abstract:
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super-Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional SR methods, even with limited training data (e.g., only 13% of training data is required to achieve performance similar to SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning by improving accuracy and efficiency, enhancing process understanding, and broadening applications to scientific research. We publicly release the complete source code of PC-SRGAN and all experiments at https://github.com/hasan-rakibul/PC-SRGAN.
中文:PC-SRGAN通过生成物理一致的图像革新了超分辨率技术,在极少训练数据下显著提升精度和效率,并推动科学机器学习应用的发展。
English: PC-SRGAN revolutionizes Super-Resolution by generating physically consistent images, significantly improving accuracy and efficiency with minimal training data while advancing scientific machine learning applications.
Authors:Donghao Ren, Fred Hohman, Halden Lin, Dominik Moritz
Abstract:
Embedding projections are popular for visualizing large datasets and models. However, people often encounter "friction" when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms -- including density-based clustering, and automated labeling -- to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas's feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.
Authors:Everest Yang, Ria Vasishtha, Luqman K. Dad, Lisa A. Kachnic, Andrew Hope, Eric Wang, Xiao Wu, Yading Yuan, David J. Brenner, Igor Shuryak
Abstract:
Causal machine learning (CML) enables individualized estimation of treatment effects, offering critical advantages over traditional correlation-based methods. However, existing approaches for medical survival data with censoring such as causal survival forests estimate effects at fixed time points, limiting their ability to capture dynamic changes over time. We introduce Causal Analysis for Survival Trajectories (CAST), a novel framework that models treatment effects as continuous functions of time following treatment. By combining parametric and non-parametric methods, CAST overcomes the limitations of discrete time-point analysis to estimate continuous effect trajectories. Using the RADCURE dataset [1] of 2,651 patients with head and neck squamous cell carcinoma (HNSCC) as a clinically relevant example, CAST models how chemotherapy and radiotherapy effects evolve over time at the population and individual levels. By capturing the temporal dynamics of treatment response, CAST reveals how treatment effects rise, peak, and decline over the follow-up period, helping clinicians determine when and for whom treatment benefits are maximized. This framework advances the application of CML to personalized care in HNSCC and other life-threatening medical conditions. Source code/data available at: https://github.com/CAST-FW/HNSCC
中文: CAST框架通过结合参数与非参数方法,将治疗效果建模为时间的连续函数,揭示了头颈癌治疗效应随时间的动态变化,从而克服了传统固定时间点分析的局限,推动了个性化医疗的发展。
English: CAST introduces a continuous time-based framework for estimating dynamic treatment effects in survival data, overcoming the limitations of fixed-time analyses by modeling how effects evolve, peak, and decline, as demonstrated in a head and neck cancer study.
Authors:Chathurangi Shyalika, Renjith Prasad, Fadi El Kalach, Revathy Venkataramanan, Ramtin Zand, Ramy Harik, Amit Sheth
Abstract:
In modern assembly pipelines, identifying anomalies is crucial in ensuring product quality and operational efficiency. Conventional single-modality methods fail to capture the intricate relationships required for precise anomaly prediction in complex predictive environments with abundant data and multiple modalities. This paper proposes a neurosymbolic AI and fusion-based approach for multimodal anomaly prediction in assembly pipelines. We introduce a time series and image-based fusion model that leverages decision-level fusion techniques. Our research builds upon three primary novel approaches in multimodal learning: time series and image-based decision-level fusion modeling, transfer learning for fusion, and knowledge-infused learning. We evaluate the novel method using our derived and publicly available multimodal dataset and conduct comprehensive ablation studies to assess the impact of our preprocessing techniques and fusion model compared to traditional baselines. The results demonstrate that a neurosymbolic AI-based fusion approach that uses transfer learning can effectively harness the complementary strengths of time series and image data, offering a robust and interpretable approach for anomaly prediction in assembly pipelines with enhanced performance. The datasets, codes to reproduce the results, supplementary materials, and demo are available at https://github.com/ChathurangiShyalika/NSF-MAP.
中文摘要:本文提出了一种基于神经符号人工智能和融合的方法,用于装配流水线中的多模态异常预测,通过决策级融合和迁移学习结合时间序列与图像数据,以提升性能与可解释性。
English Summary: This paper introduces a neurosymbolic AI and fusion-based method for multimodal anomaly prediction in assembly pipelines, combining time series and image data through decision-level fusion and transfer learning to improve performance and interpretability.
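Decision-level fusion itself is simple to illustrate: each modality produces its own class probabilities, which are then combined. The PyTorch-style sketch below assumes generic per-modality models and a fixed mixing weight; the knowledge-infusion and transfer-learning components of the paper are not shown.

```python
import torch

@torch.no_grad()
def decision_level_fusion(ts_model, img_model, ts_batch, img_batch, alpha=0.5):
    """Fuse per-modality predictions at the decision level.

    alpha weights the time-series branch; (1 - alpha) weights the image branch.
    """
    p_ts = torch.softmax(ts_model(ts_batch), dim=-1)      # (B, num_classes)
    p_img = torch.softmax(img_model(img_batch), dim=-1)   # (B, num_classes)
    p_fused = alpha * p_ts + (1.0 - alpha) * p_img
    return p_fused.argmax(dim=-1), p_fused
```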
Authors:Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu
Abstract:
Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in https://github.com/zch65458525/L2T.
Chinese: 本文提出了一种新颖框架,将图学习与大语言模型相结合,无需额外训练或任务特定提示即可显著提升模型在多任务中的推理灵活性和性能。
English: This paper introduces a novel framework that integrates graph learning with Large Language Models (LLMs) to enhance their reasoning flexibility and performance across various tasks without requiring extra training or task-specific prompts.
Authors:Wei Xiong, Junming Lin, Jiangtong Li, Jie Li, Changjun Jiang
Abstract:
While foundation models excel in text, image, and video domains, the critical biological signals, particularly electroencephalography (EEG), remain underexplored. EEG benefits neurological research with its high temporal resolution, operational practicality, and safety profile. However, low signal-to-noise ratio, inter-subject variability, and cross-paradigm differences hinder the generalization of current models. Existing methods often employ simplified strategies, such as a single loss function or a channel-temporal joint representation module, and suffer from a domain gap between pretraining and evaluation tasks that compromises efficiency and adaptability. To address these limitations, we propose the Adaptive Large Foundation model for EEG signal representation (ALFEE) framework, a novel hybrid transformer architecture with two learning stages for robust EEG representation learning. ALFEE employs a hybrid attention that separates channel-wise feature aggregation from temporal dynamics modeling, enabling robust EEG representation with variable channel configurations. A channel encoder adaptively compresses variable channel information, a temporal encoder captures task-guided evolution, and a hybrid decoder reconstructs signals in both temporal and frequency domains. During pretraining, ALFEE optimizes task prediction, channel and temporal mask reconstruction, and temporal forecasting to enhance multi-scale and multi-channel representation. During fine-tuning, a full-model adaptation with a task-specific token dictionary and a cross-attention layer boosts performance across multiple tasks. After 25,000 hours of pretraining, extensive experimental results on six downstream EEG tasks demonstrate the superior performance of ALFEE over existing models. Our ALFEE framework establishes a scalable foundation for biological signal analysis with implementation at https://github.com/xw1216/ALFEE.
Chinese: ALFEE框架采用混合Transformer架构,通过多阶段学习和混合注意力机制解决脑电信号的关键难题,在六项下游任务中经过大规模预训练后展现出卓越性能。
English: The ALFEE framework introduces a hybrid transformer architecture that overcomes EEG signal challenges through multi-stage learning and hybrid attention, achieving superior performance across six tasks after extensive pretraining.
Authors:Zixu Wang, Bingbing Xu, Yige Yuan, Huawei Shen, Xueqi Cheng
Abstract:
As an important graph pre-training method, Graph Contrastive Learning (GCL) continues to play a crucial role in the ongoing surge of research on graph foundation models or LLMs as enhancers for graphs. Traditional GCL optimizes InfoNCE by using augmentations to define self-supervised tasks, treating augmented pairs as positive samples and others as negative. However, this leads to semantically similar pairs being classified as negative, causing significant sampling bias and limiting performance. In this paper, we argue that GCL is essentially a Positive-Unlabeled (PU) learning problem, where the definition of self-supervised tasks should be semantically guided, i.e., augmented samples with similar semantics are considered positive, while others, with unknown semantics, are treated as unlabeled. From this perspective, the key lies in how to extract semantic information. To achieve this, we propose IFL-GCL, using InfoNCE as a "free lunch" to extract semantic information. Specifically, we first prove that under InfoNCE, the representation similarity of node pairs aligns with the probability that the corresponding contrastive sample is positive. Then we redefine the maximum likelihood objective based on the corrected samples, leading to a new InfoNCE loss function. Extensive experiments on both the graph pretraining framework and LLM as an enhancer show significant improvements of IFL-GCL in both IID and OOD scenarios, achieving up to a 9.05% improvement, validating the effectiveness of semantic guidance. Code for IFL-GCL is publicly available at: https://github.com/Camel-Prince/IFL-GCL.
中文: 本文提出IFL-GCL方法,将图对比学习重新定义为正例-未标记学习问题,通过利用InfoNCE提取语义信息并修正采样偏差,在独立同分布和非独立同分布场景下均实现了显著性能提升。
English: This paper reinterprets Graph Contrastive Learning (GCL) as a Positive-Unlabeled learning problem and proposes IFL-GCL, a method that leverages InfoNCE to extract semantic information and correct sampling bias, achieving significant performance improvements in both IID and OOD scenarios.
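A toy sketch of the semantically guided idea follows: pairs whose representation similarity is high are promoted from "unlabeled" to positive inside an InfoNCE-style loss. The hard similarity threshold is a crude stand-in for the paper's likelihood-based correction, and the hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def semantically_guided_infonce(z1, z2, tau=0.2, pos_threshold=0.8):
    """InfoNCE over two augmented views, with extra positives chosen by similarity.

    z1, z2: (N, d) node embeddings from two graph augmentations.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2N, d)
    n = z1.shape[0]
    cos = z @ z.t()                                       # (2N, 2N) cosine similarities
    sim = cos / tau

    # Standard positives: each node paired with its own augmentation.
    pos_mask = torch.zeros_like(sim, dtype=torch.bool)
    idx = torch.arange(n, device=z.device)
    pos_mask[idx, idx + n] = True
    pos_mask[idx + n, idx] = True

    # Semantic positives: high-similarity pairs promoted from "unlabeled" to positive.
    with torch.no_grad():
        pos_mask |= cos > pos_threshold
    pos_mask.fill_diagonal_(False)

    # Softmax denominator over all non-self pairs; average log-probability of positives.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    return -log_prob[pos_mask].mean()
```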
Authors:Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen
Abstract:
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for \textit{multiple} ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.
中文: 本文提出PARM,一种统一的自回归奖励模型,通过偏好感知双线性适配实现精确的多目标对齐,相比现有方法降低了推理成本并提升了性能。
English: This paper introduces PARM, a unified autoregressive reward model that uses preference-aware bilinear adaptation to achieve precise multi-objective alignment with reduced inference costs and improved performance compared to existing methods.
Authors:Junzhou Xu, Boyu Diao
Abstract:
As deep learning models expand, the pre-training-fine-tuning paradigm has become the standard approach for handling various downstream tasks. However, shared parameters can lead to diminished performance when dealing with complex datasets involving multiple tasks. While introducing Mixture-of-Experts (MoE) methods has alleviated this issue to some extent, it also significantly increases the number of parameters required for fine-tuning and training time, introducing greater parameter redundancy. To address these challenges, we propose LoRA-SMoE, a sensitivity-driven expert allocation method in LoRA-MoE for efficient fine-tuning that allocates expert numbers based on parameter sensitivity. This method rapidly assesses the sensitivity of different tasks to parameters by sampling a small amount of data and using gradient information. It then adaptively allocates expert numbers within a given budget. The process maintains comparable memory consumption to LoRA (Low-Rank Adaptation) while ensuring an efficient and resource-friendly fine-tuning procedure. Experimental results demonstrate that compared to SOTA fine-tuning methods, our LoRA-SMoE approach can enhance model performance while reducing the number of trainable parameters. This significantly improves model performance in resource-constrained environments. Additionally, due to its efficient parameter sensitivity evaluation mechanism, LoRA-SMoE requires minimal computational overhead to optimize expert allocation, making it particularly suitable for scenarios with limited computational resources. All the code in this study will be made publicly available following the acceptance of the paper for publication. Source code is at https://github.com/EMLS-ICTCAS/LoRA-SMoE
中文: 提出的LoRA-SMoE方法基于参数敏感度自适应分配专家数量,在提升模型性能的同时减少可训练参数,特别适合资源受限的环境。
English: The proposed LoRA-SMoE method adaptively allocates expert numbers based on parameter sensitivity to enhance model performance while reducing trainable parameters, making it particularly efficient for resource-constrained environments.
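The gradient-based sensitivity and budgeted allocation described above can be sketched as follows. The squared-gradient sensitivity measure, grouping by top-level module name, and the proportional rounding are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def gradient_sensitivity(model, loss_fn, sample_batch, module_names):
    """Rough per-module sensitivity: sum of squared gradients on one small batch."""
    model.zero_grad()
    loss_fn(model, sample_batch).backward()
    sens = {}
    for name, param in model.named_parameters():
        root = name.split(".")[0]                 # group by top-level module (illustrative)
        if root in module_names and param.grad is not None:
            sens[root] = sens.get(root, 0.0) + float(param.grad.pow(2).sum())
    return sens

def allocate_experts(sens, total_experts, min_per_module=1):
    """Split an expert budget across modules in proportion to sensitivity.

    Rounding may drift slightly from the exact budget; a real implementation
    would redistribute the remainder.
    """
    total = sum(sens.values()) or 1.0
    return {m: max(min_per_module, round(total_experts * s / total)) for m, s in sens.items()}
```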
Authors:Zhiyu Zhu, Jiayu Zhang, Zhibo Jin, Fang Chen, Jianlong Zhou
Abstract:
Attribution algorithms are essential for enhancing the interpretability and trustworthiness of deep learning models by identifying key features driving model decisions. Existing frameworks, such as InterpretDL and OmniXAI, integrate multiple attribution methods but suffer from scalability limitations, high coupling, theoretical constraints, and lack of user-friendly implementations, hindering neural network transparency and interoperability. To address these challenges, we propose Attribution-Based Explainability (ABE), a unified framework that formalizes Fundamental Attribution Methods and integrates state-of-the-art attribution algorithms while ensuring compliance with attribution axioms. ABE enables researchers to develop novel attribution techniques and enhances interpretability through four customizable modules: Robustness, Interpretability, Validation, and Data & Model. This framework provides a scalable, extensible foundation for advancing attribution-based explainability and fostering transparent AI systems. Our code is available at: https://github.com/LMBTough/ABE-XAI.
中文: 提出的基于归因的可解释性(ABE)框架统一了核心归因方法,通过可定制模块解决现有工具的可扩展性和易用性不足,推动透明人工智能系统的发展。
English: The proposed Attribution-Based Explainability (ABE) framework unifies fundamental attribution methods to overcome scalability and usability limitations in existing tools, offering modular components to advance transparent AI systems.
Authors:Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
Abstract:
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
中文: UniVLA提出了一种新框架,通过从视频中提取任务中心化的动作表示来学习跨具身的视觉-语言-动作策略,能以极少的计算资源和数据实现最先进的性能,并可部署到多种机器人平台。
English: UniVLA introduces a novel framework that learns cross-embodiment vision-language-action policies by deriving task-centric action representations from videos, enabling deployment across various robots with state-of-the-art results using significantly less computational resources and data.
Authors:Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas
Abstract:
In this technical report, we empirically investigate the relationship between linguistic fluency and domain knowledge in the context of continual learning with large language models (LLMs). Specifically, we enhance the linguistic fluency of the Gemma2 LLM for the Lithuanian language by autoregressively pretraining its full parameter set on the first 10\% of the Lithuanian language component of the CulturaX dataset. To prevent catastrophic forgetting of the model's existing domain knowledge, we apply Elastic Weight Consolidation (EWC), leveraging Fisher information estimated using data from the Massive Multitask Language Understanding (MMLU) benchmark. In the post-training evaluations, we assess linguistic fluency through perplexity and evaluate domain knowledge using accuracy on a suite of language understanding benchmarks, including ARC-Easy, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande, in both English and Lithuanian. The empirical results demonstrate that EWC not only mitigates catastrophic forgetting by preserving the model's performance in terms of both linguistic fluency and domain knowledge but also improves or maintains these capabilities for the newly added Lithuanian language. These findings highlight the potential for more efficient adaptation of general-purpose LLMs to under-represented languages without requiring access to the original training data. The accompanying codebase is openly accessible at https://github.com/Neurotechnology/LLM_EWC.
中文摘要:本研究表明,通过全参数预训练结合弹性权重巩固(EWC)方法,在提升Gemma2模型立陶宛语流畅度的同时有效防止了灾难性遗忘,实现了无需原始训练数据即可使通用大语言模型高效适应小语种的能力。
English Summary: This study demonstrates that Elastic Weight Consolidation (EWC) effectively prevents catastrophic forgetting while enhancing the Gemma2 model's Lithuanian language fluency through full-parameter pretraining, enabling efficient adaptation to underrepresented languages without original training data.
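The EWC mechanism referenced above is compact enough to sketch directly: a diagonal Fisher estimate on reference batches (MMLU data in the paper) and a quadratic penalty anchoring parameters during continued pretraining. The regularization strength and batch handling below are assumptions.

```python
import torch

def diagonal_fisher(model, loss_fn, batches):
    """Diagonal Fisher estimate: average squared gradients over reference batches."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(batches), 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, anchor_params, lam=1.0):
    """lambda/2 * sum_i F_i * (theta_i - theta*_i)^2, summed over anchored parameters."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - anchor_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# Training step (sketch): total_loss = lithuanian_lm_loss + ewc_penalty(model, fisher, anchor)
```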
Authors:Changkun Ye, Russell Tsuchida, Lars Petersson, Nick Barnes
Abstract:
Open set label shift (OSLS) occurs when label distributions change from a source to a target distribution, and the target distribution has an additional out-of-distribution (OOD) class. In this work, we build estimators for both source and target open set label distributions using a source domain in-distribution (ID) classifier and an ID/OOD classifier. With reasonable assumptions on the ID/OOD classifier, the estimators are assembled into a sequence of three stages: 1) an estimate of the source label distribution of the OOD class, 2) an EM algorithm for Maximum Likelihood estimates (MLE) of the target label distribution, and 3) an estimate of the target label distribution of OOD class under relaxed assumptions on the OOD classifier. The sampling errors of estimates in 1) and 3) are quantified with a concentration inequality. The estimation result allows us to correct the ID classifier trained on the source distribution to the target distribution without retraining. Experiments on a variety of open set label shift settings demonstrate the effectiveness of our model. Our code is available at https://github.com/ChangkunYe/OpenSetLabelShift.
中文: 本研究提出了一种三阶段估计方法,通过利用分类器在不重新训练的情况下调整源分布训练的模型以适应目标分布,有效解决了开放集标签偏移问题,并在多种实验场景中得到验证。
English: This work develops a three-stage estimation method to address open set label shift by leveraging classifiers to adjust source-trained models for target distributions without retraining, validated through diverse experimental settings.
Authors:Zhiyuan Chen, Keyi Li, Yifan Jia, Le Ye, Yufei Ma
Abstract:
Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computation complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method consistently achieves better performance than existing naive caching methods with a similar computation resource budget. When compared with 35-step DDIM, our method eliminates more than 45% computation and improves IS by 12 at the cost of less than 0.06 FID increase. Code is available at https://github.com/ccccczzy/icc.
中文: 本文提出了一种无需训练的扩散变换器加速方法——增量校准缓存,通过低秩近似和通道感知SVD在减少45%以上计算量的同时保持图像生成质量。
English: This paper introduces a training-free acceleration method for diffusion transformers called increment-calibrated caching, which uses low-rank approximation and channel-aware SVD to reduce computations by over 45% while maintaining image quality.
Authors:Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Abstract:
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
Chinese: MxMoE是一种混合精度优化框架,通过综合考虑算法敏感性和系统动态,有效解决了专家混合模型的部署难题,并在性能与效率上超越了现有方法。
English: MxMoE is a mixed-precision optimization framework that addresses deployment challenges in Mixture-of-Experts models by considering algorithmic sensitivity and system dynamics, achieving superior performance and efficiency over existing methods.
Authors:Amin Ghafourian, Andrew Lee, Dechen Gao, Tyler Beer, Kin Yen, Iman Soltani
Abstract:
Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometric and signal processing techniques, to automate surveying tasks. As a proof of concept, we apply this framework to automatically evaluate the compliance of curb ramps with the Americans with Disabilities Act (ADA), demonstrating the utility of point cloud data in survey automation. The method leverages a newly collected, large annotated dataset of curb ramps, made publicly available as part of this work, to facilitate robust model training and evaluation. Experimental results, including comparison with manual field measurements of several ramps, validate the accuracy and reliability of the proposed method, highlighting its potential to significantly reduce manual effort and improve consistency in infrastructure assessment. Beyond ADA compliance, the proposed framework lays the groundwork for broader applications in infrastructure surveying and automated construction evaluation, promoting wider adoption of point cloud data in these domains. The annotated database, manual ramp survey data, and developed algorithms are publicly available on the project's GitHub page: https://github.com/Soltanilara/SurveyAutomation.
中文: 本文提出了一种利用点云数据和深度学习的自动化框架,通过评估ADA路缘坡道合规性展示了其在基础设施检测中的高效性,显著减少了人工工作量并提高了评估一致性。
English: This paper introduces an automated framework using point cloud data and deep learning to efficiently assess infrastructure compliance, demonstrated through ADA curb ramp evaluations, which reduces manual effort and enhances assessment consistency.
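One measurement such a framework could automate is the running slope of a segmented ramp surface, checked against the commonly cited ADA 1:12 (about 8.33%) limit. The numpy sketch below assumes the segmentation has already isolated the ramp points; the threshold should be verified against the applicable standard.

```python
import numpy as np

ADA_MAX_RUNNING_SLOPE = 1.0 / 12.0   # ~8.33%, commonly cited ADA running-slope limit (verify per standard)

def running_slope(ramp_points: np.ndarray) -> float:
    """Least-squares plane fit z = a*x + b*y + c over segmented ramp points of shape (N, 3)."""
    x, y, z = ramp_points[:, 0], ramp_points[:, 1], ramp_points[:, 2]
    A = np.column_stack([x, y, np.ones_like(x)])
    (a, b, _), *_ = np.linalg.lstsq(A, z, rcond=None)
    return float(np.hypot(a, b))      # gradient magnitude of the fitted plane

def is_compliant(ramp_points: np.ndarray) -> bool:
    return running_slope(ramp_points) <= ADA_MAX_RUNNING_SLOPE
```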
Authors:Guilherme Vieira Neto, Marcos Eduardo Valle
Abstract:
EfficientNet models are convolutional neural networks optimized for parameter allocation by jointly balancing network width, depth, and resolution. Renowned for their exceptional accuracy, these models have become a standard for image classification tasks across diverse computer vision benchmarks. While traditional neural networks learn correlations between feature channels during training, vector-valued neural networks inherently process multidimensional data as coherent entities, treating inter-channel relationships as intrinsic. This paper introduces vector-valued EfficientNets (V-EfficientNets), a novel extension of EfficientNet designed to process arbitrary vector-valued data. The proposed models are evaluated on a medical image classification task, achieving an average accuracy of 99.46% on the ALL-IDB2 dataset for detecting acute lymphoblastic leukemia. V-EfficientNets demonstrate remarkable efficiency, significantly reducing parameters while outperforming state-of-the-art models, including the original EfficientNet. The source code is available at https://github.com/mevalle/v-nets.
中文摘要:V-EfficientNets将EfficientNet的参数优化扩展至向量值数据,在白血病检测中以更少参数实现99.46%准确率,超越现有最优模型。
English Summary: V-EfficientNets extend EfficientNet's parameter optimization to vector-valued data, achieving 99.46% accuracy in leukemia detection with fewer parameters than state-of-the-art models.
Authors:Zhongweiyang Xu, Xulin Fan, Zhong-Qiu Wang, Xilin Jiang, Romit Roy Choudhury
Abstract:
Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources, are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos are provided at: https://arraydps.github.io/ArrayDPSDemo/.
Chinese: ArrayDPS是一种无监督、阵列无关的生成方法,通过扩散后验采样近似房间声学来解决盲语音分离问题,在分离质量上超越了所有无监督基准方法,并可与监督方法相媲美。
English: ArrayDPS is an unsupervised, array-agnostic generative method that solves blind speech separation by approximating room acoustics through diffusion posterior sampling, outperforming unsupervised baselines and matching supervised methods in separation quality.
Authors:Tien Dang, Truong-Son Hy
Abstract:
Molecular interactions often involve high-order relationships that cannot be fully captured by traditional graph-based models limited to pairwise connections. Hypergraphs naturally extend graphs by enabling multi-way interactions, making them well-suited for modeling complex molecular systems. In this work, we introduce EquiHGNN, an Equivariant HyperGraph Neural Network framework that integrates symmetry-aware representations to improve molecular modeling. By enforcing the equivariance under relevant transformation groups, our approach preserves geometric and topological properties, leading to more robust and physically meaningful representations. We examine a range of equivariant architectures and demonstrate that integrating symmetry constraints leads to notable performance gains on large-scale molecular datasets. Experiments on both small and large molecules show that high-order interactions offer limited benefits for small molecules but consistently outperform 2D graphs on larger ones. Adding geometric features to these high-order structures further improves the performance, emphasizing the value of spatial information in molecular learning. Our source code is available at https://github.com/HySonLab/EquiHGNN/
中文摘要:EquiHGNN是一种等变超图神经网络,通过引入对称性约束和高阶相互作用来增强分子建模,在结合几何特征后对较大分子的处理表现尤为突出。
English Summary: EquiHGNN is an equivariant hypergraph neural network that leverages symmetry constraints and high-order interactions to enhance molecular modeling, showing significant improvements especially for larger molecules when geometric features are incorporated.
Authors:Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang
Abstract:
We introduce Griffin, the first attempt at a foundation model designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the data encoder and task decoder to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both single-table and RDB datasets, employing advanced encoders for categorical, numerical, and metadata features, along with innovative components such as cross-attention modules and enhanced message-passing neural networks (MPNNs) to capture the complexities of relational data. Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin demonstrates superior or comparable performance to individually trained models, excels in low-data scenarios, and shows strong transferability with similarity and diversity in pretraining across new datasets and tasks, highlighting its potential as a universally applicable foundation model for RDBs. Code available at https://github.com/yanxwb/Griffin.
Chinese: Griffin是首个专为关系数据库设计的基础模型,通过统一数据编码和任务解码机制及增强架构,在多样化任务中展现出卓越性能,并在跨数据集迁移学习中表现出强大的适应性。
English: Griffin is the first foundation model for relational databases, unifying data encoding and task decoding to handle diverse tasks with enhanced architecture and pretraining, demonstrating superior performance and strong transferability across various datasets.
Authors:Md Kamrujjaman Mobin, Md Saiful Islam, Sadik Al Barid, Md Masum
Abstract:
Electrocardiogram (ECG) classification is crucial for automated cardiac disease diagnosis, yet existing methods often struggle to capture local morphological details and long-range temporal dependencies simultaneously. To address these challenges, we propose Cardioformer, a novel multi-granularity hybrid model that integrates cross-channel patching, hierarchical residual learning, and a two-stage self-attention mechanism. Cardioformer first encodes multi-scale token embeddings to capture fine-grained local features and global contextual information and then selectively fuses these representations through intra- and inter-granularity self-attention. Extensive evaluations on three benchmark ECG datasets under subject-independent settings demonstrate that the model consistently outperforms four state-of-the-art baselines. Our Cardioformer model achieves an AUROC of 96.34$\pm$0.11, 89.99$\pm$0.12, and 95.59$\pm$1.66 on the MIMIC-IV, PTB-XL, and PTB datasets, respectively, outperforming PatchTST, Reformer, Transformer, and Medformer models. It also demonstrates strong cross-dataset generalization, achieving 49.18% AUROC on PTB and 68.41% on PTB-XL when trained on MIMIC-IV. These findings underscore the potential of Cardioformer to advance automated ECG analysis, paving the way for more accurate and robust cardiovascular disease diagnosis. We release the source code at https://github.com/KMobin555/Cardioformer.
Chinese: Cardioformer是一种创新的混合模型,能有效捕捉心电图信号的局部形态特征和长程时间依赖性,在多个数据集上持续超越现有最优基准模型,并展现出强大的跨数据集泛化能力。
English: Cardioformer is a novel hybrid model that effectively captures both local morphological details and long-range temporal dependencies in ECG signals, consistently outperforming state-of-the-art baselines across multiple datasets and demonstrating strong cross-dataset generalization capabilities.
Authors:Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Abstract:
With unprecedented rapid development, deep neural networks (DNNs) have deeply influenced almost all fields. However, their heavy computation costs and model sizes are usually unacceptable in real-world deployment. Model quantization, an effective weight-lightening technique, has become an indispensable procedure in the whole deployment pipeline. The essence of quantization acceleration is the conversion from continuous floating-point numbers to discrete integer ones, which significantly speeds up the memory I/O and calculation, i.e., addition and multiplication. However, performance degradation also comes with the conversion because of the loss of precision. Therefore, it has become increasingly popular and critical to investigate how to perform the conversion and how to compensate for the information loss. This article surveys the recent five-year progress towards low-bit quantization on DNNs. We discuss and compare the state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques. Furthermore, we shed light on the potential research opportunities in the field of model quantization. A curated list of model quantization is provided at https://github.com/Kai-Liu001/Awesome-Model-Quantization.
中文: 深度神经网络因高计算成本和大模型尺寸面临部署挑战,模型量化通过将浮点数转换为整数来加速性能,成为关键技术,但需在速度提升与精度损失间取得平衡。
English: Deep neural networks face deployment challenges due to high computational costs and large model sizes, making model quantization a crucial technique that accelerates performance by converting floating-point numbers to integers, though it requires balancing speed gains with precision loss.
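The float-to-integer conversion at the heart of the survey above can be shown with the simplest variant: symmetric, per-tensor uniform quantization to signed k-bit integers. This is a baseline illustration, not any specific surveyed method.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    """Symmetric per-tensor uniform quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax               # map the largest magnitude to qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8 if bits <= 8 else torch.int32), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)
q, s = quantize_symmetric(w, bits=8)
print((w - dequantize(q, s)).abs().max())   # rounding error is bounded by about scale / 2
```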
Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Abstract:
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce \textbf{X-Transfer}, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as \textbf{super transferability}--a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through \textbf{surrogate scaling}, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available in our \href{https://github.com/HanxunH/XTransferBench}{GitHub repository}.
中文: 本文提出X-Transfer攻击方法,通过动态代理缩放生成具有超级迁移性的通用对抗扰动,能高效地跨任务、跨领域破坏CLIP模型。
English: The paper introduces X-Transfer, a novel attack method that generates a universal adversarial perturbation with super transferability, efficiently compromising CLIP models across various tasks and domains through dynamic surrogate scaling.
Authors:Hongyi Chen, Yunchao Yao, Yufei Ye, Zhixuan Xu, Homanga Bharadhwaj, Jiashun Wang, Shubham Tulsiani, Zackory Erickson, Jeffrey Ichnowski
Abstract:
Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. However, most prior work either focuses on power grasping, which simply involves holding an object still, or relies on costly teleoperated robot demonstrations to teach robots how to grasp each object functionally. Instead, we propose extracting human grasp information from web images since they depict natural and functional object interactions, thereby bypassing the need for curated demonstrations. We reconstruct human hand-object interaction (HOI) 3D meshes from RGB images, retarget the human hand to multi-finger robot hands, and align the noisy object mesh with its accurate 3D shape. We show that these relatively low-quality HOI data from inexpensive web sources can effectively train a functional grasping model. To further expand the grasp dataset for seen and unseen objects, we use the initially-trained grasping policy with web data in the IsaacGym simulator to generate physically feasible grasps while preserving functionality. We train the grasping model on 10 object categories and evaluate it on 9 unseen objects, including challenging items such as syringes, pens, spray bottles, and tongs, which are underrepresented in existing datasets. The model trained on the web HOI dataset achieves a 75.8% success rate on seen objects and 61.8% across all objects in simulation, a 6.7% improvement in success rate and a 1.8x increase in functionality ratings over baselines. Simulator-augmented data further boosts performance from 61.8% to 83.4%. The sim-to-real transfer to the LEAP Hand achieves a 85% success rate. Project website is at: https://web2grasp.github.io/.
Authors:Zinan Liu, Haoran Li, Jingyi Lu, Gaoyuan Ma, Xu Hong, Giovanni Iacca, Arvind Kumar, Shaojun Tang, Lin Wang
Abstract:
Autonomous AI is no longer a hard-to-reach concept: it enables agents to move beyond executing tasks to independently addressing complex problems, adapting to change while handling the uncertainty of the environment. But what makes agents truly autonomous? It is agentic reasoning, which is crucial for foundation models to develop symbolic logic, statistical correlations, or large-scale pattern recognition to process information, draw inferences, and make decisions. However, it remains unclear why and how existing agentic reasoning approaches work, in comparison to biological reasoning, which instead is deeply rooted in neural mechanisms involving hierarchical cognition, multimodal integration, and dynamic interactions. In this work, we propose a novel neuroscience-inspired framework for agentic reasoning. Grounded in three neuroscience-based definitions and supported by mathematical and biological foundations, we propose a unified framework modeling reasoning from perception to action, encompassing four core types (perceptual, dimensional, logical, and interactive) inspired by distinct functional roles observed in the human brain. We apply this framework to systematically classify and analyze existing AI reasoning methods, evaluating their theoretical foundations, computational designs, and practical limitations. We also explore its implications for building more generalizable, cognitively aligned agents in physical and virtual environments. Finally, building on our framework, we outline future directions and propose new neural-inspired reasoning methods, analogous to chain-of-thought prompting. By bridging cognitive neuroscience and AI, this work offers a theoretical foundation and practical roadmap for advancing agentic reasoning in intelligent systems. The associated project can be found at: https://github.com/BioRAILab/Awesome-Neuroscience-Agent-Reasoning .
中文: 自主人工智能通过代理推理实现真正独立,本研究提出受神经科学启发的框架,模拟从感知到行动的推理过程,连接认知神经科学与人工智能,以提升智能系统的能力。
English: Autonomous AI agents achieve true independence through agentic reasoning, which this study advances by proposing a neuroscience-inspired framework that models reasoning from perception to action, bridging cognitive neuroscience and AI to enhance intelligent systems.
Authors:Thomas Sommariva, Simone Calderara, Angelo Porrello
Abstract:
Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the underlying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.
中文: 本研究提出了一种训练算法,扩展了神经变形能力,实现全网络变形且精度损失最小,从而能够跨多种配置可扩展地生成压缩模型。
English: This work introduces a training algorithm that extends Neural Metamorphosis to enable full-network metamorphosis with minimal accuracy loss, allowing scalable generation of compressed models across various configurations.
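To make the INR-based weight generation idea above concrete, here is a minimal sketch of a coordinate MLP that maps normalized (layer, row, column) coordinates to weight values, so that networks of different sizes can be materialized by sampling the coordinate grid at different resolutions. The architecture, coordinate encoding, and normalization below are illustrative assumptions, not the NeuMeta implementation.

```python
import torch
import torch.nn as nn

class WeightINR(nn.Module):
    """Toy coordinate MLP: maps a normalized (layer, row, col) coordinate to one weight.
    Illustrative only; the real NeuMeta INR and its training recipe differ."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords).squeeze(-1)

def materialize_layer(inr: WeightINR, layer_idx: int, rows: int, cols: int,
                      max_layers: int) -> torch.Tensor:
    """Sample a (rows x cols) weight matrix for one layer of a chosen target size."""
    r = torch.linspace(0.0, 1.0, rows)
    c = torch.linspace(0.0, 1.0, cols)
    rr, cc = torch.meshgrid(r, c, indexing="ij")
    ll = torch.full_like(rr, layer_idx / max(max_layers - 1, 1))
    coords = torch.stack([ll, rr, cc], dim=-1).reshape(-1, 3)
    return inr(coords).reshape(rows, cols)

# Example: materialize a 32x16 weight matrix for layer 2 of a 6-layer target network.
inr = WeightINR()
w = materialize_layer(inr, layer_idx=2, rows=32, cols=16, max_layers=6)
print(w.shape)  # torch.Size([32, 16])
```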
Authors:Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
Abstract:
Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases--thinking and solution--with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Our code has been made available at https://github.com/SalesforceAIResearch/Elastic-Reasoning.
中文摘要:弹性推理是一种新颖框架,将推理过程划分为思维和解答两个独立预算阶段,使大型推理模型能够在严格计算限制下保持稳健性能,同时提高效率和简洁性。
English Summary: Elastic Reasoning is a novel framework that divides reasoning into thinking and solution phases with independent budgets, enabling large reasoning models to perform robustly under strict computational constraints while improving efficiency and conciseness.
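As a rough illustration of the separate-budget idea described above, the following sketch decodes in two phases and caps the thinking phase independently of the solution phase. The tag strings and the `model.generate` helper are hypothetical; the actual Elastic Reasoning implementation and its GRPO-based training are not shown.

```python
def budgeted_generate(model, prompt: str, think_budget: int, solution_budget: int) -> str:
    """Two-phase decoding with independent token budgets (illustrative sketch).

    `model.generate(text, max_new_tokens, stop)` is a hypothetical helper that returns
    newly generated text; a real implementation would use an LLM API or Hugging Face
    `generate` with stopping criteria.
    """
    # Phase 1: thinking, capped at `think_budget` tokens.
    thinking = model.generate(prompt + "<think>", max_new_tokens=think_budget,
                              stop="</think>")
    if "</think>" not in thinking:
        # Thinking was cut short by the budget: close the segment and move on,
        # which is the situation budget-constrained rollouts train the model for.
        thinking += "</think>"

    # Phase 2: solution, always granted its full budget so the answer is complete.
    solution = model.generate(prompt + "<think>" + thinking + "<solution>",
                              max_new_tokens=solution_budget, stop="</solution>")
    return solution
```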
Authors:Cong Hua, Qianqian Xu, Zhiyong Yang, Zitai Wang, Shilong Bao, Qingming Huang
Abstract:
Prompt tuning adapts Vision-Language Models like CLIP to open-world tasks with minimal training costs. In this direction, one typical paradigm evaluates model performance separately on known classes (i.e., base domain) and unseen classes (i.e., new domain). However, real-world scenarios require models to handle inputs without prior domain knowledge. This practical challenge has spurred the development of open-world prompt tuning, which demands a unified evaluation of two stages: 1) detecting whether an input belongs to the base or new domain (P1), and 2) classifying the sample into its correct class (P2). What's more, as domain distributions are generally unknown, a proper metric should be insensitive to varying base/new sample ratios (P3). However, we find that current metrics, including HM, overall accuracy, and AUROC, fail to satisfy these three properties simultaneously. To bridge this gap, we propose OpenworldAUC, a unified metric that jointly assesses detection and classification through pairwise instance comparisons. To optimize OpenworldAUC effectively, we introduce Gated Mixture-of-Prompts (GMoP), which employs domain-specific prompts and a gating mechanism to dynamically balance detection and classification. Theoretical guarantees ensure generalization of GMoP under practical conditions. Experiments on 15 benchmarks in open-world scenarios show GMoP achieves SOTA performance on OpenworldAUC and other metrics. We release the code at https://github.com/huacong/OpenworldAUC
中文: 本文提出了OpenworldAUC这一统一评估开放世界提示调优的指标,同时引入GMoP门控提示方法,在多个基准测试中实现了最优性能。
English: This paper introduces OpenworldAUC, a unified metric for evaluating open-world prompt tuning that jointly assesses domain detection and classification, and proposes GMoP, a gated prompt method that achieves state-of-the-art performance across benchmarks.
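A toy reading of the "pairwise instance comparisons" mentioned above is sketched below: over base/new sample pairs, credit a pair only when the detector ranks the new sample above the base sample and both samples are classified correctly. This is an assumed simplification for intuition, not the official OpenworldAUC definition.

```python
import numpy as np

def pairwise_openworld_score(ood_scores_base, ood_scores_new,
                             correct_base, correct_new) -> float:
    """Toy pairwise metric: the fraction of (base, new) pairs where the detector
    scores the new sample above the base sample AND both samples are classified
    correctly. Averaging over pairs keeps it insensitive to the base/new ratio.
    Illustrative approximation, not the official OpenworldAUC definition.
    """
    sb = np.asarray(ood_scores_base, dtype=float)
    sn = np.asarray(ood_scores_new, dtype=float)
    cb = np.asarray(correct_base, dtype=bool)
    cn = np.asarray(correct_new, dtype=bool)

    ranked = sn[None, :] > sb[:, None]          # (n_base, n_new) detection ordering
    both_ok = cb[:, None] & cn[None, :]         # both stages must succeed
    return float((ranked & both_ok).mean())

# Example with 3 base samples and 2 new samples.
print(pairwise_openworld_score([0.1, 0.3, 0.2], [0.8, 0.4],
                               [True, True, False], [True, True]))
```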
Authors:Shashank Agnihotri, Amaan Ansari, Annika Dackermann, Fabian Rösch, Margret Keuper
Abstract:
Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field.
To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation
中文: 深度学习在立体视觉的视差估计中表现出色,但易受分布偏移和对抗攻击影响,为此推出了DispBench这一综合基准工具,用于评估多种数据集和干扰场景下的方法鲁棒性。
English: Deep learning excels in disparity estimation for stereo vision but faces reliability issues from distribution shifts and adversarial attacks, prompting the creation of DispBench, a comprehensive tool to benchmark robustness across various corruptions and datasets.
Authors:Xinyang Lu, Xinyuan Niu, Gregory Kang Ruey Lau, Bui Thi Cam Nhung, Rachael Hwee Ling Sim, Fanyu Wen, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low
Abstract:
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain set have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
Chinese: 本文提出了WaterDrum,一种基于数据的大语言模型遗忘度量方法,通过稳健文本水印技术克服现有基于效用的评估局限,并发布了包含不同相似度数据的新基准数据集用于严格评估。
English: This paper introduces WaterDrum, a data-centric unlearning metric for large language models that utilizes robust text watermarking to address limitations in existing utility-based evaluation methods, alongside new benchmark datasets for rigorous assessment.
Authors:Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Abstract:
User interface (UI) design goes beyond visuals, guiding user behavior and overall user experience (UX). Strategically crafted interfaces, for example, can boost sign-ups and drive business sales, underscoring the shift toward UI/UX as a unified design concept. While recent studies have explored UI quality evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking behavior-oriented aspects. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for assessing models' multimodal understanding of UI/UX design. It includes 300 diverse real-world UI image pairs, each consisting of two design variants A/B-tested at scale by actual companies, where one was empirically validated to steer more user actions than the other. Each pair is accompanied by one or more of 684 expert-curated rationales that capture key factors behind each winning design's effectiveness, spanning diverse cognitive dimensions of UX. Our benchmark supports two core tasks: (1) selecting the more effective UI/UX design by predicting the A/B test verified winner and (2) assessing how well a model, given the winner, can explain its effectiveness in alignment with expert reasoning. Experiments across several MLLMs show that current models exhibit limited nuanced reasoning about UI/UX design and its behavioral impact. We believe our work will foster research in UI/UX understanding and enable broader applications such as behavior-aware interface optimization.
中文摘要:本文提出WiserUI-Bench这一新型基准,通过300组经A/B测试验证的真实界面设计对和专家解析,填补了当前界面设计评估中行为导向分析的空白,实验表明现有模型对界面用户体验设计的细微推理能力仍显不足。
English Summary: This paper introduces WiserUI-Bench, a novel benchmark addressing the gap in evaluating UI/UX design by focusing on behavior-oriented aspects through 300 real-world UI image pairs with expert rationales, revealing current models' limited nuanced reasoning about design effectiveness.
Authors:Jiaqi Zheng, Qing Ling, Yerong Feng
Abstract:
Although deep learning models have demonstrated remarkable potential in weather prediction, most of them overlook either the \textbf{physics} of the underlying weather evolution or the \textbf{topology} of the Earth's surface. In light of these disadvantages, we develop PASSAT, a novel Physics-ASSisted And Topology-informed deep learning model for weather prediction. PASSAT attributes the weather evolution to two key factors: (i) the advection process that can be characterized by the advection equation and the Navier-Stokes equation; (ii) the Earth-atmosphere interaction that is difficult to both model and calculate. PASSAT also takes the topology of the Earth's surface into consideration, other than simply treating it as a plane. With these considerations, PASSAT numerically solves the advection equation and the Navier-Stokes equation on the spherical manifold, utilizes a spherical graph neural network to capture the Earth-atmosphere interaction, and generates the initial velocity fields that are critical to solving the advection equation from the same spherical graph neural network. In the $5.625^\circ$-resolution ERA5 data set, PASSAT outperforms both the state-of-the-art deep learning-based weather prediction models and the operational numerical weather prediction model IFS T42. Code and checkpoint are available at https://github.com/Yumenomae/PASSAT_5p625.
Chinese: PASSAT是一种创新的深度学习模型,通过结合物理方程和地球表面拓扑结构进行天气预报,在球面流形上求解平流和纳维-斯托克斯方程,并利用球面图神经网络捕捉地球-大气相互作用,从而超越了现有模型的性能。
English: PASSAT is a novel deep learning model that integrates physics equations and Earth's surface topology for weather prediction, outperforming existing models by solving advection and Navier-Stokes equations on a spherical manifold and capturing Earth-atmosphere interactions through a spherical graph neural network.
Authors:I-Chun Arthur Liu, Jason Chen, Gaurav Sukhatme, Daniel Seita
Abstract:
Learning bimanual manipulation is challenging due to its high dimensionality and tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 300 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our project website is at: https://dcodaaug.github.io/D-CODA/.
Authors:Abdulaziz Almuzairee, Rohan Patil, Dwait Bhatt, Henrik I. Christensen
Abstract:
Vision is well-known for its use in manipulation, especially using visual servoing. Due to the 3D nature of the world, using multiple camera views and merging them creates better representations for Q-learning and in turn, trains more sample efficient policies. Nevertheless, these multi-view policies are sensitive to failing cameras and can be burdensome to deploy. To mitigate these issues, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while simultaneously disentangling views by augmenting multi-view feature inputs with single-view features. This produces robust policies and allows lightweight deployment. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3. For project website and code, see https://aalmuzairee.github.io/mad
Authors:Yuning Du, Jingshuai Liu, Rohan Dharmakumar, Sotirios A. Tsaftaris
Abstract:
Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. Our approach during inference actively adapts to sequential decisions to optimally sample. To achieve this, we introduce a training methodology that identifies the samples that contribute the best to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially saving k-space samples. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at https://github.com/vios-s/MRI_Sequential_Active_Sampling
中文摘要:本研究提出了一种多目标强化学习框架,通过自适应优化采样策略,能够从欠采样的MRI k空间数据中实现连续诊断评估,在保持膝关节病变诊断准确性的同时大幅减少采样需求。
English Summary: This study introduces a multi-objective reinforcement learning framework that enables sequential diagnostic evaluations from undersampled MRI k-space data, significantly reducing sampling requirements while maintaining competitive diagnostic accuracy for knee pathology assessments.
Authors:Kunlun Xu, Xu Zou, Gang Hua, Jiahuan Zhou
Abstract:
Domain Incremental Learning (DIL) aims to learn from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and obtain advanced performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain are mined via greedy search, which is then exploited to initialize new prompts to achieve reusable knowledge transfer and establish intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks demonstrate the effectiveness of our KA-Prompt. Our source code is available at https://github.com/zhoujiahuan1991/ICML2025-KA-Prompt
中文摘要:本文针对基于提示的领域增量学习中领域特定提示组件错位导致知识冲突的问题,提出KA-Prompt方法,通过组件感知的提示-知识对齐机制显著提升模型的学习和推理能力。
English Summary: This paper identifies a limitation in prompt-based Domain Incremental Learning where misaligned domain-specific prompts cause conflicting knowledge integration, and proposes KA-Prompt with component-aware alignment to enhance model learning and inference capabilities.
Authors:Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, Qingming Huang
Abstract:
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $α$-$β$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at https://github.com/ghwang-s/abkd.
中文: 知识蒸馏的核心挑战在于平衡硬度集中和置信度集中两种效应,ABKD通过α-β散度实现了前向KL散度与反向KL散度之间的平滑插值,在多项实验中验证了其有效性。
English: Knowledge Distillation faces a challenge in balancing the Hardness-Concentration and Confidence-Concentration effects, which ABKD addresses using α-β-divergence to effectively trade off between FKLD and RKLD, as validated by extensive experiments.
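For intuition about the α-β-divergence family mentioned above, the sketch below implements the classical Cichocki-Amari AB-divergence, which recovers forward KL as (α, β) → (1, 0) and reverse KL as (α, β) → (0, 1). This is an assumed stand-in; ABKD's exact parameterization and the distillation pipeline around it may differ.

```python
import torch

def ab_divergence(p: torch.Tensor, q: torch.Tensor,
                  alpha: float, beta: float, eps: float = 1e-8) -> torch.Tensor:
    """Cichocki-Amari alpha-beta divergence between teacher p and student q
    (probability vectors along the last dim), valid for alpha, beta, alpha+beta != 0.
    In the limits (alpha, beta) -> (1, 0) it recovers KL(p||q) (forward KLD) and
    (alpha, beta) -> (0, 1) recovers KL(q||p) (reverse KLD). Assumed stand-in for ABKD.
    """
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    s = alpha + beta
    term = p.pow(alpha) * q.pow(beta) - (alpha / s) * p.pow(s) - (beta / s) * q.pow(s)
    return -term.sum(dim=-1) / (alpha * beta)

# Example: distillation loss between teacher and student softmax outputs.
teacher = torch.softmax(torch.randn(4, 10), dim=-1)
student = torch.softmax(torch.randn(4, 10), dim=-1)
loss = ab_divergence(teacher, student, alpha=0.7, beta=0.3).mean()
print(loss.item())
```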
Authors:Ren Wang, Pengcheng Zhou
Abstract:
Manifold learning aims to discover and represent low-dimensional structures underlying high-dimensional data while preserving critical topological and geometric properties. Existing methods often fail to capture local details with global topological integrity from noisy data or construct a balanced dimensionality reduction, resulting in distorted or fractured embeddings. We present an AutoEncoder-based method that integrates a manifold reconstruction layer, which uncovers latent manifold structures from noisy point clouds, and further provides regularizations on topological and geometric properties during dimensionality reduction, with the two components promoting each other during training. Experiments on point cloud datasets demonstrate that our method outperforms baselines like t-SNE, UMAP, and Topological AutoEncoders in discovering manifold structures from noisy data and preserving them through dimensionality reduction, as validated by visualization and quantitative metrics. This work demonstrates the significance of combining manifold reconstruction with manifold learning to achieve reliable representation of the latent manifold, particularly when dealing with noisy real-world data. Code repository: https://github.com/Thanatorika/mrtg.
Chinese: 本文提出了一种基于自动编码器的方法,将流形重建与拓扑和几何正则化相结合,能够有效从噪声数据中发现并保持潜在结构,在可视化和定量指标上均优于现有技术。
English: This paper introduces an AutoEncoder-based method that integrates manifold reconstruction with topological and geometric regularizations to effectively discover and preserve latent structures from noisy data, outperforming existing techniques in both visualization and quantitative metrics.
Authors:Weiwei Ye, Zhuopeng Xu, Ning Gui
Abstract:
Due to the dynamics of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step and integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code is available at https://github.com/wwy155/NsDiff.
中文: 本文提出NsDiff,一种基于扩散的概率预测框架,采用位置尺度噪声模型来捕捉时间序列中的时变不确定性,克服了传统模型固定方差假设的局限,并在多个数据集上展现出优越性能。
English: This paper introduces NsDiff, a diffusion-based probabilistic forecasting framework that employs the Location-Scale Noise Model to capture time-varying uncertainty in time series, overcoming the limitations of traditional models with fixed variance assumptions and demonstrating superior performance across multiple datasets.
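The contrast between the additive noise model and the location-scale noise model that NsDiff builds on is compact enough to state in code: the ANM fixes the noise scale, while the LSNM lets it depend on the conditioning input. The mean and scale functions below are toy placeholders for NsDiff's pretrained estimators.

```python
import torch

def anm_sample(f, x, sigma: float = 1.0):
    """Additive noise model: y = f(x) + sigma * eps, with a constant noise scale."""
    return f(x) + sigma * torch.randn_like(f(x))

def lsnm_sample(mu, sigma_fn, x):
    """Location-scale noise model: y = mu(x) + sigma(x) * eps, so the noise scale
    (uncertainty) varies with the conditioning input, as NsDiff assumes."""
    return mu(x) + sigma_fn(x) * torch.randn_like(mu(x))

# Example with toy mean/scale functions (placeholders for NsDiff's pretrained estimators).
x = torch.linspace(0, 1, 5).unsqueeze(-1)
mu = lambda z: 2.0 * z
sigma_fn = lambda z: 0.1 + 0.5 * z               # uncertainty grows with the input
print(anm_sample(mu, x).squeeze(-1))             # homoscedastic draw
print(lsnm_sample(mu, sigma_fn, x).squeeze(-1))  # heteroscedastic draw
```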
Authors:Hongyi Li, Jun Xu, William Ward Armstrong
Abstract:
We introduce the Learning Hyperplane Tree (LHT), a novel oblique decision tree model designed for expressive and interpretable classification. LHT fundamentally distinguishes itself through a non-iterative, statistically-driven approach to constructing splitting hyperplanes. Unlike methods that rely on iterative optimization or heuristics, LHT directly computes the hyperplane parameters, which are derived from feature weights based on the differences in feature expectations between classes within each node. This deterministic mechanism enables a direct and well-defined hyperplane construction process. Predictions leverage a unique piecewise linear membership function within leaf nodes, obtained via local least-squares fitting. We formally analyze the convergence of the LHT splitting process, ensuring that each split yields meaningful, non-empty partitions. Furthermore, we establish that the time complexity for building an LHT up to depth $d$ is $O(mnd)$, demonstrating the practical feasibility of constructing trees with powerful oblique splits using this methodology. The explicit feature weighting at each split provides inherent interpretability. Experimental results on benchmark datasets demonstrate LHT's competitive accuracy, positioning it as a practical, theoretically grounded, and interpretable alternative in the landscape of tree-based models. The implementation of the proposed method is available at https://github.com/Hongyi-Li-sz/LHT_model.
Chinese: 学习超平面树(LHT)是一种新型斜决策树,通过非迭代的统计驱动方法构建分割超平面,凭借明确的特征权重提供内在可解释性,并在基准数据集上展现出有竞争力的准确率。
English: The Learning Hyperplane Tree (LHT) is a novel oblique decision tree that uses a non-iterative, statistically-driven method to construct splitting hyperplanes, offering competitive accuracy and inherent interpretability through explicit feature weighting.
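The deterministic split described above can be illustrated directly: take the hyperplane normal as the difference of per-class feature expectations and split on the projection. The threshold choice below (midpoint of the projected class means) is an illustrative assumption rather than LHT's exact rule.

```python
import numpy as np

def lht_split(X: np.ndarray, y: np.ndarray):
    """Oblique split from class-conditional feature expectations: the hyperplane
    normal is the difference of class means, and the threshold here is the midpoint
    of the projected class means (an illustrative choice, not necessarily LHT's).
    """
    X0, X1 = X[y == 0], X[y == 1]
    w = X1.mean(axis=0) - X0.mean(axis=0)              # feature-expectation difference
    t = 0.5 * (X0.mean(axis=0) @ w + X1.mean(axis=0) @ w)
    left_mask = X @ w <= t                             # low projections go left
    return w, t, left_mask

# Example on a 2D toy problem with two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, t, left = lht_split(X, y)
print(w, t, int(left.sum()))
```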
Authors:Xuyang Wang, Siyuan Duan, Qizhi Li, Guiduo Duan, Yuan Sun, Dezhong Peng
Abstract:
Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, thereby deceiving or disrupting multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this tricky problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of corresponding evidences, which is extracted by a pretrained evidence extractor. Then, we employ the feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interferences, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.
中文摘要:本文提出RDML框架,通过证据引导的解耦学习分离多视图数据中的正常特征与对抗扰动,采用特征重校准和视图级证据注意力机制,显著提升了多视图分类任务在对抗攻击下的可靠性。
English Summary: This paper introduces RDML, a reliable multi-view learning framework that addresses adversarial unreliability by disentangling clean and adversarial features using evidence-guided mechanisms to enhance model robustness against attacks.
Authors:Xuechao Wang, Sven Nomm, Junqing Huang, Kadri Medijainen, Aaro Toomela, Michael Ruzhansky
Abstract:
Deep neural networks have shown potential in analyzing digitized hand-drawn signals for early diagnosis of Parkinson's disease. However, the lack of clear interpretability in existing diagnostic methods presents a challenge to clinical trust. In this paper, we propose PointExplainer, an explainable diagnostic strategy to identify hand-drawn regions that drive model diagnosis. Specifically, PointExplainer assigns discrete attribution values to hand-drawn segments, explicitly quantifying their relative contributions to the model's decision. Its key components include: (i) a diagnosis module, which encodes hand-drawn signals into 3D point clouds to represent hand-drawn trajectories, and (ii) an explanation module, which trains an interpretable surrogate model to approximate the local behavior of the black-box diagnostic model. We also introduce consistency measures to further address the issue of faithfulness in explanations. Extensive experiments on two benchmark datasets and a newly constructed dataset show that PointExplainer can provide intuitive explanations with no diagnostic performance degradation. The source code is available at https://github.com/chaoxuewang/PointExplainer.
中文摘要:PointExplainer是一种可解释的诊断策略,通过为手绘片段分配归因值来识别推动帕金森病诊断的关键区域,在保持诊断性能的同时提供直观解释。
English Summary: PointExplainer is an interpretable diagnostic method that identifies key hand-drawn segments for Parkinson's disease detection while maintaining diagnostic accuracy through attribution values and consistency measures.
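The first step of the diagnosis module, encoding a hand-drawn signal as a 3D point cloud, might look like the sketch below: lift the pen trajectory (x, y) with a time axis and normalize. The choice of time as the third coordinate and the normalization are assumptions, not necessarily what PointExplainer uses.

```python
import numpy as np

def trajectory_to_point_cloud(xy: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Lift a 2D pen trajectory into a 3D point cloud (x, y, t), normalizing each axis
    to [0, 1]. Time as the third axis is an illustrative choice; the paper's encoding
    may use a different channel (e.g., pressure or velocity)."""
    pts = np.column_stack([xy, t.astype(float)])
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    return (pts - mins) / np.maximum(maxs - mins, 1e-8)

# Example: a spiral-like stroke sampled at 100 time steps.
t = np.linspace(0, 1, 100)
xy = np.column_stack([t * np.cos(8 * np.pi * t), t * np.sin(8 * np.pi * t)])
cloud = trajectory_to_point_cloud(xy, t)
print(cloud.shape)  # (100, 3)
```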
Authors:Ioannis Nasios
Abstract:
Harmful algal blooms are a growing threat to inland water quality and public health worldwide, creating an urgent need for efficient, accurate, and cost-effective detection methods. This research introduces a high-performing methodology that integrates multiple open-source remote sensing data with advanced artificial intelligence models. Key data sources include Copernicus Sentinel-2 optical imagery, the Copernicus Digital Elevation Model (DEM), and NOAA's High-Resolution Rapid Refresh (HRRR) climate data, all efficiently retrieved using platforms like Google Earth Engine (GEE) and Microsoft Planetary Computer (MPC). The NIR and two SWIR bands from Sentinel-2, the altitude from the elevation model, the temperature and wind from NOAA, and the longitude and latitude were the most important features. The approach combines two types of machine learning models, tree-based models and a neural network, into an ensemble for classifying algal bloom severity. While the tree models performed strongly on their own, incorporating a neural network added robustness and demonstrated how deep learning models can effectively use diverse remote sensing inputs. The method leverages high-resolution satellite imagery and AI-driven analysis to monitor algal blooms dynamically, and although initially developed for a NASA competition in the U.S., it shows potential for global application. The complete code is available for further adaptation and practical implementation, illustrating the convergence of remote sensing data and AI to address critical environmental challenges (https://github.com/IoannisNasios/HarmfulAlgalBloomDetection).
中文: 本研究提出一种集成树模型与神经网络的AI方法,结合多源遥感数据高效检测有害藻华,虽最初针对美国水域开发,但具备全球应用潜力,为应对水环境挑战提供了可扩展方案。
English: This study presents an AI-driven ensemble method combining tree-based models and neural networks with multi-source remote sensing data to efficiently detect harmful algal blooms, offering a scalable solution initially developed for U.S. waters but with global applicability.
Authors:Lang Feng, Weihao Tan, Zhiyi Lyu, Longtao Zheng, Haiyang Xu, Ming Yan, Fei Huang, Bo An
Abstract:
Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
Chinese: CoSo提出了一种新颖的在线微调方法,通过反事实推理动态评估单个令牌对处理后动作的因果影响,优先探索关键令牌,从而在多种任务中显著提升探索效率和性能表现。
English: CoSo introduces a novel online fine-tuning method for vision-language model agents that uses counterfactual reasoning to dynamically prioritize exploration of action-critical tokens, significantly improving exploration efficiency and performance across diverse tasks.
Authors:Md Fahim Anjum
Abstract:
Large Language Models (LLMs) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs of the reasoning model, enabling fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to $87\%$ higher F1 and $3.7\%$ better discrimination accuracy than CodeLlama-7B, as well as $3.7\%$ higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.
中文: 推理模型在规划框架中作为判别器表现卓越,其文本转SQL能力远超参数更多的非推理模型,但在生成任务中存在局限,揭示了其在智能体框架中的最优角色定位。
English: Reasoning models like DeepSeek-R1 demonstrate superior discrimination capabilities in LLM planning frameworks, significantly outperforming larger non-reasoning models in text-to-SQL tasks despite having fewer parameters, while revealing limitations in generation tasks.
Authors:Eleftherios Tzanis, Michail E. Klontzas
Abstract:
Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: https://github.com/eltzanis/mAIstro
中文: mAIstro是一个开源的自主多智能体框架,通过自然语言界面实现医疗AI模型的端到端开发和部署,无需编码即可统一跨多种医疗应用的数据分析、模型构建和推理过程。
English: mAIstro is an open-source, autonomous multi-agent framework that enables end-to-end development and deployment of medical AI models through a natural language interface, eliminating the need for coding and unifying data analysis, model creation, and inference across diverse healthcare applications.
Authors:Asad Aali, Adney Cardoza, Melissa Capo
Abstract:
Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. We open-source our code for the respective implementations: 1) Huggingface (https://github.com/asad-aali/splitwiser), and 2) vLLM (https://github.com/adney11/vllm-sysml).
中文摘要:Splitwiser是一种创新方法,通过将LLM推理的提示计算和令牌生成两阶段整合到同一GPU上,有效降低开销并提升资源利用率,已针对Huggingface和vLLM架构开源实现。
English Summary: Splitwiser is a novel method that enhances LLM inference efficiency by co-locating prompt computation and token generation phases on a single GPU, reducing overhead and improving resource utilization, with implementations available for Huggingface and vLLM architectures.
Authors:Sharvi Endait, Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Raviraj Joshi
Abstract:
The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at https://github.com/l3cube-pune/indic-nlp
中文: 本文介绍了IndicSQuAD,一个基于SQuAD构建的涵盖九种印度语言的多语言问答数据集,旨在解决这些语言在问答系统中的代表性不足问题,并通过基线模型评估揭示了当前面临的挑战和未来研究方向。
English: The paper introduces IndicSQuAD, a multilingual QA dataset for nine Indic languages derived from SQuAD, addressing the underrepresentation of these languages in QA systems and evaluating baseline models to highlight challenges and future research directions.
Authors:Bernardo Marenco, Paola Bermolen, Marcelo Fiori, Federico Larroca, Gonzalo Mateos
Abstract:
Modeling of intricate relational patterns has become a cornerstone of contemporary statistical research and related data science fields. Networks, represented as graphs, offer a natural framework for this analysis. This paper extends the Random Dot Product Graph (RDPG) model to accommodate weighted graphs, markedly broadening the model's scope to scenarios where edges exhibit heterogeneous weight distributions. We propose a nonparametric weighted (W)RDPG model that assigns a sequence of latent positions to each node. Inner products of these nodal vectors specify the moments of their incident edge weights' distribution via moment-generating functions. In this way, and unlike prior art, the WRDPG can discriminate between weight distributions that share the same mean but differ in other higher-order moments. We derive statistical guarantees for an estimator of the nodes' latent positions, adapted from the workhorse adjacency spectral embedding, establishing its consistency and asymptotic normality. We also contribute a generative framework that enables sampling of graphs that adhere to a (prescribed or data-fitted) WRDPG, facilitating, e.g., the analysis and testing of observed graph metrics using judicious reference distributions. The paper is organized to formalize the model's definition, the estimation (or nodal embedding) process and its guarantees, as well as the methodologies for generating weighted graphs, all complemented by illustrative and reproducible examples showcasing the WRDPG's effectiveness in various network analytic applications.
中文: 本文提出了一种非参数加权随机点积图(WRDPG)模型,通过为节点分配潜位置扩展了RDPG以处理加权图,能够区分均值相同但高阶矩不同的权重分布,并为估计量提供了统计保证及生成图结构的框架。
English: This paper introduces a nonparametric weighted Random Dot Product Graph (WRDPG) model that extends the RDPG to handle weighted graphs by assigning latent positions to nodes, enabling discrimination between weight distributions with identical means but differing higher-order moments, and provides statistical guarantees for the estimator and a generative framework for sampling graphs.
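The estimator the abstract adapts, adjacency spectral embedding, is a standard procedure: scale the leading eigenvectors of the adjacency matrix by the square roots of the eigenvalue magnitudes. The sketch below shows vanilla ASE on a weighted graph; the WRDPG-specific extension to a sequence of latent positions per node is not reproduced.

```python
import numpy as np

def adjacency_spectral_embedding(A: np.ndarray, d: int) -> np.ndarray:
    """Classical ASE: the d leading eigenvectors (by eigenvalue magnitude) of a
    symmetric, possibly weighted adjacency matrix, scaled by sqrt(|eigenvalue|).
    Rows are estimated latent positions."""
    vals, vecs = np.linalg.eigh(A)                  # ascending eigenvalues
    idx = np.argsort(np.abs(vals))[::-1][:d]        # keep the d largest in magnitude
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

# Example: embed a small two-block weighted graph in d = 2 dimensions.
rng = np.random.default_rng(1)
B = np.block([[rng.uniform(2, 3, (5, 5)), rng.uniform(0, 1, (5, 5))],
              [rng.uniform(0, 1, (5, 5)), rng.uniform(2, 3, (5, 5))]])
A = np.triu(B, 1)
A = A + A.T                                         # symmetric weights, zero diagonal
Xhat = adjacency_spectral_embedding(A, d=2)
print(Xhat.shape)  # (10, 2)
```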
Authors:Kirill Lukyanov, Mikhail Drobyshevskiy, Georgii Sazonov, Mikhail Soloviov, Ilya Makarov
Abstract:
The growing need for Trusted AI (TAI) highlights the importance of interpretability and robustness in machine learning models. However, many existing tools overlook graph data and rarely combine these two aspects into a single solution. Graph Neural Networks (GNNs) have become a popular approach, achieving top results across various tasks. We introduce GNN-AID (Graph Neural Network Analysis, Interpretation, and Defense), an open-source framework designed for graph data to address this gap. Built as a Python library, GNN-AID supports advanced trust methods and architectural layers, allowing users to analyze graph datasets and GNN behavior using attacks, defenses, and interpretability methods.
GNN-AID is built on PyTorch-Geometric, offering preloaded datasets, models, and support for any GNNs through customizable interfaces. It also includes a web interface with tools for graph visualization and no-code features like an interactive model builder, simplifying the exploration and analysis of GNNs. The framework also supports MLOps techniques, ensuring reproducibility and result versioning to track and revisit analyses efficiently.
GNN-AID is a flexible tool for developers and researchers. It helps developers create, analyze, and customize graph models, while also providing access to prebuilt datasets and models for quick experimentation. Researchers can use the framework to explore advanced topics on the relationship between interpretability and robustness, test defense strategies, and combine methods to protect against different types of attacks.
We also show how defenses against evasion and poisoning attacks can conflict when applied to graph data, highlighting the complex connections between defense strategies.
GNN-AID is available at \href{https://github.com/ispras/GNN-AID}{github.com/ispras/GNN-AID}
中文: GNN-AID是一个基于PyTorch-Geometric的开源框架,专门针对图数据填补可信AI的空白,集成了可解释性、鲁棒性和防御机制的工具,为开发者和研究人员提供代码与无代码结合的解决方案。
English: GNN-AID is an open-source Python framework built on PyTorch-Geometric that addresses the gap in Trusted AI by providing integrated tools for interpretability, robustness, and defense mechanisms specifically for graph data, featuring both code and no-code interfaces for developers and researchers.
Authors:Saleh Zare Zade, Yao Qiang, Xiangyu Zhou, Hui Zhu, Mohammad Amin Roshani, Prashant Khanduri, Dongxiao Zhu
Abstract:
Membership Inference Attacks (MIAs) have recently been employed to determine whether a specific text was part of the pre-training data of Large Language Models (LLMs). However, existing methods often misinfer non-members as members, leading to a high false positive rate, or depend on additional reference models for probability calibration, which limits their practicality. To overcome these challenges, we introduce a novel framework called Automatic Calibration Membership Inference Attack (ACMIA), which utilizes a tunable temperature to calibrate output probabilities effectively. This approach is inspired by our theoretical insights into maximum likelihood estimation during the pre-training of LLMs. We introduce ACMIA in three configurations designed to accommodate different levels of model access and increase the probability gap between members and non-members, improving the reliability and robustness of membership inference. Extensive experiments on various open-source LLMs demonstrate that our proposed attack is highly effective, robust, and generalizable, surpassing state-of-the-art baselines across three widely used benchmarks. Our code is available at: \href{https://github.com/Salehzz/ACMIA}{\textcolor{blue}{Github}}.
中文: 提出的自动校准成员推理攻击(ACMIA)框架通过可调温度参数有效校准输出概率,增强了大型语言模型中成员推理的可靠性与鲁棒性,在多个基准测试中优于现有方法。
English: The proposed Automatic Calibration Membership Inference Attack (ACMIA) framework effectively calibrates output probabilities using a tunable temperature to enhance the reliability and robustness of membership inference in Large Language Models, outperforming existing methods across multiple benchmarks.
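The basic quantity behind such an attack, a temperature-calibrated likelihood of the candidate text under the target model, can be sketched as below. The aggregation (mean token log-probability) and the absence of any thresholding or temperature-tuning logic are simplifying assumptions; ACMIA's three configurations differ in these details.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def calibrated_score(model, input_ids: torch.Tensor, temperature: float) -> float:
    """Mean log-probability of a token sequence under temperature-scaled logits.
    Higher scores suggest membership. `model` is any Hugging Face style causal LM
    whose output exposes `.logits`; thresholding and temperature tuning are omitted.
    """
    logits = model(input_ids).logits[:, :-1, :] / temperature   # predict tokens 1..L-1
    log_probs = F.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```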
Authors:Keyu Chen, Wenchao Sun, Hao Cheng, Sifa Zheng
Abstract:
Achieving both realism and controllability in closed-loop traffic simulation remains a key challenge in autonomous driving. Dataset-based methods reproduce realistic trajectories but suffer from covariate shift in closed-loop deployment, compounded by simplified dynamics models that further reduce reliability. Conversely, physics-based simulation methods enhance reliable and controllable closed-loop interactions but often lack expert demonstrations, compromising realism. To address these challenges, we introduce a dual-stage AV-centric simulation framework that conducts imitation learning pre-training in a data-driven simulator to capture trajectory-level realism and route-level controllability, followed by reinforcement learning fine-tuning in a physics-based simulator to enhance style-level controllability and mitigate covariate shift. In the fine-tuning stage, we propose RIFT, a novel group-relative RL fine-tuning strategy that evaluates all candidate modalities through group-relative formulation and employs a surrogate objective for stable optimization, enhancing style-level controllability and mitigating covariate shift while preserving the trajectory-level realism and route-level controllability inherited from IL pre-training. Extensive experiments demonstrate that RIFT improves realism and controllability in traffic simulation while simultaneously exposing the limitations of modern AV systems in closed-loop evaluation. Project Page: https://currychen77.github.io/RIFT/
Authors:Arthur Corrêa, Alexandre Jesus, Cristóvão Silva, Samuel Moniz
Abstract:
Recently, deep reinforcement learning has emerged as a promising approach for solving complex combinatorial optimization problems. Broadly, deep reinforcement learning methods fall into two categories: policy-based and value-based. While value-based approaches have achieved notable success in domains such as the Arcade Learning Environment, the combinatorial optimization community has predominantly favored policy-based methods, often overlooking the potential of value-based algorithms. In this work, we conduct a comprehensive empirical evaluation of value-based algorithms, including the deep q-network and several of its advanced extensions, within the context of two complex combinatorial problems: the job-shop and the flexible job-shop scheduling problems, two fundamental challenges with multiple industrial applications. Our results challenge the assumption that policy-based methods are inherently superior for combinatorial optimization. We show that several value-based approaches can match or even outperform the widely adopted proximal policy optimization algorithm, suggesting that value-based strategies deserve greater attention from the combinatorial optimization community. Our code is openly available at: https://github.com/AJ-Correa/Unraveling-the-Rainbow.
中文摘要:本研究通过实证评估表明,在组合优化问题中,基于值的深度强化学习算法能够匹敌甚至超越基于策略的方法,挑战了该领域对后者的主流偏好。
English Summary: This study demonstrates that value-based deep reinforcement learning algorithms can compete with or surpass policy-based methods in complex combinatorial optimization problems, challenging the field's prevailing preference for the latter.
Authors:Yutong Xie, Fuchao Yang, Yuheng Jia
Abstract:
Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels and only one label is the ground-truth label. For the first time, this paper investigates the partial label clustering problem, which takes advantage of the limited available partial labels to improve the clustering performance. Specifically, we first construct a weight matrix of examples based on their relationships in the feature space and disambiguate the candidate labels to estimate the ground-truth label based on the weight matrix. Then, we construct a set of must-link and cannot-link constraints based on the disambiguation results. Moreover, we propagate the initial must-link and cannot-link constraints based on an adversarial prior promoted dual-graph learning approach. Finally, we integrate weight matrix construction, label disambiguation, and pairwise constraints propagation into a joint model to achieve mutual enhancement. We also theoretically prove that a better disambiguated label matrix can help improve clustering performance. Comprehensive experiments demonstrate our method realizes superior performance when comparing with state-of-the-art constrained clustering methods, and outperforms PLL and semi-supervised PLL methods when only limited samples are annotated. The code is publicly available at https://github.com/xyt-ml/PLC.
中文摘要:本文首次提出部分标签聚类方法,通过特征空间权重矩阵构建、标签消歧和约束传播的联合优化模型,在有限标注样本下显著超越了现有约束聚类和部分标签学习方法。
English Summary: This paper introduces a novel partial label clustering method that integrates feature-based weight matrix construction, label disambiguation, and constraint propagation into a unified model, demonstrating superior performance over existing constrained clustering and partial label learning approaches.
Authors:Kien Tran Duc Tuan, Tam Nguyen Trong, Son Nguyen Hoang, Khoat Than, Anh Nguyen Duc
Abstract:
In explainable AI, Integrated Gradients (IG) is a widely adopted technique for assessing the significance of feature attributes of the input on model outputs by evaluating contributions from a baseline input to the current input. The choice of the baseline input significantly influences the resulting explanation. While the traditional Expected Gradients (EG) method assumes baselines can be uniformly sampled and averaged with equal weights, this study argues that baselines should not be treated equivalently. We introduce Weighted Integrated Gradients (WG), a novel approach that unsupervisedly evaluates baseline suitability and incorporates a strategy for selecting effective baselines. Theoretical analysis demonstrates that WG satisfies essential explanation method criteria and offers greater stability than prior approaches. Experimental results further confirm that WG outperforms EG across diverse scenarios, achieving an improvement of 10-35\% on main metrics. Moreover, by evaluating baselines, our method can filter a subset of effective baselines for each input to calculate explanations, maintaining high accuracy while reducing computational cost. The code is available at: https://github.com/tamnt240904/weighted_ig.
中文: 本研究提出加权积分梯度(WG)方法,通过无监督评估基线适用性并筛选有效基线,在保持高精度的同时降低计算成本,相比期望梯度法在主要指标上提升10-35%且具有更优稳定性。
English: This study introduces Weighted Integrated Gradients (WG), an unsupervised method that evaluates baseline suitability and selects effective baselines, demonstrating superior performance and stability over Expected Gradients with 10-35% metric improvements while reducing computational costs.
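Integrated Gradients over a single baseline is standard, and weighting attributions across several baselines captures the general shape of WG; the sketch below shows that structure with user-supplied weights standing in for WG's unsupervised baseline-suitability scores, which are the paper's contribution and not reproduced here.

```python
import torch

def integrated_gradients(f, x, baseline, steps: int = 64) -> torch.Tensor:
    """Standard IG for one baseline: (x - x') times the average gradient of f along
    the straight-line path from the baseline x' to the input x."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    f(path).sum().backward()
    return (x - baseline) * path.grad.mean(dim=0)

def weighted_ig(f, x, baselines, weights) -> torch.Tensor:
    """Weighted average of per-baseline IG attributions; `weights` stand in for WG's
    unsupervised baseline-suitability scores (placeholder values here)."""
    w = torch.tensor(weights, dtype=x.dtype)
    w = w / w.sum()
    attrs = torch.stack([integrated_gradients(f, x, b) for b in baselines])
    return (w.view(-1, *([1] * x.dim())) * attrs).sum(dim=0)

# Example with a simple quadratic model and two candidate baselines.
f = lambda z: (z ** 2).sum(dim=-1)
x = torch.tensor([1.0, 2.0])
baselines = [torch.zeros(2), torch.full((2,), 0.5)]
print(weighted_ig(f, x, baselines, weights=[0.8, 0.2]))
```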
Authors:Mohammad Rostami, Atik Faysal, Reihaneh Gh. Roshan, Huaxia Wang, Nikhil Muralidhar, Yu-Dong Yao
Abstract:
Automatic Modulation Classification (AMC) is critical for efficient spectrum management and robust wireless communications. However, AMC remains challenging due to the complex interplay of signal interference and noise. In this work, we propose an innovative framework that integrates traditional signal processing techniques with Large-Language Models (LLMs) to address AMC. Our approach leverages higher-order statistics and cumulant estimation to convert quantitative signal features into structured natural language prompts. By incorporating exemplar contexts into these prompts, our method exploits the LLM's inherent familiarity with classical signal processing, enabling effective one-shot classification without additional training or preprocessing (e.g., denoising). Experimental evaluations on synthetically generated datasets, spanning both noiseless and noisy conditions, demonstrate that our framework achieves competitive performance across diverse modulation schemes and Signal-to-Noise Ratios (SNRs). Moreover, our approach paves the way for robust foundation models in wireless communications across varying channel conditions, significantly reducing the expense associated with developing channel-specific models. This work lays the foundation for scalable, interpretable, and versatile signal classification systems in next-generation wireless networks. The source code is available at https://github.com/RU-SIT/context-is-king
中文摘要:本研究提出了一种将传统信号处理技术与大语言模型相结合的新框架,用于自动调制分类,无需额外训练或预处理即可在不同噪声条件下实现稳健性能。
English Summary: This study introduces a novel framework combining traditional signal processing with Large-Language Models (LLMs) for Automatic Modulation Classification, achieving robust performance across various noise conditions without requiring additional training or preprocessing.
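The feature-to-prompt step can be illustrated with a few standard higher-order cumulants of a complex baseband signal, formatted into a natural-language question. The cumulant set, the use of magnitudes, and the prompt wording are assumptions about what such a pipeline might look like, not the paper's template.

```python
import numpy as np

def cumulant_features(x: np.ndarray) -> dict:
    """Second- and fourth-order cumulants of a zero-mean complex baseband signal,
    using the standard definitions C20=E[x^2], C21=E[|x|^2],
    C40=E[x^4]-3E[x^2]^2, C42=E[|x|^4]-|E[x^2]|^2-2E[|x|^2]^2."""
    x = x - x.mean()
    m20, m21 = np.mean(x ** 2), np.mean(np.abs(x) ** 2)
    m40, m42 = np.mean(x ** 4), np.mean(np.abs(x) ** 4)
    return {"C20": m20, "C21": m21,
            "C40": m40 - 3 * m20 ** 2,
            "C42": m42 - np.abs(m20) ** 2 - 2 * m21 ** 2}

def features_to_prompt(feats: dict, snr_db: float) -> str:
    """Format cumulant magnitudes as a natural-language prompt (wording is illustrative)."""
    lines = [f"|{k}| = {abs(v):.3f}" for k, v in feats.items()]
    return (f"A received signal at {snr_db:.0f} dB SNR has cumulant magnitudes: "
            + "; ".join(lines)
            + ". Which modulation scheme (e.g., BPSK, QPSK, 16-QAM) is most likely?")

# Example: a noisy QPSK burst.
rng = np.random.default_rng(0)
symbols = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=2048) / np.sqrt(2)
signal = symbols + 0.1 * (rng.standard_normal(2048) + 1j * rng.standard_normal(2048))
print(features_to_prompt(cumulant_features(signal), snr_db=20))
```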
Authors:Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
Abstract:
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than \$2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement.
Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
中文:RADLADS协议能以极少的训练数据和成本将softmax注意力变换器高效转换为线性注意力解码器,在保持接近原模型质量的同时实现领先性能,并以开放许可证发布。
English: RADLADS enables efficient conversion of softmax attention transformers into linear attention decoders using minimal training tokens and cost, achieving near-original quality and state-of-the-art performance while being released under open licenses.
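For readers unfamiliar with the target architecture class, the sketch below shows generic causal linear attention with a positive feature map, computed as a recurrence so that no T x T attention matrix is ever materialized. It is not the RWKV-variant recurrence RADLADS distills into; it only illustrates why linear attention decoders are cheap at inference.

```python
import numpy as np

def phi(x):
    """A common positive feature map for linear attention (elu(x) + 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Causal linear attention as a recurrence: the state S accumulates
    phi(k_t)^T v_t, so each step costs O(d_k * d_v) regardless of T."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k)^T v
    z = np.zeros(d_k)          # running sum of phi(k) for normalization
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-8)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (16, 8)
```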
Authors:Zhikai Wang, Yanyan Shen, Zibin Zhang, Kangyi Lin
Abstract:
Click-through Rate (CTR) prediction in real-world recommender systems often deals with billions of user interactions every day. To improve the training efficiency, it is common to update the CTR prediction model incrementally using the new incremental data and a subset of historical data. However, the feature embeddings of a CTR prediction model often get stale when the corresponding features do not appear in current incremental data. In the next period, the model would have a performance degradation on samples containing stale features, which we call the feature staleness problem. To mitigate this problem, we propose a Feature Staleness Aware Incremental Learning method for CTR prediction (FeSAIL) which adaptively replays samples containing stale features. We first introduce a staleness aware sampling algorithm (SAS) to sample a fixed number of stale samples with high sampling efficiency. We then introduce a staleness aware regularization mechanism (SAR) for a fine-grained control of the feature embedding updating. We instantiate FeSAIL with a general deep learning-based CTR prediction model and the experimental results demonstrate FeSAIL outperforms various state-of-the-art methods on four benchmark datasets.
Chinese: FeSAIL方法通过自适应回放陈旧特征样本和精细化嵌入正则化,有效缓解了增量CTR模型训练中的特征陈旧问题,提升了模型性能。
English: FeSAIL addresses the feature staleness issue in incremental CTR model training by adaptively replaying stale feature samples and applying fine-grained embedding regularization to enhance performance.
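The paper's SAS algorithm and SAR regularizer are not spelled out in the abstract; the sketch below only illustrates the underlying replay idea under stated assumptions: track the last period in which each feature id appeared and preferentially sample historical examples whose features are stale in the current increment. Data structures and the weighting rule are hypothetical.

```python
import random
from collections import defaultdict

def update_last_seen(last_seen, batch, period):
    """Record the most recent period in which each feature id appeared."""
    for sample in batch:
        for feat in sample["features"]:
            last_seen[feat] = period

def staleness_aware_sample(history, last_seen, period, k):
    """Replay k historical samples, weighting each by the total staleness
    of its features (illustrative scheme, not the paper's exact SAS)."""
    def staleness(sample):
        return sum(period - last_seen.get(f, 0) for f in sample["features"])
    weights = [1 + staleness(s) for s in history]
    return random.choices(history, weights=weights, k=min(k, len(history)))

# Toy usage: feature 42 has not appeared since period 1, so samples
# containing it are replayed more often in period 5.
last_seen = defaultdict(int)
history = [{"features": [1, 42]}, {"features": [1, 2]}, {"features": [2, 3]}]
update_last_seen(last_seen, history, period=1)
update_last_seen(last_seen, [{"features": [1, 2, 3]}], period=5)
print(staleness_aware_sample(history, last_seen, period=5, k=2))
```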
Authors:Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Abstract:
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.
Chinese: ReplaceMe 是一种无需训练的深度剪枝方法,通过线性操作替换 transformer 模块,在无需重新训练的情况下实现高达 25% 的剪枝率,同时保持约 90% 的原始性能。
English: ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation, achieving up to 25% pruning while maintaining approximately 90% of original performance without retraining.
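The core idea of replacing a span of blocks with a linear map estimated from a small calibration set admits a closed-form least-squares sketch, shown below. The ridge term and the exact way the estimated map is merged into the remaining blocks are illustrative assumptions, not the paper's precise estimator.

```python
import numpy as np

def estimate_replacement(X_in, X_out, ridge=1e-3):
    """Fit W minimizing ||X_in @ W - X_out||^2 + ridge * ||W||^2, where
    X_in holds hidden states entering the pruned span and X_out the
    hidden states it produced on a small calibration set."""
    d = X_in.shape[1]
    A = X_in.T @ X_in + ridge * np.eye(d)
    B = X_in.T @ X_out
    return np.linalg.solve(A, B)            # (d, d) linear replacement

# Toy calibration data: pretend the pruned blocks were close to linear.
rng = np.random.default_rng(0)
X_in = rng.standard_normal((2048, 64))
true_map = np.eye(64) + 0.05 * rng.standard_normal((64, 64))
X_out = X_in @ true_map + 0.01 * rng.standard_normal((2048, 64))

W = estimate_replacement(X_in, X_out)
print(np.mean((X_in @ W - X_out) ** 2))     # small reconstruction error
```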
Authors:Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu
Abstract:
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
中文: 针对形式数学推理基准的不足,我们提出了FormalMATH这一大规模Lean4数据集,包含5,560个已验证问题,并通过高效的自动形式化流程降低了专家成本,同时揭示了当前AI定理证明器性能的显著缺陷及其在人类指导下的意外困境。
English: To tackle the limitations in formal mathematical reasoning benchmarks, we introduce FormalMATH, a large-scale Lean4 dataset with 5,560 verified problems and an efficient autoformalization pipeline that cuts expert costs while maintaining accuracy, revealing significant gaps in current AI theorem provers' performance and their unexpected struggles with human guidance.
Authors:Shiwei Guo, Ziang Chen, Yupeng Ma, Yunfei Han, Yi Wang
Abstract:
The Transformer model has shown strong performance in multivariate time series forecasting by leveraging channel-wise self-attention. However, this approach lacks temporal constraints when computing temporal features and does not utilize cumulative historical series effectively. To address these limitations, we propose the Structured Channel-wise Transformer with Cumulative Historical state (SCFormer). SCFormer introduces temporal constraints to all linear transformations, including the query, key, and value matrices, as well as the fully connected layers within the Transformer. Additionally, SCFormer employs High-order Polynomial Projection Operators (HiPPO) to deal with cumulative historical time series, allowing the model to incorporate information beyond the look-back window during prediction. Extensive experiments on multiple real-world datasets demonstrate that SCFormer significantly outperforms mainstream baselines, highlighting its effectiveness in enhancing time series forecasting. The code is publicly available at https://github.com/ShiweiGuo1995/SCFormer
Chinese: SCFormer模型通过在线性变换中引入时间约束并采用高阶多项式投影算子处理累积历史序列,有效提升了多元时间序列预测性能,在多个真实数据集上的实验表明其显著优于主流基线方法。
English: The SCFormer model enhances multivariate time series forecasting by incorporating temporal constraints into all linear transformations and using High-order Polynomial Projection Operators to effectively utilize cumulative historical data, significantly outperforming existing methods in experiments.
Authors:Hongze Li, Zesheng Zhou, Zhenbiao Cao, Xinhui Li, Wei Chen, Xiaojin Zhang
Abstract:
Traditional Federated Domain Generalization (FedDG) methods focus on learning domain-invariant features or adapting to unseen target domains, often overlooking the unique knowledge embedded within the source domain, especially in strictly isolated federated learning environments. Through experimentation, we discovered a counterintuitive phenomenon: features learned from a complete source domain have superior generalization capabilities compared to those learned directly from the target domain. This insight leads us to propose the Federated Source Domain Awareness Framework (FedSDAF), the first systematic approach to enhance FedDG by leveraging source domain-aware features. FedSDAF employs a dual-adapter architecture that decouples "local expertise" from "global generalization consensus". A Domain-Aware Adapter, retained locally, extracts and protects the unique discriminative knowledge of each source domain, while a Domain-Invariant Adapter, shared across clients, builds a robust global consensus. To enable knowledge exchange, we introduce a Bidirectional Knowledge Distillation mechanism that facilitates efficient dialogue between the adapters. Extensive experiments on four benchmark datasets (OfficeHome, PACS, VLCS, DomainNet) show that FedSDAF significantly outperforms existing FedDG methods. The source code is available at https://github.com/pizzareapers/FedSDAF.
中文摘要:传统联邦域泛化方法常忽视源域知识,但本研究发现源域特征比目标域特征具有更强的泛化能力,据此提出采用双向知识蒸馏的双适配器FedSDAF框架,在多个基准测试中显著优于现有方法。
English Summary: Traditional Federated Domain Generalization methods often neglect source domain knowledge, but this research reveals that source domain features actually generalize better than target domain features, leading to the innovative FedSDAF framework with dual adapters and bidirectional knowledge distillation that significantly outperforms existing methods.
Authors:Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Abstract:
Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
中文摘要:GVM-RAFT提出动态采样策略,通过监控接受率和梯度范数优化计算资源分配,在数学推理任务中相比传统RAFT实现2-4倍加速收敛与显著精度提升。
English Summary: GVM-RAFT introduces a dynamic sampling strategy that optimizes computational resource allocation by minimizing gradient variance, achieving 2-4x faster convergence and higher accuracy in mathematical reasoning tasks compared to standard RAFT.
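The abstract does not give GVM-RAFT's allocation rule explicitly, so the sketch below illustrates the general principle with a classical stand-in: minimizing the total variance sum_i sigma_i^2 / n_i under a budget sum_i n_i = B yields n_i proportional to sigma_i (Neyman allocation). Using acceptance rate and gradient norm as a proxy for sigma_i is an assumption made purely for illustration.

```python
import numpy as np

def allocate_budget(sigma, budget, n_min=1):
    """Neyman-style allocation: minimize sum_i sigma_i^2 / n_i subject to
    sum_i n_i = budget, giving n_i proportional to sigma_i. Here sigma_i
    is a per-prompt proxy for gradient-estimate noise."""
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-8)
    raw = budget * sigma / sigma.sum()
    n = np.maximum(np.floor(raw).astype(int), n_min)
    leftover = budget - n.sum()
    if leftover > 0:
        # hand any remaining budget to the noisiest prompts
        n[np.argsort(-sigma)[:leftover]] += 1
    return n

# Hard prompts (low acceptance rate, large gradient norm) get more samples.
accept_rate = np.array([0.9, 0.5, 0.1, 0.05])
grad_norm = np.array([0.2, 0.6, 1.0, 1.3])
sigma = grad_norm * np.sqrt(1.0 - accept_rate)
print(allocate_budget(sigma, budget=64))
```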
Authors:Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian
Abstract:
Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear what the performance of DeepSeek-R1 and V3 will be after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
中文摘要:本技术报告提出了一种名为DQ3_K_M的动态3位量化方法,能够在保持与4位量化相当性能的同时,实现DeepSeek模型在标准NVIDIA GPU设备上的单机部署。
English Summary: This technical report introduces a dynamic 3-bit quantization method called DQ3_K_M that enables single-machine deployment of DeepSeek models while maintaining performance comparable to 4-bit quantization across various benchmarks.
Authors:Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Abstract:
Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates the precise replication of real-world object diversity. Existing datasets demonstrate that this process heavily relies on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15366 objects spanning 831 fine-grained categories. Then, we introduce Scan2Sim, a robust multi-modal alignment model, which enables the automated, high-quality replacement of assets, thereby eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small item layouts for robotic manipulation and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: https://meta-scenes.github.io/.
中文摘要:MetaScenes通过基于真实扫描的大规模可模拟3D场景数据集和Scan2Sim自动资产替换模型,解决了艺术家驱动设计的可扩展性难题,为具身AI提供了增强泛化能力和仿真到现实应用的新可能。
English Summary: MetaScenes introduces a large-scale, simulatable 3D scene dataset from real-world scans and Scan2Sim, an automated asset replacement model, to overcome scalability challenges in artist-driven designs and enhance Embodied AI through improved generalization and sim-to-real applications.
Authors:Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Abstract:
Reward modeling is essential for aligning large language models with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RMs' interpretability and performance. To this end, we introduce a new class of generative reward models - Reasoning Reward Models (ReasRMs) - which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism - self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analyses to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
中文: 本文提出推理奖励模型(ReasRMs),通过引入推理能力和链式评分机制,结合两阶段训练流程,显著提升了奖励模型的性能和可解释性,在多个基准测试中达到最优水平,并以高达4.9%的优势超越更大规模的模型。
English: This paper introduces Reasoning Reward Models (ReasRMs), which enhance reward modeling by incorporating reasoning capabilities through a chain-of-rubrics mechanism and a two-stage training process, achieving state-of-the-art performance across benchmarks while outperforming larger models by up to 4.9%.
Authors:Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
Abstract:
Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.
中文: 本文提出了一种新颖的图像编辑指令优化方法,通过校正指令和构建对比监督信号来提升编辑效果,在显著减少训练数据和模型规模的同时,实现了优于现有方法的性能表现。
English: This paper introduces a novel approach to improve instruction-based image editing by constructing more effective editing instructions through rectification and contrastive supervision, achieving superior performance with significantly reduced training data and model size compared to existing methods.
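The contrastive supervision described above relies on a standard triplet objective; the sketch below shows a plain triplet margin loss over embedding vectors. How the anchor, positive (rectified instruction), and negative (contrastive instruction) embeddings are produced, and the margin value, are assumptions for illustration only.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: pull the anchor toward the positive
    (rectified instruction) and push it away from the negative
    (contrastive instruction) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
anchor = rng.standard_normal(128)
positive = anchor + 0.05 * rng.standard_normal(128)   # well-aligned instruction
negative = rng.standard_normal(128)                   # mismatched instruction
print(triplet_loss(anchor, positive, negative))
```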
Authors:Bobo Lian, Dandan Wang, Chenjian Wu, Minxin Chen
Abstract:
Point cloud surface representation is a fundamental problem in computer graphics and vision. This paper presents a machine learning approach for approximating the signed distance function (SDF) of a point cloud using a sparse ellipsoidal radial basis function network, enabling a compact and accurate surface representation. Given the SDF values defined on the grid points constructed from the point cloud, our method approximates the SDF accurately with as few ellipsoidal radial basis functions (ERBFs) as possible, i.e., represents the SDF of a point cloud by sparse ERBFs. To balance sparsity and approximation precision, a dynamic multi-objective optimization strategy is introduced, which adaptively adds the regularization terms and jointly optimizes the weights, centers, shapes, and orientations of ERBFs. To improve computational efficiency, a nearest-neighbor-based data structure is employed, restricting function calculations to points near each Gaussian kernel center. The computations for each kernel are further parallelized on CUDA, which significantly improves the optimization speed. Additionally, a hierarchical octree-based refinement strategy is designed for training. Specifically, the initialization and optimization of network parameters are conducted using coarse grid points in the octree lattice structure. Subsequently, fine lattice points are progressively incorporated to accelerate model convergence and enhance training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms previous sparse representation approaches in terms of accuracy, robustness, and computational efficiency. The corresponding executable program is publicly available at https://github.com/lianbobo/SE-RBFNet.git.
Chinese: 本文提出一种机器学习方法,利用稀疏椭球径向基函数网络精确高效地逼近点云的符号距离函数,在精度、鲁棒性和计算效率方面均优于现有稀疏表示方法。
English: This paper introduces a machine learning method that uses a sparse ellipsoidal radial basis function network to accurately and efficiently approximate the signed distance function of point clouds, achieving superior performance in accuracy, robustness, and computational efficiency compared to previous approaches.
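A minimal sketch of the representation itself is shown below: evaluating a sparse sum of ellipsoidal RBFs, each with its own weight, center, per-axis scales, and rotation. The parameterization is an illustrative choice; the paper's optimization strategy, octree refinement, and CUDA acceleration are not reproduced.

```python
import numpy as np

def erbf_sdf(points, centers, weights, scales, rotations, bias=0.0):
    """Evaluate f(x) = bias + sum_j w_j * exp(-||A_j (x - c_j)||^2), where
    A_j = diag(1/scales_j) @ R_j^T gives an anisotropic (ellipsoidal)
    kernel. Shapes: points (N,3), centers/scales (M,3),
    rotations (M,3,3), weights (M,)."""
    values = np.full(len(points), bias, dtype=float)
    for c, w, s, R in zip(centers, weights, scales, rotations):
        local = (points - c) @ R          # rotate into the kernel frame
        q = np.sum((local / s) ** 2, axis=1)
        values += w * np.exp(-q)
    return values

# Two-kernel toy example evaluated at a handful of query points.
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(5, 3))
centers = np.array([[0.0, 0.0, 0.0], [0.5, 0.2, -0.1]])
weights = np.array([1.0, -0.5])
scales = np.array([[0.3, 0.3, 0.6], [0.2, 0.4, 0.2]])
rotations = np.stack([np.eye(3), np.eye(3)])
print(erbf_sdf(points, centers, weights, scales, rotations))
```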
Authors:James Read, Ming-Yen Lee, Wei-Hsing Huang, Yuan-Chun Luo, Anni Lu, Shimeng Yu
Abstract:
The exponential growth of artificial intelligence (AI) applications has exposed the inefficiency of conventional von Neumann architectures, where frequent data transfers between compute units and memory create significant energy and latency bottlenecks. Analog Computing-in-Memory (ACIM) addresses this challenge by performing multiply-accumulate (MAC) operations directly in the memory arrays, substantially reducing data movement. However, designing robust ACIM accelerators requires accurate modeling of device- and circuit-level non-idealities. In this work, we present NeuroSim V1.5, introducing several key advances: (1) seamless integration with TensorRT's post-training quantization flow enabling support for more neural networks including transformers, (2) a flexible noise injection methodology built on pre-characterized statistical models, making it straightforward to incorporate data from SPICE simulations or silicon measurements, (3) expanded device support including emerging non-volatile capacitive memories, and (4) up to 6.5x faster runtime than NeuroSim V1.4 through optimized behavioral simulation. The combination of these capabilities uniquely enables systematic design space exploration across both accuracy and hardware efficiency metrics. Through multiple case studies, we demonstrate optimization of critical design parameters while maintaining network accuracy. By bridging high-fidelity noise modeling with efficient simulation, NeuroSim V1.5 advances the design and validation of next-generation ACIM accelerators. All NeuroSim versions are available open-source at https://github.com/neurosim/NeuroSim.
中文: NeuroSim V1.5 增强了模拟存内计算加速器的建模与仿真功能,支持高效的设计空间探索与优化,同时保持神经网络精度。
English: NeuroSim V1.5 introduces enhanced modeling and simulation capabilities for analog computing-in-memory accelerators, enabling efficient design space exploration and optimization while maintaining neural network accuracy.
Authors:Henry Ndubuaku, Mouad Talhi
Abstract:
Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
中文: 我们提出的方法用基于傅里叶变换的确定性标记生成和轻量级多层感知机替代传统嵌入层,以更少参数和更快训练实现同等性能,且无需使用dropout技术。
English: Our proposed method replaces traditional embedding layers with a deterministic Fourier-based token generation followed by a lightweight MLP, achieving competitive performance with fewer parameters and faster training while eliminating dropout requirements.
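The deterministic embedding is simple enough to sketch end to end: normalize token IDs, expand them with sin/cos features at several frequencies, and pass the result through a small MLP. The frequency schedule, normalization, and layer widths below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def fourier_embedding(token_ids, vocab_size, n_freqs=32):
    """Deterministic embedding of token ids: normalize ids to [0, 1) and
    expand with sin/cos at integer frequencies (illustrative schedule)."""
    t = np.asarray(token_ids, dtype=float)[:, None] / vocab_size    # (T, 1)
    k = np.arange(1, n_freqs + 1)[None, :]                          # (1, F)
    return np.concatenate([np.sin(2 * np.pi * k * t),
                           np.cos(2 * np.pi * k * t)], axis=-1)     # (T, 2F)

def mlp(x, W1, b1, W2, b2):
    """Tiny two-layer MLP capturing interactions between Fourier features."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
vocab, d_model, n_freqs = 30000, 64, 32
W1, b1 = 0.1 * rng.standard_normal((2 * n_freqs, 128)), np.zeros(128)
W2, b2 = 0.1 * rng.standard_normal((128, d_model)), np.zeros(d_model)

emb = mlp(fourier_embedding([5, 1024, 29999], vocab, n_freqs), W1, b1, W2, b2)
print(emb.shape)   # (3, 64): embeddings without a vocab-sized lookup table
```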
Authors:Xingyu Zheng, Yuye Li, Haoran Chu, Yue Feng, Xudong Ma, Jie Luo, Jinyang Guo, Haotong Qin, Michele Magno, Xianglong Liu
Abstract:
The Qwen series has emerged as a leading family of open-source Large Language Models (LLMs), demonstrating remarkable capabilities in natural language understanding tasks. With the recent release of Qwen3, which exhibits superior performance across diverse benchmarks, there is growing interest in deploying these models efficiently in resource-constrained environments. Low-bit quantization presents a promising solution, yet its impact on Qwen3's performance remains underexplored. This study conducts a systematic evaluation of Qwen3's robustness under various quantization settings, aiming to uncover both opportunities and challenges in compressing this state-of-the-art model. We rigorously assess 5 existing classic post-training quantization techniques applied to Qwen3, spanning bit-widths from 1 to 8 bits, and evaluate their effectiveness across multiple datasets. Our findings reveal that while Qwen3 maintains competitive performance at moderate bit-widths, it experiences notable degradation in linguistic tasks under ultra-low precision, underscoring the persistent hurdles in LLM compression. These results emphasize the need for further research to mitigate performance loss in extreme quantization scenarios. We anticipate that this empirical analysis will provide actionable insights for advancing quantization methods tailored to Qwen3 and future LLMs, ultimately enhancing their practicality without compromising accuracy. Our project is released on https://github.com/Efficient-ML/Qwen3-Quantization and https://huggingface.co/collections/Efficient-ML/qwen3-quantization-68164450decb1c868788cb2b.
Chinese: 本研究系统评估了Qwen3模型在不同低比特量化设置下的性能,发现在中等比特宽度下模型保持竞争力,但在超低精度场景中性能显著下降,这凸显了改进压缩技术的必要性。
English: This study systematically evaluates the Qwen3 model's performance under various low-bit quantization settings, revealing that while it maintains competitiveness at moderate bit-widths, it suffers significant degradation in ultra-low precision scenarios, highlighting the need for improved compression techniques.
Authors:Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
Abstract:
Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack this kind of reasoning capability or enforce Long Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social simulation. To address this, we propose an Adaptive Mode Learning (AML) framework in this paper, aiming to improve the adaptive thinking ability of language agents in dynamic social interactions. To this end, we first identify hierarchical thinking modes ranging from intuitive response to deep deliberation based on the cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to optimize the context-aware mode switching and reasoning. Our framework advances existing research in three key aspects: (1) Multi-granular thinking mode design, (2) Context-aware mode switching across social interaction, and (3) Token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence benchmarks verify that AML achieves 15.6% higher task performance than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter reasoning chains, demonstrating the advantage of adaptive thinking mode selection and optimization mechanism in AMPO over GRPO's fixed-depth solution.
中文: 本文提出自适应模式学习(AML)框架及自适应模式策略优化(AMPO)算法,通过多粒度思维模式设计和情境感知的模式切换机制,显著提升了语言代理在社交互动中的动态推理能力,实验证明其以更短推理链获得优于现有方法的性能表现。
English: This paper introduces an Adaptive Mode Learning (AML) framework with an Adaptive Mode Policy Optimization (AMPO) algorithm to enhance language agents' dynamic reasoning depth in social interactions, achieving superior performance over existing methods through context-aware mode switching and token-efficient processing.
Authors:Volodymyr Havrylov, Haiwen Huang, Dan Zhang, Andreas Geiger
Abstract:
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs' popularity grows, there is an increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines VFM features resolution. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves VFM features quality. The code is released at https://github.com/havrylovv/iSegProbe
中文: 视觉基础模型通过任务无关的特征上采样模块提升低分辨率特征,并以交互式分割为基准验证了合适的上采样策略能显著改善特征质量。
English: Vision Foundation Models (VFMs) are enhanced for dense prediction tasks by employing task-agnostic feature upsampling, with interactive segmentation serving as a benchmark to demonstrate that proper upsampling strategies significantly improve feature quality.
Authors:Yi Han
Abstract:
As time series classification (TSC) gains prominence, ensuring robust TSC models against adversarial attacks is crucial. While adversarial defense is well-studied in Computer Vision (CV), the TSC field has primarily relied on adversarial training (AT), which is computationally expensive. In this paper, five data augmentation-based defense methods tailored for time series are developed, with the most computationally intensive method among them increasing the computational resources by only 14.07% compared to the original TSC model. Moreover, the deployment process for these methods is straightforward. By leveraging these advantages of our methods, we create two combined methods. One of these methods is an ensemble of all the proposed techniques, which not only provides better defense performance than PGD-based AT but also enhances the generalization ability of TSC models. Moreover, the computational resources required for our ensemble are less than one-third of those required for PGD-based AT. These methods advance robust TSC in data mining. Furthermore, as foundation models are increasingly explored for time series feature learning, our work provides insights into integrating data augmentation-based adversarial defense with large-scale pre-trained models in future research.
Chinese: 本文针对时间序列分类提出了五种基于数据增强的高效防御方法,其中集成方法不仅防御性能优于基于PGD的对抗训练,且计算资源需求减少三分之二以上,同时增强了模型的泛化能力。
English: This paper introduces five computationally efficient data augmentation-based defense methods for time series classification, including an ensemble approach that outperforms PGD-based adversarial training with less than one-third of the computational cost while improving model generalization.
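The abstract does not name the five augmentations, so the following sketch only shows common lightweight time series augmentations of the kind such defenses typically build on (jittering, magnitude scaling, window slicing); whether these match the paper's specific methods is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Add small Gaussian noise to every timestep."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def magnitude_scale(x, sigma=0.1):
    """Scale the whole series by a random factor close to 1."""
    return x * rng.normal(1.0, sigma)

def window_slice(x, ratio=0.9):
    """Crop a random contiguous window and stretch it back to full length."""
    n = len(x)
    w = int(n * ratio)
    start = rng.integers(0, n - w + 1)
    window = x[start:start + w]
    return np.interp(np.linspace(0, w - 1, n), np.arange(w), window)

x = np.sin(np.linspace(0, 6 * np.pi, 128))
augmented = [jitter(x), magnitude_scale(x), window_slice(x)]
print([a.shape for a in augmented])
```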
Authors:Rui Lv, Zaixi Zhang, Kai Zhang, Qi Liu, Weibo Gao, Jiawei Liu, Jiaxia Yan, Linan Yue, Fangzhou Yao
Abstract:
Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform inference on downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in training, the in-context learning ability also deteriorates significantly. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the $k$-nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at https://github.com/karin0018/GraphPrompter.
中文摘要:GraphPrompter提出了一种多阶段自适应提示优化方法,通过生成、选择和增强图提示来提升图上下文学习能力,相比现有方法性能平均提升超过8%。
English Summary: GraphPrompter introduces a multi-stage adaptive prompt optimization method to enhance graph in-context learning by generating, selecting, and augmenting prompts, achieving over 8% performance improvement over existing methods.
Authors:Yancheng Chen, Wenguo Yang, Zhipeng Jiang
Abstract:
Wide & Deep, a simple yet effective learning architecture for recommendation systems developed by Google, has had a significant impact in both academia and industry due to its combination of the memorization ability of generalized linear models and the generalization ability of deep models. Graph convolutional networks (GCNs) remain dominant in node classification tasks; however, recent studies have highlighted issues such as heterophily and expressiveness, which focus on graph structure while seemingly neglecting the potential role of node features. In this paper, we propose a flexible framework GCNIII, which leverages the Wide & Deep architecture and incorporates three techniques: Intersect memory, Initial residual and Identity mapping. We provide comprehensive empirical evidence showing that GCNIII can more effectively balance the trade-off between over-fitting and over-generalization on various semi- and fully-supervised tasks. Additionally, we explore the use of large language models (LLMs) for node feature engineering to enhance the performance of GCNIII in cross-domain node classification tasks. Our implementation is available at https://github.com/CYCUCAS/GCNIII.
中文: 本文提出GCNIII框架,结合Wide & Deep架构与三项技术,有效平衡节点分类任务中的过拟合与泛化问题,并探索使用大语言模型优化特征工程。
English: This paper introduces GCNIII, a flexible framework that integrates the Wide & Deep architecture with three techniques to balance over-fitting and over-generalization in node classification tasks, while also exploring LLMs for feature enhancement.
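The "Initial residual" and "Identity mapping" techniques follow the style popularized by GCNII; below is a minimal numpy sketch of one propagation layer in that style, offered only as an illustration. GCNIII's Intersect memory component and its exact formulation are not reproduced here.

```python
import numpy as np

def gcn_layer_with_residual(P, H, H0, W, alpha=0.1, beta=0.5):
    """One propagation layer with an initial residual (mix in the input
    features H0) and identity mapping (mix the weight matrix with I),
    in the style of GCNII."""
    d = W.shape[0]
    support = (1 - alpha) * (P @ H) + alpha * H0        # initial residual
    transform = (1 - beta) * np.eye(d) + beta * W        # identity mapping
    return np.maximum(support @ transform, 0.0)          # ReLU

# Toy graph: 4 nodes, row-normalized adjacency with self-loops.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
rng = np.random.default_rng(0)
H0 = rng.standard_normal((4, 8))
W = 0.1 * rng.standard_normal((8, 8))
print(gcn_layer_with_residual(P, H0, H0, W).shape)   # (4, 8)
```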
Authors:Zeyuan Ma, Zhiguang Cao, Zhou Jiang, Hongshu Guo, Yue-Jiao Gong
Abstract:
Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigm used in existing works makes MetaBBO inefficient. To address this, we propose an offline learning-based MetaBBO framework in this paper, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform the DAC task into a long-sequence decision process. This allows us to further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn the DAC policy from offline data: we first propose a novel collection strategy for constructing an offline DAC experience dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve the offline learning efficiency, we equip our framework with a Mamba architecture, whose selective state-space model and hardware-aware parallel scan improve long-sequence learning effectiveness and efficiency, respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide the source code of Q-Mamba at https://github.com/MetaEvo/Q-Mamba.
中文摘要:本文提出Q-Mamba离线元黑盒优化框架,通过将动态算法配置转化为长序列决策过程,并采用平衡探索的数据收集策略、分解Q学习机制及Mamba架构,在保证性能的同时显著提升了训练效率。
English Summary: The paper introduces Q-Mamba, an offline MetaBBO framework that enhances both effectiveness and efficiency by transforming dynamic algorithm configuration into a long-sequence decision process and incorporating novel designs including a balanced dataset collection strategy, decomposition-based Q-learning, and Mamba architecture for improved learning.
Authors:Leyi Yan, Linda Wang, Sihang Liu, Yi Ding
Abstract:
Carbon intensity (CI) measures the average carbon emissions generated per unit of electricity, making it a crucial metric for quantifying and managing the environmental impact. Accurate CI predictions are vital for minimizing carbon footprints, yet the state-of-the-art method (CarbonCast) falls short due to its inability to address regional variability and lack of adaptability. To address these limitations, we introduce EnsembleCI, an adaptive, end-to-end ensemble learning-based approach for CI forecasting. EnsembleCI combines weighted predictions from multiple sublearners, offering enhanced flexibility and regional adaptability. In evaluations across 11 regional grids, EnsembleCI consistently surpasses CarbonCast, achieving the lowest mean absolute percentage error (MAPE) in almost all grids and improving prediction accuracy by an average of 19.58%. While performance still varies across grids due to inherent regional diversity, EnsembleCI reduces variability and exhibits greater robustness in long-term forecasting compared to CarbonCast and identifies region-specific key features, underscoring its interpretability and practical relevance. These findings position EnsembleCI as a more accurate and reliable solution for CI forecasting. EnsembleCI source code and data used in this paper are available at https://github.com/emmayly/EnsembleCI.
中文: EnsembleCI是一种自适应集成学习方法,相比CarbonCast将碳强度预测准确率平均提高19.58%,在区域适应性和鲁棒性方面表现更优。
English: EnsembleCI is an adaptive ensemble learning method that improves carbon intensity forecasting accuracy by 19.58% on average and demonstrates superior regional adaptability and robustness compared to CarbonCast.
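The abstract says EnsembleCI combines weighted predictions from multiple sublearners but does not give the weighting rule; the sketch below uses inverse-validation-MAPE weights as one plausible, purely illustrative choice.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def ensemble_forecast(test_preds, val_true, val_preds):
    """Combine sublearner forecasts with weights derived from validation
    MAPE (inverse-error weighting is an illustrative choice)."""
    errors = np.array([mape(val_true, vp) for vp in val_preds])
    weights = (1.0 / errors) / np.sum(1.0 / errors)
    return weights, np.tensordot(weights, np.stack(test_preds), axes=1)

# Three toy sublearners forecasting carbon intensity (gCO2/kWh).
val_true = np.array([420.0, 410.0, 450.0, 470.0])
val_preds = [val_true * 1.05, val_true * 0.98, val_true + 30.0]
test_preds = [np.array([430.0, 440.0]), np.array([425.0, 436.0]), np.array([455.0, 470.0])]
w, combined = ensemble_forecast(test_preds, val_true, val_preds)
print(w.round(3), combined.round(1))
```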
Authors:Siddharth Kothari, Srinivasan Murali, Sankalp Kothari, Ujjwal Verma, Jaya Sreevalsan-Nair
Abstract:
Inland water body segmentation from Synthetic Aperture Radar (SAR) images is an important task needed for several applications, such as flood mapping. While SAR sensors capture data in all-weather conditions as high-resolution images, differentiating water and water-like surfaces from SAR images is not straightforward. Inland water bodies, such as large river basins, have complex geometry, which adds to the challenge of segmentation. U-Net is a widely used deep learning model for land-water segmentation of SAR images. In practice, manual annotation is often used to generate the corresponding water masks as ground truth. Manual annotation of the images is prone to label noise owing to data poisoning attacks, especially due to complex geometry. In this work, we simulate manual errors in the form of adversarial attacks on the U-Net model and study the robustness of the model to human errors in annotation. Our results indicate that U-Net can tolerate a certain level of corruption before its performance drops significantly. This finding highlights the crucial role that the quality of manual annotations plays in determining the effectiveness of the segmentation model. The code and the new dataset, along with adversarial examples for robust training, are publicly available. (GitHub link - https://github.com/GVCL/IWSeg-SAR-Poison.git)
中文摘要:本研究通过对抗性攻击模拟人工标注错误,探讨了U-Net模型在SAR图像内陆水域分割中的鲁棒性,发现模型虽能承受一定程度的数据污染,但标注质量对分割效果具有决定性影响。
English Summary: This study investigates the robustness of the U-Net model for inland water segmentation in SAR images against simulated human annotation errors introduced through adversarial attacks, finding that while the model can tolerate some corruption, annotation quality critically impacts segmentation performance.
Authors:Yize Jiang, Xinze Li, Yuanyuan Zhang, Jin Han, Youjun Xu, Ayush Pandit, Zaixi Zhang, Mengdi Wang, Mengyang Wang, Chong Liu, Guang Yang, Yejin Choi, Wu-Jun Li, Tianfan Fu, Fang Wu, Junhong Liu
Abstract:
Existing protein-ligand docking studies typically focus on the self-docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, we first curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; second, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real-time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting modeling on stereochemistry improves the structural plausibility markedly. (4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods, in future modeling efforts. The code, dataset, and leaderboard are released at https://github.com/CataAI/PoseX.
中文: PoseX是一个开源基准测试,旨在评估自对接和交叉对接方法,整合了新颖数据集和23种不同对接方法,关键发现显示AI方法优于物理方法,并通过物理后处理显著提升性能。
English: PoseX is an open-source benchmark designed to evaluate both self-docking and cross-docking methods, incorporating a novel dataset and 23 diverse docking approaches, with key findings showing AI methods outperform physics-based ones and benefit from physics-based post-processing.
Authors:Jianxing Qin, Jingrong Chen, Xinhao Kong, Yongji Wu, Tianjun Yuan, Liang Luo, Zhaodong Wang, Ying Zhang, Tingjun Chen, Alvin R. Lebeck, Danyang Zhuo
Abstract:
Modern machine learning (ML) training workloads place substantial demands on both computational and communication resources. Consequently, accurate performance estimation has become increasingly critical for guiding system design decisions, such as the selection of parallelization strategies, cluster configurations, and hardware provisioning. Existing simulation-based performance estimation requires reimplementing the ML framework in a simulator, which demands significant manual effort and is hard to maintain as ML frameworks evolve rapidly. This paper introduces Phantora, a hybrid GPU cluster simulator designed for performance estimation of ML training workloads. Phantora executes unmodified ML frameworks as is within a distributed, containerized environment. Each container emulates the behavior of a GPU server in a large-scale cluster, while Phantora intercepts and simulates GPU- and communication-related operations to provide high-fidelity performance estimation. We call this approach hybrid simulation of ML systems, in contrast to traditional methods that simulate static workloads. The primary advantage of hybrid simulation is that it allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Our evaluation shows that Phantora provides accuracy comparable to static workload simulation while supporting three state-of-the-art LLM training frameworks out-of-the-box. In addition, Phantora operates on a single GPU, eliminating the need for the resource-intensive trace collection and workload extraction steps required by traditional trace-based simulators. Phantora is open-sourced at https://github.com/QDelta/Phantora.
Chinese: Phantora是一种混合GPU集群模拟器,通过在容器化环境中直接运行未经修改的机器学习框架,为训练工作负载提供高精度性能评估,无需重新实现框架代码,且能在单GPU上支持最新的大语言模型训练框架。
English: Phantora is a hybrid GPU cluster simulator that enables high-fidelity performance estimation for ML training workloads by executing unmodified ML frameworks in a containerized environment, eliminating the need for reimplementation and supporting state-of-the-art frameworks with minimal resource requirements.
Authors:Abdalwahab Almajed, Maryam Tabar, Peyman Najafirad
Abstract:
As a basic human need, housing plays a key role in enhancing health, well-being, and educational outcome in society, and the housing market is a major factor for promoting quality of life and ensuring social equity. To improve the housing conditions, there has been extensive research on building Machine Learning (ML)-driven house price prediction solutions to accurately forecast the future conditions, and help inform actions and policies in the field. In spite of their success in developing high-accuracy models, there is a gap in our understanding of the extent to which various ML-driven house price prediction approaches show ethnic and/or racial bias, which in turn is essential for the responsible use of ML, and ensuring that the ML-driven solutions do not exacerbate inequity. To fill this gap, this paper develops several ML models from a combination of structural and neighborhood-level attributes, and conducts comprehensive assessments on the fairness of ML models under various definitions of privileged groups. As a result, it finds that the ML-driven house price prediction models show various levels of bias towards protected attributes (i.e., race and ethnicity in this study). Then, it investigates the performance of different bias mitigation solutions, and the experimental results show their various levels of effectiveness on different ML-driven methods. However, in general, the in-processing bias mitigation approach tends to be more effective than the pre-processing one in this problem domain. Our code is available at https://github.com/wahab1412/housing_fairness.
Chinese: 本研究通过开发揭示种族和民族相关偏见的机器学习房价预测模型,评估了不同偏见缓解方法的有效性,发现在该问题领域中,处理中方法通常比预处理方法更为有效。
English: This study addresses the fairness gap in machine learning-driven house price prediction by developing models that reveal biases related to race and ethnicity, and evaluating the effectiveness of different bias mitigation approaches, with in-processing methods proving generally more effective than pre-processing ones.
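Because house price prediction is a regression task, fairness here can be probed with simple group-level diagnostics; the sketch below computes a group error gap and a demographic-parity-style disparate impact ratio on thresholded predictions. These particular metric choices, thresholds, and synthetic data are assumptions for illustration and may differ from the paper's fairness definitions.

```python
import numpy as np

def group_error_gap(y_true, y_pred, group):
    """Difference in mean absolute error between two groups (coded 0/1)."""
    err = np.abs(y_true - y_pred)
    return err[group == 1].mean() - err[group == 0].mean()

def disparate_impact(y_pred, group, threshold):
    """Ratio of 'favorable' outcome rates (prediction above threshold)
    for group 1 vs. group 0; values far from 1 indicate disparity."""
    fav1 = (y_pred[group == 1] > threshold).mean()
    fav0 = (y_pred[group == 0] > threshold).mean()
    return fav1 / max(fav0, 1e-8)

# Synthetic example with a deliberately biased predictor.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=500)
y_true = 300_000 + 50_000 * rng.standard_normal(500)
y_pred = y_true + 10_000 * rng.standard_normal(500) + 8_000 * group
print(group_error_gap(y_true, y_pred, group), disparate_impact(y_pred, group, 320_000))
```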
Authors:Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber
Abstract:
Synthetic video generation has gained significant attention for its realism and broad applications, but remains prone to violations of common sense and physical laws. This highlights the need for reliable abnormality detectors that understand such principles and are robust to hallucinations. To address this, we introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling, paired with expert-crafted counterintuitive QA to evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. VideoHallu evaluates MLLMs' abnormality detection abilities with examples across alignment, consistency, commonsense, and physics. We benchmark SOTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen2.5-VL, Video-R1, and VideoChat-R1. We observe that these models perform well on many real-world benchmarks like MVBench and MovieChat, but still struggle with basic physics-based and commonsense reasoning in synthetic videos. We further show that post-training with Group Relative Policy Optimization (GRPO), using curriculum learning on datasets combining video QA with counterintuitive commonsense and physics reasoning over real and synthetic videos, improves MLLMs' abnormality detection and critical thinking, demonstrating the value of targeted training for improving their understanding of commonsense and physical laws. Our code is available at https://github.com/zli12321/VideoHallu.git.
中文: VideoHallu基准通过合成视频异常评估多模态大模型的批判性思维能力,发现尽管在真实场景表现优异,模型仍难以处理常识和物理推理,但采用GRPO的针对性训练能显著提升这方面的理解能力。
English: The VideoHallu benchmark evaluates MLLMs' critical thinking on synthetic video abnormalities, revealing their struggles with commonsense and physics reasoning despite strong real-world performance, while demonstrating that targeted training with GRPO significantly enhances these capabilities.
Authors:Chaoyi Wang, Junjie Zheng, Zihao Chen, Shiyu Xia, Chaofan Ding, Xiaohao Zhang, Xi Tao, Xiaoming He, Xinhan Di
Abstract:
Movie dubbing has advanced significantly, yet assessing the real-world effectiveness of these models remains challenging. A comprehensive evaluation benchmark is crucial for two key reasons: 1) Existing metrics fail to fully capture the complexities of dialogue, narration, monologue, and actor adaptability in movie dubbing. 2) A practical evaluation system should offer valuable insights to improve movie dubbing quality and advancement in film production. To this end, we introduce Talking Adaptive Dubbing Benchmarks (TA-Dubbing), designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing. TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation. 2) Versatile Benchmarking: TA-Dubbing is designed to evaluate state-of-the-art movie dubbing models and advanced multi-modal large language models. 3) Full Open-Sourcing: We fully open-source TA-Dubbing at https://github.com/woka-0a/DeepDubber-V1, including all video suites, evaluation methods, and annotations. We also continuously integrate new movie dubbing models into the TA-Dubbing leaderboard at https://github.com/woka-0a/DeepDubber-V1 to drive forward the field of movie dubbing.
Chinese: TA-Dubbing基准通过全面评估电影配音中的对话、旁白、独白和演员适应性,弥补现有指标的不足,并借助开源工具和持续模型集成推动电影制作质量的提升。
English: The TA-Dubbing benchmark is introduced to address the limitations of existing metrics by comprehensively evaluating movie dubbing across dialogue, narration, monologue, and actor adaptability, aiming to enhance film production through open-source tools and continuous model integration.
Authors:Carlo Siebenschuh, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Arham Khan, Khalid Hossain, Yadu Babuji, Nicholas Chia, Venkatram Vishwanath, Rick Stevens, Arvind Ramanathan, Ian Foster, Robert Underwood
Abstract:
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
中文摘要:AdaParse是一种自适应PDF解析引擎,通过整合人类偏好和优化资源分配,为每份科学文献智能匹配最适合的解析器,在大规模解析任务中实现17倍吞吐量提升并保持相当的准确率。
English Summary: AdaParse is an adaptive engine that efficiently assigns the most suitable PDF parser to each scientific document by incorporating human preferences and optimizing resource allocation, achieving 17 times higher throughput with comparable accuracy for large-scale parsing.
Authors:Rahuul Rangaraj, Jimeng Shi, Azam Shirali, Rajendra Paudel, Yanzhao Wu, Giri Narasimhan
Abstract:
The Everglades play a crucial role in flood and drought regulation, water resource planning, and ecosystem management in the surrounding regions. However, traditional physics-based and statistical methods for predicting water levels often face significant challenges, including high computational costs and limited adaptability to diverse or unforeseen conditions. Recent advancements in large time series models have demonstrated the potential to address these limitations, with state-of-the-art deep learning and foundation models achieving remarkable success in time series forecasting across various domains. Despite this progress, their application to critical environmental systems, such as the Everglades, remains underexplored. In this study, we fill the gap by investigating twelve task-specific models and five time series foundation models across six categories for a real-world application focused on water level prediction in the Everglades. Our primary results show that the foundation model Chronos significantly outperforms all other models while the remaining foundation models exhibit relatively poor performance. We also noticed that the performance of task-specific models varies with the model architectures, and discussed the possible reasons. We hope our study and findings will inspire the community to explore the applicability of large time series models in hydrological applications. The code and data are available at https://github.com/rahuul2992000/Everglades-Benchmark.
中文: 本研究评估了用于大沼泽地水位预测的先进时间序列模型,发现Chronos基础模型表现显著优于其他模型,同时强调了在水利学中进一步探索大型模型的必要性。
English: This study evaluates advanced time series models for water level prediction in the Everglades, finding that the Chronos foundation model significantly outperforms others, while highlighting the need for further exploration of large models in hydrology.
Authors:Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
Abstract:
The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: \url{https://github.com/SPIN-UMass/VidStamp}
中文摘要:VIDSTAMP是一种新型水印框架,通过在视频扩散模型的潜在空间中直接嵌入高容量水印,在保持与未加水印视频相当的视觉质量的同时,实现了对篡改的更强鲁棒性。
English Summary: VIDSTAMP is a novel watermarking framework that embeds high-capacity watermarks directly into video diffusion models' latent space, achieving superior robustness against tampering while maintaining visual quality comparable to unwatermarked videos.
Authors:Yang Jin, Jun Lv, Wenye Yu, Hongjie Fang, Yong-Lu Li, Cewu Lu
Abstract:
Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at https://ericjin2002.github.io/SIME/
中文: 机器人通过模态级探索和数据选择实现有效自我提升,产生多样化互动并学习高质量试验,从而开发出更稳健、高成功率的低成本控制策略。
English: Effective robot self-improvement is achieved through modal-level exploration and data selection, enabling diverse interactions and learning from high-quality trials to develop robust, cost-efficient control strategies.
Authors:Fahong Zhang, Yilei Shi, Xiao Xiang Zhu
Abstract:
This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer-based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity-aware polygon simplification module simplifies these refined polylines and generates the final polygon representation. This module employs a dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity-aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity-aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas-Peucker algorithm. This finding underscores the broad applicability of GCP. The code for the proposed method will be made available at https://github.com/zhu-xlab.
中文: 本文提出了一种新颖的全局共线性感知多边形化算法(GCP),通过基于变换器的回归模块和动态规划技术优化遥感图像中的建筑物轮廓,最终生成精确的多边形表示。
English: This paper introduces the Global Collinearity-aware Polygonizer (GCP), a novel algorithm that refines and simplifies building contours from remote sensing images using transformer-based regression and dynamic programming to produce accurate polygonal representations.
Authors:Yajuan Zhang, Jiahai Jiang, Yule Yan, Liang Yang, Ping Zhang
Abstract:
Accurate wind power forecasting can help formulate sound dispatch plans, which is of great significance for maintaining the safety, stability, and efficient operation of the power system. In recent years, wind power forecasting methods based on deep learning have focused on extracting the spatiotemporal correlations among data, achieving significant improvements in forecasting accuracy. However, they exhibit two limitations. First, there is a lack of modeling for the inter-variable relationships, which limits the accuracy of the forecasts. Second, treating endogenous and exogenous variables equally leads to unnecessary interactions between them, increasing the complexity of the model. In this paper, we propose the 2DXformer, which builds upon previous work's focus on spatiotemporal correlations and addresses the two limitations above. Specifically, we classify the inputs of the model into three types: exogenous static variables, exogenous dynamic variables, and endogenous variables. First, we embed these variables as variable tokens in a channel-independent manner. Then, we use the attention mechanism to capture the correlations among exogenous variables. Finally, we employ a multi-layer perceptron with residual connections to model the impact of exogenous variables on endogenous variables. Experimental results on two real-world large-scale datasets indicate that our proposed 2DXformer can further improve the performance of wind power forecasting. The code is available in this repository: \href{https://github.com/jseaj/2DXformer}{https://github.com/jseaj/2DXformer}.
中文: 本文提出的2DXformer模型通过将输入变量分类并采用注意力机制,解决了现有方法在变量间关系建模方面的不足,有效提升了风电功率预测的准确性。
English: The proposed 2DXformer model enhances wind power forecasting by addressing limitations in inter-variable relationship modeling and unnecessary interactions between endogenous and exogenous variables, achieving improved performance through specialized variable classification and attention mechanisms.
Authors:Vladimir Somers, Baptiste Standaert, Victor Joos, Alexandre Alahi, Christophe De Vleeschouwer
Abstract:
Online multi-object tracking has been recently dominated by tracking-by-detection (TbD) methods, where recent advances rely on increasingly sophisticated heuristics for tracklet representation, feature fusion, and multi-stage matching. The key strength of TbD lies in its modular design, enabling the integration of specialized off-the-shelf models like motion predictors and re-identification. However, the extensive usage of human-crafted rules for temporal associations makes these methods inherently limited in their ability to capture the complex interplay between various tracking cues. In this work, we introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation, that learns resilient association strategies directly from data, breaking free from hand-crafted heuristics while maintaining TbD's valuable modularity. At its core, CAMEL employs two transformer-based modules and relies on a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks. Our code is available at https://github.com/TrackingLaboratory/CAMELTrack.
中文: CAMEL提出了一种数据驱动的关联模块,通过直接学习数据中的关联策略摆脱人工启发式规则,同时保持检测跟踪的模块化优势,在多目标在线跟踪中实现了最先进的性能。
English: CAMEL introduces a data-driven association module that learns resilient tracking strategies without hand-crafted heuristics while preserving the modularity of tracking-by-detection, achieving state-of-the-art performance in online multi-object tracking.
Authors:Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, Dacheng Tao
Abstract:
Large language models (LLMs) have achieved impressive performance across various domains. However, the substantial hardware resources required for their training present a significant barrier to efficiency and scalability. To mitigate this challenge, low-precision training techniques have been widely adopted, leading to notable advancements in training efficiency. Despite these gains, low-precision training involves several components$\unicode{x2013}$such as weights, activations, and gradients$\unicode{x2013}$each of which can be represented in different numerical formats. The resulting diversity has created a fragmented landscape in low-precision training research, making it difficult for researchers to gain a unified overview of the field. This survey provides a comprehensive review of existing low-precision training methods. To systematically organize these approaches, we categorize them into three primary groups based on their underlying numerical formats, which is a key factor influencing hardware compatibility, computational efficiency, and ease of reference for readers. The categories are: (1) fixed-point and integer-based methods, (2) floating-point-based methods, and (3) customized format-based methods. Additionally, we discuss quantization-aware training approaches, which share key similarities with low-precision training during forward propagation. Finally, we highlight several promising research directions to advance this field. A collection of papers discussed in this survey is provided in https://github.com/Hao840/Awesome-Low-Precision-Training.
中文: 大语言模型面临硬件效率挑战,促使采用低精度训练方法,按定点、浮点和自定义数值格式分类,以提升可扩展性并统一研究进展。
English: Large language models face hardware efficiency challenges, leading to the adoption of low-precision training methods categorized by numerical formats—fixed-point, floating-point, and customized—to enhance scalability and unify research efforts.
Authors:Lokesh Nagalapatti, Ashutosh Srivastava, Sunita Sarawagi, Amit Sharma
Abstract:
Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today's cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria: 1) **Anomaly:** root cause nodes should take on anomalous values; 2) **Fix:** had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. We present a theoretical analysis comparing and bounding the errors in assessing the fix condition using interventional and counterfactual estimates. We then conduct experiments by systematically varying the SCM's complexity to demonstrate the cases where IDI's interventional approach outperforms the counterfactual approach and vice versa. Experiments on both synthetic and PetShop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. Code is released at https://github.com/nlokeshiisc/IDI_release.
中文摘要:IDI算法通过采用分布内干预方法评估异常和修复条件,有效识别复杂系统中的根本原因,在准确性和鲁棒性方面优于现有方法。
English Summary: The IDI algorithm effectively identifies root causes in complex systems by evaluating anomaly and fix conditions using reliable in-distribution interventions, outperforming existing methods in accuracy and robustness.
Authors:Ihab Tabbara, Hussein Sibai
Abstract:
Safety filters, particularly those based on control barrier functions, have gained increased interest as effective tools for safe control of dynamical systems. Existing correct-by-construction synthesis algorithms for such filters, however, suffer from the curse-of-dimensionality. Deep learning approaches have been proposed in recent years to address this challenge. In this paper, we add to this set of approaches an algorithm for training neural control barrier functions from offline datasets. Such functions can be used to design constraints for quadratic programs that are then used as safety filters. Our algorithm trains these functions so that the system is not only prevented from reaching unsafe states but is also disincentivized from reaching out-of-distribution ones, at which they would be less reliable. It is inspired by Conservative Q-learning, an offline reinforcement learning algorithm. We call its outputs Conservative Control Barrier Functions (CCBFs). Our empirical results demonstrate that CCBFs outperform existing methods in maintaining safety while minimally affecting task performance. Source code is available at https://github.com/tabz23/CCBF.
中文: 本文提出保守控制屏障函数(CCBFs),一种基于离线数据训练的深度学习方法,通过阻止系统进入不安全及分布外状态来增强安全过滤器,在安全性和任务性能上优于现有方法。
English: This paper introduces Conservative Control Barrier Functions (CCBFs), a deep learning approach trained from offline data to enhance safety filters by preventing systems from reaching both unsafe and out-of-distribution states, outperforming existing methods in safety and task performance.
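Such learned barrier functions are typically deployed inside a quadratic-program safety filter, as the abstract notes. The sketch below is an illustrative, minimal version of that filtering step, assuming a control-affine system x_dot = f(x) + g(x)u and a trained barrier function h; the function names and the cvxpy-based solver are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np
import cvxpy as cp

def cbf_qp_filter(u_nom, x, h, grad_h, f, g, alpha=1.0):
    """One step of a control barrier function (CBF) safety filter.

    Returns the control closest to the nominal command u_nom that keeps the
    barrier condition dh/dt >= -alpha * h(x) satisfied for the control-affine
    dynamics x_dot = f(x) + g(x) u.
    """
    m = u_nom.shape[0]
    u = cp.Variable(m)
    dh = grad_h(x)                       # dh/dx, shape (n,)
    lie_f = dh @ f(x)                    # L_f h(x), scalar
    lie_g = dh @ g(x)                    # L_g h(x), shape (m,)
    constraints = [lie_f + lie_g @ u >= -alpha * h(x)]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), constraints)
    prob.solve()
    return u.value

# Toy single-integrator example: keep the state in the half-space x[0] <= 1.
h      = lambda x: 1.0 - x[0]             # h(x) >= 0 defines the safe set
grad_h = lambda x: np.array([-1.0, 0.0])
f      = lambda x: np.zeros(2)
g      = lambda x: np.eye(2)
u_safe = cbf_qp_filter(np.array([2.0, 0.0]), np.array([0.9, 0.0]), h, grad_h, f, g)
print(u_safe)  # the x-component is clipped so the trajectory cannot leave the safe set
```

In the CCBF setting the same filter would simply use the conservatively trained barrier network in place of the hand-written `h` above.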
Authors:Junwon Seo, Kensuke Nakamura, Andrea Bajcsy
Abstract:
Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out-of-distribution (OOD) situations as safe. To address this, we introduce an uncertainty-aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model's epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space, spanning both the latent representation and the epistemic uncertainty, we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our uncertainty-aware safety filter preemptively detects potential unsafe scenarios and reliably proposes safe, in-distribution actions. Video results can be found on the project website at https://cmu-intentlab.github.io/UNISafe
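中文: 本文提出一种不确定性感知的潜在空间安全过滤器,将世界模型的认知不确定性作为未见危险的代理,并通过共形预测校准不确定性阈值,使机器人能够可靠地规避已知与未知的失效场景。
English: This paper introduces an uncertainty-aware latent safety filter that treats the world model's epistemic uncertainty, calibrated via conformal prediction, as a proxy for unseen hazards, enabling robots to reliably avoid both known and unseen failures.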
Authors:Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. These human-agent collaboration systems enable humans and LLM-based agents to collaborate effectively by leveraging their complementary strengths. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities arising from human-AI collaboration. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
Chinese: 本文首次系统综述了基于大语言模型的人机协作系统,通过整合人类输入来提升系统性能、可靠性和安全性,同时应对自主智能体存在的幻觉与伦理风险等挑战。
English: This paper presents the first comprehensive survey of LLM-based human-agent systems (LLM-HAS), which integrate human input to enhance performance, reliability, and safety while addressing challenges like hallucinations and ethical risks in autonomous agents.
Authors:Marius-Constantin Dinu
Abstract:
This paper presents a novel primality test based on the eigenvalue structure of circulant matrices constructed from roots of unity. We prove that an integer $n > 2$ is prime if and only if the minimal polynomial of the circulant matrix $C_n = W_n + W_n^2$ has exactly two irreducible factors over $\mathbb{Q}$. This characterization connects cyclotomic field theory with matrix algebra, providing both theoretical insights and practical applications. We demonstrate that the eigenvalue patterns of these matrices reveal fundamental distinctions between prime and composite numbers, leading to a deterministic primality test. Our approach leverages the relationship between primitive roots of unity, Galois theory, and the factorization of cyclotomic polynomials. We provide comprehensive experimental validation across various ranges of integers, discuss practical implementation considerations, and analyze the computational complexity of our method in comparison with established primality tests. The visual interpretation of our mathematical framework provides intuitive understanding of the algebraic structures that distinguish prime numbers. Our experimental validation demonstrates that our approach offers a deterministic alternative to existing methods, with performance characteristics reflecting its algebraic foundations.
中文: 本文提出了一种基于单位根构造的循环矩阵特征值结构的素数判定方法,证明了整数为素数的充要条件是该矩阵的最小多项式在有理数域上恰好有两个不可约因子。
English: This paper introduces a deterministic primality test by analyzing the eigenvalue patterns of circulant matrices derived from roots of unity, establishing that an integer is prime precisely when the matrix's minimal polynomial has exactly two irreducible factors over the rationals.
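The stated criterion can be checked directly with a computer algebra system. The sketch below is illustrative only (not the paper's implementation); it relies on the standard fact that the minimal and characteristic polynomials of a matrix share the same set of irreducible factors, so counting the distinct irreducible factors of det(xI - C_n) over Q is equivalent to counting those of the minimal polynomial.

```python
import sympy as sp

def is_prime_via_circulant(n: int) -> bool:
    """Check primality of n > 2 via the factor count of C_n = W_n + W_n^2."""
    # W: n x n cyclic shift matrix, the generator of circulant matrices.
    W = sp.Matrix(n, n, lambda i, j: 1 if j == (i + 1) % n else 0)
    C = W + W * W
    x = sp.symbols('x')
    char_poly = C.charpoly(x).as_expr()
    # The minimal polynomial has exactly the distinct irreducible factors of
    # the characteristic polynomial, so counting them suffices.
    _, factors = sp.factor_list(char_poly, x)
    return len(factors) == 2

print([m for m in range(3, 20) if is_prime_via_circulant(m)])
# Expected to print 3, 5, 7, 11, 13, 17, 19 if the characterization holds.
```

For a quick sanity check, n = 5 yields the factors (x - 2) and an irreducible quartic, while n = 4 yields three factors: (x - 2), x, and (x^2 + 2x + 2).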
Authors:Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Abstract:
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
中文: 本文提出T2I-R1模型,通过结合双层思维链推理与强化学习,在语义规划和像素处理层面优化文本到图像生成,显著提升了生成性能并超越了现有先进模型。
English: This paper introduces T2I-R1, a reasoning-enhanced text-to-image model that integrates a bi-level chain-of-thought process with reinforcement learning to optimize both semantic planning and pixel processing, achieving significant performance improvements over baseline and state-of-the-art models.
Authors:Tiange Luo, Lajanugen Logeswaran, Justin Johnson, Honglak Lee
Abstract:
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to choose effectively among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+\% on ScreenSpot-Pro and 24+\% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6\% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at https://github.com/tiangeluo/RegionFocus.
中文摘要:RegionFocus是一种视觉测试时缩放方法,通过动态聚焦相关区域并采用图像作为地图的机制,显著提升了视觉语言模型代理的性能,在基准测试中实现了超过24%的性能提升,并以61.6%的准确率创下了新的最先进接地性能记录。
English Summary: RegionFocus is a visual test-time scaling method that enhances Vision Language Model Agents by dynamically zooming into relevant regions and using an image-as-map mechanism, achieving significant performance improvements of over 24% on benchmarks and setting a new state-of-the-art grounding accuracy of 61.6%.
Authors:Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand
Abstract:
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks provide only outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file\#minerva.
Chinese: MINERVA数据集通过提供带有详细推理路径的复杂问题,弥补了现有视频基准仅关注结果监督的不足,旨在检验多模态模型是否真正具备结合感知与时间信息进行视频推理的能力。
English: The MINERVA dataset addresses the limitations of current video benchmarks by providing detailed reasoning traces alongside complex questions, challenging multimodal models to demonstrate genuine perceptual and temporal reasoning rather than relying on superficial cues.
Authors:Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
Abstract:
As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
Chinese: 本研究提出一个两阶段框架,通过先使用精细化分步评论进行监督微调,再结合强化学习,显著提升大语言模型的数学批判能力,最终模型在错误识别和纠错反馈方面均优于现有评论模型。
English: This study introduces a two-stage framework to enhance LLMs' math critique ability by first fine-tuning with deliberate step-wise critiques and then applying reinforcement learning, resulting in a model that outperforms existing critics and improves error correction.
Authors:Atahan Karagoz
Abstract:
Unsupervised learning of disease subtypes from multi-omics data presents a significant opportunity for advancing personalized medicine. We introduce OmicsCL, a modular contrastive learning framework that jointly embeds heterogeneous omics modalities-such as gene expression, DNA methylation, and miRNA expression-into a unified latent space. Our method incorporates a survival-aware contrastive loss that encourages the model to learn representations aligned with survival-related patterns, without relying on labeled outcomes. Evaluated on the TCGA BRCA dataset, OmicsCL uncovers clinically meaningful clusters and achieves strong unsupervised concordance with patient survival. The framework demonstrates robustness across hyperparameter configurations and can be tuned to prioritize either subtype coherence or survival stratification. Ablation studies confirm that integrating survival-aware loss significantly enhances the predictive power of learned embeddings. These results highlight the promise of contrastive objectives for biological insight discovery in high-dimensional, heterogeneous omics data.
中文: OmicsCL是一种对比学习框架,通过融合多组学数据并引入生存感知损失,无需标记结果即可有效识别具有临床意义的疾病亚型。
English: OmicsCL is a contrastive learning framework that integrates multi-omics data into a unified latent space with survival-aware loss, effectively identifying clinically relevant disease subtypes without labeled outcomes.
Authors:Marco Braga, Pranav Kasela, Alessandro Raganato, Gabriella Pasi
Abstract:
Large Language Models (LLMs) have shown impressive zero-shot performance across a variety of Natural Language Processing tasks, including document re-ranking. However, their effectiveness degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. In this paper, we investigate Task Arithmetic, a technique that combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations, such as addition or subtraction, to adapt retrieval models without requiring additional fine-tuning. Our method is able to synthesize diverse tasks and domain knowledge into a single model, enabling effective zero-shot adaptation in different retrieval contexts. Extensive experiments on publicly available scientific, biomedical, and multilingual datasets show that our method improves state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in P@10. In addition to these empirical gains, our analysis provides insights into the strengths and limitations of Task Arithmetic as a practical strategy for zero-shot learning and model adaptation. We make our code publicly available at https://github.com/DetectiveMB/Task-Arithmetic-for-ZS-IR.
中文摘要:本文提出的任务算术方法通过简单数学运算整合不同任务和领域的预训练模型权重,使大语言模型无需额外微调即可实现有效的零样本文档重排,在多项测试中显著提升了检索性能。
English Summary: This paper introduces Task Arithmetic, a technique that adapts Large Language Models for zero-shot document re-ranking by combining pre-trained weights from different tasks and domains through simple mathematical operations, achieving significant performance improvements without additional fine-tuning.
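Task Arithmetic itself reduces to element-wise operations on model weights. A minimal sketch in PyTorch follows; the helper names, the single merging coefficient `lam`, and the commented usage are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def task_vector(base_state, finetuned_state):
    """Task vector = fine-tuned weights minus the shared base weights."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def merge(base_state, task_vectors, lam=1.0):
    """Add a scaled sum of task vectors back onto the base model's weights."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += lam * tv[k]
    return merged

# Usage sketch: combine re-rankers adapted to scientific and biomedical text.
# base = base_model.state_dict()
# sci, bio = sci_model.state_dict(), bio_model.state_dict()
# zero_shot_state = merge(base, [task_vector(base, sci), task_vector(base, bio)], lam=0.5)
# target_model.load_state_dict(zero_shot_state)
```

The appeal of the approach is visible in the sketch: no gradient steps are taken, so adapting the re-ranker to a new domain costs only a weighted sum over state dictionaries.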
Authors:Haozheng Luo, Chenghao Qiu, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu
Abstract:
To address the challenge of scarce computational resources in genomic modeling, we introduce GERM, a genomic foundation model with strong compression performance and fast adaptability. GERM improves upon models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization, enhancing both efficiency and robustness. We replace the vanilla attention layer with an outlier-free mechanism inspired by associative memory models. By removing outliers during both pre-training and fine-tuning, this approach accelerates adaptation, reduces computational costs, and enhances quantization robustness within acceptable loss margins. Additionally, we propose GERM-T, a strategy that employs small-step continual learning within the outlier-free framework, leveraging original checkpoints to avoid retraining from scratch. Empirically, GERM improves fine-tuning performance by 37.98% and quantization by 64.34% over the baseline model. It also reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%. Compared to leading methods, GERM consistently delivers superior performance, offering a practical solution for genomic modeling in resource-constrained settings. Code is available at https://github.com/MAGICS-LAB/GERM.
中文: GERM是一种基因组基础模型,通过消除注意力机制中的异常值来提高效率、适应性和量化鲁棒性,在微调和量化方面实现了显著的性能提升,为资源受限环境提供了实用解决方案。
English: GERM is a genomic foundation model designed to overcome computational limitations by eliminating outliers in attention mechanisms, which enhances efficiency, adaptability, and quantization robustness, achieving significant performance improvements in fine-tuning and quantization.
Authors:Alex Schutz, Yang You, Matias Mattamala, Ipek Caliskanelli, Bruno Lacerda, Nick Hawes
Abstract:
Deterministic partially observable Markov decision processes (DetPOMDPs) often arise in planning problems where the agent is uncertain about its environmental state but can act and observe deterministically. In this paper, we propose DetMCVI, an adaptation of the Monte Carlo Value Iteration (MCVI) algorithm for DetPOMDPs, which builds policies in the form of finite-state controllers (FSCs). DetMCVI solves large problems with a high success rate, outperforming existing baselines for DetPOMDPs. We also verify the performance of the algorithm in a real-world mobile robot forest mapping scenario.
中文摘要:DetMCVI算法将蒙特卡洛值迭代应用于确定性部分可观测马尔可夫决策过程,通过构建有限状态控制器策略,在解决大规模问题时表现优异,并在真实机器人森林测绘场景中得到验证。
English Summary: DetMCVI adapts Monte Carlo Value Iteration for deterministic partially observable Markov decision processes, effectively solving large problems and outperforming existing methods, as validated in a real-world robot mapping application.
Authors:Yue Meng, Chuchu Fan
Abstract:
Learning to solve complex tasks with signal temporal logic (STL) specifications is crucial to many real-world applications. However, most previous works only consider fixed or parametrized STL specifications due to the lack of a diverse STL dataset and encoders to effectively extract temporal logic information for downstream tasks. In this paper, we propose TeLoGraF, Temporal Logic Graph-encoded Flow, which utilizes a Graph Neural Network (GNN) encoder and flow-matching to learn solutions for general STL specifications. We identify four commonly used STL templates and collect a total of 200K specifications with paired demonstrations. We conduct extensive experiments in five simulation environments ranging from simple dynamical models in the 2D space to the high-dimensional 7DoF Franka Panda robot arm and Ant quadruped navigation. Results show that our method outperforms other baselines in the STL satisfaction rate. Compared to classical STL planning algorithms, our approach is 10-100X faster in inference and can work on any system dynamics. In addition, we show our graph-encoding method's capability to solve complex STLs and its robustness to out-of-distribution STL specifications. Code is available at https://github.com/mengyuest/TeLoGraF
中文: 本文提出TeLoGraF方法,利用图神经网络和流匹配技术高效解决通用信号时序逻辑规范,在多种仿真环境中展现出优越的性能和速度。
English: This paper introduces TeLoGraF, a method using Graph Neural Networks and flow-matching to efficiently solve general signal temporal logic specifications, demonstrating superior performance and speed in various simulations.
Authors:Chanwoo Kim, Jinkyu Sung, Yebonn Han, Joonseok Lee
Abstract:
Graph convolutional networks have recently gained prominence in collaborative filtering (CF) for recommendations. However, we identify potential bottlenecks in two foundational components. First, the embedding layer leads to a latent space with limited capacity, overlooking locally observed but potentially valuable preference patterns. Also, the widely-used neighborhood aggregation is limited in its ability to leverage diverse preference patterns in a fine-grained manner. Building on spectral graph theory, we reveal that these limitations stem from graph filtering with a cut-off in the frequency spectrum and a restricted linear form. To address these issues, we introduce ChebyCF, a CF framework based on graph spectral filtering. Instead of a learned embedding, it takes a user's raw interaction history to utilize the full spectrum of signals contained in it. Also, it adopts Chebyshev interpolation to effectively approximate a flexible non-linear graph filter, and further enhances it by using an additional ideal pass filter and degree-based normalization. Through extensive experiments, we verify that ChebyCF overcomes the aforementioned bottlenecks and achieves state-of-the-art performance across multiple benchmarks and reasonably fast inference. Our code is available at https://github.com/chanwoo0806/ChebyCF.
中文:ChebyCF提出了一种基于图谱滤波和切比雪夫插值的协同过滤框架,有效解决了嵌入层容量限制和邻域聚合的不足,在多个基准测试中实现了领先性能。
English: ChebyCF introduces a collaborative filtering framework using graph spectral filtering and Chebyshev interpolation to overcome limitations in embedding capacity and neighborhood aggregation, achieving state-of-the-art performance across benchmarks.
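The core operation, a Chebyshev polynomial filter applied to a user's raw interaction vector instead of a learned embedding, can be sketched as follows. This is a simplified, illustrative version that assumes a normalized item-item matrix `A_norm` with spectrum in [-1, 1] and given coefficients `theta`; the actual ChebyCF filter additionally uses an ideal pass filter and degree-based normalization.

```python
import numpy as np

def chebyshev_filter(A_norm, x, theta):
    """Apply the polynomial filter sum_k theta[k] * T_k(A_norm) @ x,
    where T_k are Chebyshev polynomials of the first kind."""
    t_prev = x                       # T_0(A) x = x
    out = theta[0] * t_prev
    if len(theta) > 1:
        t_curr = A_norm @ x          # T_1(A) x = A x
        out = out + theta[1] * t_curr
        for k in range(2, len(theta)):
            t_next = 2 * (A_norm @ t_curr) - t_prev   # three-term recurrence
            out = out + theta[k] * t_next
            t_prev, t_curr = t_curr, t_next
    return out

# Toy usage: score 5 items for one user from a binary interaction history.
rng = np.random.default_rng(0)
A_norm = rng.standard_normal((5, 5)); A_norm = (A_norm + A_norm.T) / 10  # small, symmetric
history = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # raw interactions, no learned embedding
scores = chebyshev_filter(A_norm, history, theta=[0.5, 0.3, 0.2])
```

Because only sparse matrix-vector products appear in the recurrence, the filter scales to large catalogs, which is consistent with the reasonably fast inference reported above.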
Authors:Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yixuan Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Jürgen Schmidhuber, Chao Huang
Abstract:
Reinforcement learning (RL) with delays is challenging as sensory perceptions lag behind the actual events: the RL agent needs to estimate the real state of its environment based on past observations. State-of-the-art (SOTA) methods typically employ recursive, step-by-step forecasting of states. This can cause the accumulation of compounding errors. To tackle this problem, our novel belief estimation method, named Directly Forecasting Belief Transformer (DFBT), directly forecasts states from observations without incrementally estimating intermediate states step-by-step. We theoretically demonstrate that DFBT greatly reduces compounding errors of existing recursively forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines. Code is available at https://github.com/QingyuanWuNothing/DFBT.
中文摘要:提出的DFBT方法通过直接从观测值预测状态来减少延迟强化学习中的累积误差,在理论保证和实验基准上均显著优于现有方法。
English Summary: The proposed DFBT method directly forecasts states from observations to minimize compounding errors in reinforcement learning with delays, significantly outperforming existing approaches in both theoretical guarantees and experimental benchmarks.
Authors:Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin
Abstract:
A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods that analyze already-trained models, our approach integrates interpretability directly into model training, promoting feature localization from inception. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also decreased overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability. To facilitate reproducibility, we make our code available at https://github.com/keenanpepper/self-ablating-transformers.
中文: 本研究提出了一种自消融机制,强制语言变换器进行选择性激活,发现尽管该方法降低了整体稀疏性,但通过促进局部专业化和特征集中,在不牺牲性能的情况下增强了可解释性。
English: This study introduces a self-ablation mechanism that enforces selective activation in language transformers, revealing that while it reduces overall sparsity, it enhances interpretability by promoting local specialization and feature concentration without sacrificing performance.
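The k-winner-takes-all constraint at the heart of the self-ablation mechanism can be sketched as a simple activation wrapper in PyTorch. The function below is illustrative only; the paper applies the constraint across both neuron and attention units during training.

```python
import torch

def k_winner_takes_all(activations: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations along the last dimension, zero the rest."""
    _, topk_idx = activations.topk(k, dim=-1)
    mask = torch.zeros_like(activations)
    mask.scatter_(-1, topk_idx, 1.0)     # 1.0 at the winning positions
    return activations * mask

x = torch.randn(2, 8)                    # batch of 2, 8 hidden units
sparse = k_winner_takes_all(x, k=2)      # only 2 units per row remain active
```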
Authors:Thomas Flinkow, Marco Casadio, Colin Kessler, Rosemary Monahan, Ekaterina Komendantskaya
Abstract:
Neural networks have been shown to frequently fail to learn critical safety and correctness properties purely from data, highlighting the need for training methods that directly integrate logical specifications. While adversarial training can be used to improve robustness to small perturbations within $ε$-cubes, domains other than computer vision -- such as control systems and natural language processing -- may require more flexible input region specifications via generalised hyper-rectangles. Differentiable logics offer a way to encode arbitrary logical constraints as additional loss terms that guide the learning process towards satisfying these constraints. In this paper, we investigate how these two complementary approaches can be unified within a single framework for property-driven machine learning, as a step toward effective formal verification of neural networks. We show that well-known properties from the literature are subcases of this general approach, and we demonstrate its practical effectiveness on a case study involving a neural network controller for a drone system. Our framework is made publicly available at https://github.com/tflinkow/property-driven-ml.
中文: 神经网络常无法仅从数据中学习安全属性,需要通过可微逻辑和对抗训练将逻辑规范融入统一框架,以提升网络在控制等领域的可靠性与验证效果。
English: Neural networks often fail to learn safety properties from data alone, necessitating training methods that incorporate logical specifications through differentiable logics and adversarial training within a unified framework.
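As an illustration of the general idea (not the library's own API), a property such as "for all inputs in a hyper-rectangle, the controller output stays within bounds" can be turned into an extra differentiable loss term and added to the task loss; the sampling-based approximation of the universal quantifier below is an assumption made for the sketch.

```python
import torch

def constraint_loss(model, x_lo, x_hi, y_lo, y_hi, n_samples=64):
    """Penalize violations of: for all x in [x_lo, x_hi], y_lo <= model(x) <= y_hi.
    The universal quantifier is approximated by sampling the hyper-rectangle."""
    u = torch.rand(n_samples, x_lo.shape[-1])
    x = x_lo + u * (x_hi - x_lo)                   # samples inside the input region
    y = model(x)
    # hinge-style translation of the conjunction (y >= y_lo) AND (y <= y_hi)
    violation = torch.relu(y_lo - y) + torch.relu(y - y_hi)
    return violation.mean()

# total_loss = task_loss + lambda_c * constraint_loss(controller, x_lo, x_hi, y_lo, y_hi)
```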
Authors:Yu-Hsiang Lan, Eric K. Oermann
Abstract:
There has been a recent surge of interest in time series modeling using the Transformer architecture. However, forecasting multivariate time series with Transformer presents a unique challenge as it requires modeling both temporal (cross-time) and variate (cross-variate) dependencies. While Transformer-based models have gained popularity for their flexibility in capturing both sequential and cross-variate relationships, it is unclear how to best integrate these two sources of information in the context of the Transformer architecture while optimizing for both performance and efficiency. We re-purpose the Transformer architecture to effectively model both cross-time and cross-variate dependencies. Our approach begins by embedding each variate independently into a variate-wise representation that captures its cross-time dynamics, and then models cross-variate dependencies through attention mechanisms on these learned embeddings. Gating operations in both cross-time and cross-variate modeling phases regulate information flow, allowing the model to focus on the most relevant features for accurate predictions. Our method achieves state-of-the-art performance across 13 real-world datasets and can be seamlessly integrated into other Transformer-based and LLM-based forecasters, delivering performance improvements up to 20.7\% over original models. Code is available at this repository: https://github.com/nyuolab/Gateformer.
中文摘要:本研究提出了一种基于Transformer的新方法,通过专门的嵌入和门控机制有效建模多元时间序列中的时间和变量间依赖关系,在多个数据集上实现了最先进的预测性能。
English Summary: This study introduces a novel Transformer-based approach that effectively models both temporal and cross-variate dependencies in multivariate time series forecasting through specialized embedding and gating mechanisms, achieving state-of-the-art performance across multiple datasets.
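A simplified sketch of the two-stage design, variate-wise embedding followed by gated cross-variate attention, is shown below. Module and parameter names are illustrative assumptions, not the released Gateformer code.

```python
import torch
import torch.nn as nn

class GatedVariateAttention(nn.Module):
    """Embed each variate's full history independently into one token, then mix
    variates with self-attention; sigmoid gates regulate both information flows."""
    def __init__(self, seq_len: int, d_model: int, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)       # cross-time: one token per variate
        self.time_gate = nn.Linear(seq_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.var_gate = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, seq_len)        # forecast horizon (= seq_len here)

    def forward(self, x):                              # x: (batch, n_variates, seq_len)
        tok = self.embed(x) * torch.sigmoid(self.time_gate(x))    # gated variate tokens
        mixed, _ = self.attn(tok, tok, tok)                       # cross-variate attention
        mixed = mixed * torch.sigmoid(self.var_gate(mixed)) + tok # gated residual mixing
        return self.head(mixed)                        # (batch, n_variates, horizon)

model = GatedVariateAttention(seq_len=96, d_model=64)
y = model(torch.randn(8, 7, 96))                       # e.g. 7 variates, 96 past steps
```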
Authors:Shingo Higashiguchi, Yasuko Matsubara, Koki Kawabata, Taichi Murayama, Yasushi Sakurai
Abstract:
Large quantities of social activity data, such as weekly web search volumes and the number of new infections with infectious diseases, reflect people's interests and activities. It is important to discover temporal patterns from such data and to forecast future activities accurately. However, modeling and forecasting social activity data streams is difficult because they are high-dimensional and composed of multiple time-varying dynamics such as trends, seasonality, and interest diffusion. In this paper, we propose D-Tracker, a method for continuously capturing time-varying temporal patterns within social activity tensor data streams and forecasting future activities. Our proposed method has the following properties: (a) Interpretable: it incorporates the partial differential equation into a tensor decomposition framework and captures time-varying temporal patterns such as trends, seasonality, and interest diffusion between locations in an interpretable manner; (b) Automatic: it has no hyperparameters and continuously models tensor data streams fully automatically; (c) Scalable: the computation time of D-Tracker is independent of the time series length. Experiments using web search volume data obtained from GoogleTrends and COVID-19 infection data obtained from the COVID-19 Open Data Repository show that our method can achieve higher forecasting accuracy in less computation time than existing methods while extracting the interest diffusion between locations. Our source code and datasets are available at https://github.com/Higashiguchi-Shingo/D-Tracker.
中文: D-Tracker是一种可解释、自动且可扩展的方法,用于建模和预测高维社会活动数据流,在捕捉趋势和兴趣传播等时变模式方面实现了更高的准确性和效率。
English: D-Tracker is an interpretable, automatic, and scalable method for modeling and forecasting high-dimensional social activity data streams, achieving superior accuracy and efficiency in capturing time-varying patterns like trends and interest diffusion.
Authors:Filipp Nikitin, Ian Dunn, David Ryan Koes, Olexandr Isayev
Abstract:
Deep generative models have shown significant promise in generating valid 3D molecular structures, with the GEOM-Drugs dataset serving as a key benchmark. However, current evaluation protocols suffer from critical flaws, including incorrect valency definitions, bugs in bond order calculations, and reliance on force fields inconsistent with the reference data. In this work, we revisit GEOM-Drugs and propose a corrected evaluation framework: we identify and fix issues in data preprocessing, construct chemically accurate valency tables, and introduce a GFN2-xTB-based geometry and energy benchmark. We retrain and re-evaluate several leading models under this framework, providing updated performance metrics and practical recommendations for future benchmarking. Our results underscore the need for chemically rigorous evaluation practices in 3D molecular generation. Our recommended evaluation methods and GEOM-Drugs processing scripts are available at https://github.com/isayevlab/geom-drugs-3dgen-evaluation.
中文: 本研究通过提出包含精确化合价定义和基于GFN2-xTB基准的校正框架,解决了当前三维分子生成评估方法的缺陷,为化学严谨性评估提供了更新指标和实践建议。
English: This study addresses flaws in current evaluation protocols for 3D molecular generation by proposing a corrected framework with accurate valency definitions and GFN2-xTB-based benchmarks, providing updated metrics and recommendations for chemically rigorous assessments.
Authors:Jiuwu Hao, Liguo Sun, Yuting Wan, Yueyang Wu, Ti Xiang, Haolin Song, Pin Lv
Abstract:
Collaborative perception enhances environmental awareness through inter-agent communication and is regarded as a promising solution to intelligent transportation systems. However, existing collaborative methods for Unmanned Aerial Vehicles (UAVs) overlook the unique characteristics of the UAV perspective, resulting in substantial communication overhead. To address this issue, we propose a novel communication-efficient collaborative perception framework based on late-intermediate fusion, dubbed LIF. The core concept is to exchange informative and compact detection results and shift the fusion stage to the feature representation level. In particular, we leverage vision-guided positional embedding (VPE) and box-based virtual augmented feature (BoBEV) to effectively integrate complementary information from various agents. Additionally, we innovatively introduce an uncertainty-driven communication mechanism that uses uncertainty evaluation to select high-quality and reliable shared areas. Experimental results demonstrate that our LIF achieves superior performance with minimal communication bandwidth, proving its effectiveness and practicality. Code and models are available at https://github.com/uestchjw/LIF.
中文: 本文提出LIF框架,通过后期中间融合和不确定性驱动机制,实现无人机间高效通信的协同感知,在保证性能的同时大幅降低通信开销。
English: This paper introduces LIF, a communication-efficient collaborative perception framework for UAVs that uses late-intermediate fusion and an uncertainty-driven mechanism to minimize bandwidth while maintaining high performance.
Authors:Ting Qiao, Yingjia Wang, Xing Liu, Sixing Wu, Jianbing Li, Yiming Li
Abstract:
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where an attacker manipulates a small portion of the training data to implant hidden backdoors into the model. The compromised model behaves normally on clean samples but misclassifies backdoored samples into the attacker-specified target class, posing a significant threat to real-world DNN applications. Currently, several empirical defense methods have been proposed to mitigate backdoor attacks, but they are often bypassed by more advanced backdoor techniques. In contrast, certified defenses based on randomized smoothing have shown promise by adding random noise to training and testing samples to counteract backdoor attacks. In this paper, we reveal that existing randomized smoothing defenses implicitly assume that all samples are equidistant from the decision boundary. However, this may not hold in practice, leading to suboptimal certification performance. To address this issue, we propose a sample-specific certified backdoor defense method, termed Cert-SSB. Cert-SSB first employs stochastic gradient ascent to optimize the noise magnitude for each sample, ensuring a sample-specific noise level that is then applied to multiple poisoned training sets to retrain several smoothed models. After that, Cert-SSB aggregates the predictions of multiple smoothed models to generate the final robust prediction. In this case, existing certification methods become inapplicable because the optimized noise varies across samples. To overcome this challenge, we introduce a storage-update-based certification method, which dynamically adjusts each sample's certification region to improve certification performance. We conduct extensive experiments on multiple benchmark datasets, demonstrating the effectiveness of our proposed method. Our code is available at https://github.com/NcepuQiaoTing/Cert-SSB.
中文: 深度神经网络易受后门攻击,现有基于随机平滑的认证防御方法假设所有样本与决策边界等距,导致性能不佳;本文提出Cert-SSB方法,通过为每个样本优化噪声水平并采用动态认证机制,有效提升了防御能力。
English: Deep neural networks are susceptible to backdoor attacks, but existing certified defenses using randomized smoothing assume uniform sample distances from decision boundaries, leading to suboptimal performance; this paper introduces Cert-SSB, a sample-specific defense that optimizes noise levels per sample and employs a dynamic certification method to enhance robustness.
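For context, the randomized-smoothing prediction step that Cert-SSB builds on looks roughly like the following; the per-sample `sigma` is the quantity Cert-SSB optimizes by stochastic gradient ascent. This is a hedged sketch with assumed names, not the released code, and it omits the storage-update certification step.

```python
import torch

@torch.no_grad()
def smoothed_predict(models, x, sigma, n_samples=100):
    """Majority vote of an ensemble of smoothed classifiers under Gaussian
    noise of a sample-specific magnitude `sigma`."""
    votes = None
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        logits = torch.stack([m(noisy) for m in models]).mean(0)   # aggregate ensemble
        one_hot = torch.nn.functional.one_hot(logits.argmax(-1), logits.shape[-1])
        votes = one_hot if votes is None else votes + one_hot
    return votes.argmax(-1)        # class receiving the most noisy-input votes
```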
Authors:Hannes Reichert, Benjamin Serfling, Elijah Schüssler, Kerim Turacan, Konrad Doll, Bernhard Sick
Abstract:
Numerous previous works emphasize the importance of semantic segmentation of LiDAR data as a critical component in the development of driver-assistance systems and autonomous vehicles. However, many state-of-the-art methods are tested on outdated, lower-resolution LiDAR sensors and struggle with real-time constraints. This study introduces a novel semantic segmentation framework tailored for modern high-resolution LiDAR sensors that addresses both accuracy and real-time processing demands. We propose a novel LiDAR dataset collected by a cutting-edge automotive 128-layer LiDAR in urban traffic scenes. Furthermore, we propose a semantic segmentation method utilizing surface normals as strong input features. Our approach bridges the gap between cutting-edge research and practical automotive applications. Additionally, we provide a Robot Operating System (ROS2) implementation that we operate on our research vehicle. Our dataset and code are publicly available: https://github.com/kav-institute/SemanticLiDAR.
中文: 本研究针对高分辨率激光雷达数据,提出了一种利用表面法线和新城市数据集的实时语义分割框架,旨在弥合前沿研究与实际汽车应用之间的差距。
English: This study introduces a real-time semantic segmentation framework for high-resolution LiDAR data, utilizing surface normals and a new urban dataset to bridge research with practical automotive applications.
Authors:Qinfeng Zhu, Yunxi Jiang, Lei Fan
Abstract:
We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise-CRF.
中文:ClassWise-CRF架构通过类别自适应加权融合专家网络预测并结合CRF优化,在遥感图像语义分割中显著提升了mIoU指标。
English: The ClassWise-CRF architecture enhances semantic segmentation by adaptively fusing expert networks' predictions using category-specific weighting and CRF optimization, achieving significant mIoU improvements on remote sensing datasets.
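The exponential, category-specific weighting at the core of the fusion can be sketched as follows. This is an illustrative simplification under assumed array shapes and a temperature parameter `tau`; the full architecture additionally runs a greedy expert-selection stage and a CRF refinement with unary and pairwise potentials.

```python
import numpy as np

def classwise_fuse(probs, val_iou, tau=10.0):
    """Fuse per-pixel class probabilities from several expert networks.

    probs:   (M, H, W, C) softmax outputs of M expert networks
    val_iou: (M, C) per-class IoU of each network on the validation set
    tau:     temperature of the exponential weighting
    """
    w = np.exp(tau * val_iou)                       # (M, C) category-specific weights
    w = w / w.sum(axis=0, keepdims=True)            # normalize across networks per class
    fused = np.einsum('mc,mhwc->hwc', w, probs)     # weighted sum, class by class
    return fused.argmax(axis=-1)                    # fused segmentation map

# Dummy usage with random confidence maps and IoU priors.
M, H, W, C = 3, 4, 4, 5
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(C), size=(M, H, W))   # mock softmax maps
val_iou = rng.uniform(0.3, 0.9, size=(M, C))        # mock validation IoU priors
seg = classwise_fuse(probs, val_iou)
```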
Authors:Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui
Abstract:
Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.
中文:Galvatron是一个开源分布式系统,能自动选择最优混合并行策略来高效训练大规模基础模型,提供更高吞吐量和友好易用的界面。
English: Galvatron is an open-source distributed system that automatically determines the optimal hybrid parallelism strategy for training large-scale Foundation Models, delivering higher throughput and user-friendly accessibility.
Authors:Khoa Tuan Nguyen, Ho-min Park, Gaeun Oh, Joris Vankerschaver, Wesley De Neve
Abstract:
We propose a novel approach to cervical cell image classification for cervical cancer screening using the EVA-02 transformer model. We developed a four-step pipeline: fine-tuning EVA-02, feature extraction, selecting important features through multiple machine learning models, and training a new artificial neural network with optional loss weighting for improved generalization. With this design, our best model achieved an F1-score of 0.85227, outperforming the baseline EVA-02 model (0.84878). We also utilized Kernel SHAP analysis and identified key features correlating with cell morphology and staining characteristics, providing interpretable insights into the decision-making process of the fine-tuned model. Our code is available at https://github.com/Khoa-NT/isbi2025_ps3c.
中文: 我们提出了一种基于EVA-02转换器模型的宫颈细胞图像分类新方法,通过四步流程实现了0.85227的优异F1分数,并利用SHAP分析提供了可解释的决策依据。
English: We introduce a novel cervical cell classification method using the EVA-02 transformer model with a four-step pipeline that achieved a superior F1-score of 0.85227 and provided interpretable insights through SHAP analysis.
Authors:Jinpeng Wang, Tianci Luo, Yaohua Zha, Yan Feng, Ruisheng Luo, Bin Chen, Tao Dai, Long Chen, Yaowei Wang, Shu-Tao Xia
Abstract:
Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone, Condenser ensures accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-arts across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code is open-sourced at https://github.com/gimpong/CVPR25-Condenser.
中文: 本研究提出Condenser方法,通过协同压缩多个提示实现视觉上下文学习优化,在基准测试中展现出优于现有技术的效率与性能表现。
English: The study introduces Condenser, a prompt condensation method that collaboratively compresses multiple prompts to enhance visual in-context learning, outperforming existing techniques in efficiency and performance across benchmarks.
Authors:Sixuan Wang, Jiao Yin, Jinli Cao, MingJian Tang, Hua Wang, Yanchun Zhang
Abstract:
Effective and efficient graph representation learning is essential for enabling critical downstream tasks, such as node classification, link prediction, and subgraph search. However, existing graph neural network (GNN) architectures often struggle to adapt to diverse and complex graph structures, limiting their ability to produce structure-aware and task-discriminative representations. To address this challenge, we propose ABG-NAS, a novel framework for automated graph neural network architecture search tailored for efficient graph representation learning. ABG-NAS encompasses three key components: a Comprehensive Architecture Search Space (CASS), an Adaptive Genetic Optimization Strategy (AGOS), and a Bayesian-Guided Tuning Module (BGTM). CASS systematically explores diverse propagation (P) and transformation (T) operations, enabling the discovery of GNN architectures capable of capturing intricate graph characteristics. AGOS dynamically balances exploration and exploitation, ensuring search efficiency and preserving solution diversity. BGTM further optimizes hyperparameters periodically, enhancing the scalability and robustness of the resulting architectures. Empirical evaluations on benchmark datasets (Cora, PubMed, Citeseer, and CoraFull) demonstrate that ABG-NAS consistently outperforms both manually designed GNNs and state-of-the-art neural architecture search (NAS) methods. These results highlight the potential of ABG-NAS to advance graph representation learning by providing scalable and adaptive solutions for diverse graph structures. Our code is publicly available at https://github.com/sserranw/ABG-NAS.
中文摘要:ABG-NAS框架通过自适应遗传优化和贝叶斯调优自动搜索图神经网络最优架构,在基准数据集上持续超越现有方法,为多样化图结构提供可扩展的解决方案。
English Summary: ABG-NAS is an automated framework that efficiently searches for optimal graph neural network architectures through adaptive genetic optimization and Bayesian tuning, consistently outperforming existing methods on benchmark datasets.
Authors:Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang
Abstract:
Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data while preserving privacy. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. However, such a one-prompt-fits-all learning paradigm typically leads to performance degradation on personalized samples. Although the mixture of experts (MoE) offers a promising solution for specialization, existing MoE-based methods suffer from coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG, which treats multiple prompts as distinct experts. Unlike existing image-level routing designs, TRIP assigns different tokens within an image to specific experts. To ensure communication efficiency, TRIP incorporates a parameter-free routing mechanism based on token clustering and optimal transport. The instance-specific prompt is then synthesized by aggregating experts, weighted by the number of tokens assigned to each. Additionally, TRIP develops an unbiased learning strategy for prompt experts, leveraging the VLM's zero-shot generalization capability. Extensive experiments across four benchmarks demonstrate that TRIP achieves optimal generalization results, with communication of only 1K parameters per round. Our code is available at https://github.com/GongShuai8210/TRIP.
中文: TRIP提出了一种用于联邦领域泛化的无参数路由令牌级提示混合框架,通过将图像令牌分配给专业专家,实现了高效通信和增强的模型泛化能力。
English: TRIP introduces a token-level prompt mixture framework with parameter-free routing for federated domain generalization, enabling efficient communication and enhanced model generalization by assigning image tokens to specialized experts.
Authors:Yiting Zhang, Shichen Li, Elena Shrestha
Abstract:
Mechanical search (MS) in cluttered environments remains a significant challenge for autonomous manipulators, requiring long-horizon planning and robust state estimation under occlusions and partial observability. In this work, we introduce XPG-RL, a reinforcement learning framework that enables agents to efficiently perform MS tasks through explainable, priority-guided decision-making based on raw sensory inputs. XPG-RL integrates a task-driven action prioritization mechanism with a learned context-aware switching strategy that dynamically selects from a discrete set of action primitives such as target grasping, occlusion removal, and viewpoint adjustment. Within this strategy, a policy is optimized to output adaptive threshold values that govern the discrete selection among action primitives. The perception module fuses RGB-D inputs with semantic and geometric features to produce a structured scene representation for downstream decision-making. Extensive experiments in both simulation and real-world settings demonstrate that XPG-RL consistently outperforms baseline methods in task success rates and motion efficiency, achieving up to 4.5$\times$ higher efficiency in long-horizon tasks. These results underscore the benefits of integrating domain knowledge with learnable decision-making policies for robust and efficient robotic manipulation. The project page for XPG-RL is https://yitingzhang1997.github.io/xpgrl/.
Authors:Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
Abstract:
We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0\% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention
中文: Softpick 是 Transformer 注意力机制中 softmax 的改进替代方案,它消除了注意力汇聚现象并显著降低激活峰度,在量化模型中表现更优,为多种优化技术开辟了新途径。
English: Softpick is a rectified, non-sum-to-one alternative to softmax in transformer attention that eliminates attention sinks and reduces activation kurtosis, improving performance in quantized models and enabling new optimization possibilities.
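A hedged sketch of a rectified, not-sum-to-one softmax replacement in the spirit of the abstract above; the exact numerator and normalizer used here are assumptions, and the released repository is authoritative for the true softpick formula.

```python
import torch

def softpick(scores: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Rectified, not-sum-to-one softmax replacement (sketch).

    Scores at or below zero get exactly zero weight, so no token is forced to
    absorb leftover probability mass (no attention sink). A production version
    would rescale by the row maximum for numerical stability, as in the
    authors' released code.
    """
    shifted = torch.exp(scores) - 1.0
    num = torch.relu(shifted)                       # rectified numerator
    den = shifted.abs().sum(dim=dim, keepdim=True) + eps
    return num / den                                # rows need not sum to one

attn_scores = torch.randn(2, 4, 4)                  # (batch, queries, keys)
weights = softpick(attn_scores)
print(weights.sum(dim=-1))                          # generally != 1, unlike softmax
```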
Authors:Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang
Abstract:
We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents - orchestrator, deflector, responder, and evaluator - collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time - both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy) - substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks, without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. For jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with false refusal rates of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Code is available at https://github.com/zikuicai/aegisllm
中文: AegisLLM是一种多智能体防御系统,通过协同工作的智能体角色和自动提示优化来增强大语言模型的安全性,在无需重新训练模型的情况下,在遗忘和越狱基准测试中取得了显著改进。
English: AegisLLM is a multi-agent defense system that enhances LLM security through collaborative agent roles and automated prompt optimization, achieving significant improvements in unlearning and jailbreaking benchmarks without requiring model retraining.
Authors:Quentin Guimard, Moreno D'Incà, Massimiliano Mancini, Elisa Ricci
Abstract:
A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect: this greatly limits the number of tasks where model biases can be identified. In this work, we present Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting biases together with task-specific target labels. A retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art bias detection baseline that relies on task-specific annotations, being a promising first step toward addressing task-agnostic unsupervised bias detection.
中文:C2B框架无需标注数据即可检测预训练模型中的偏见,它通过任务描述生成偏见建议并评估模型准确性,其性能优于依赖标注的方法,推动了无监督偏见检测的发展。
English: The C2B framework enables bias detection in pre-trained models without labeled data by using task descriptions to generate bias proposals and assess model accuracy, outperforming annotation-dependent methods and advancing unsupervised bias discovery.
Authors:Harry Mead, Clarissa Costen, Bruno Lacerda, Nick Hawes
Abstract:
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
中文: 该方法通过限制轨迹回报而非直接丢弃来优化CVaR,在多种环境中均展现出比基线方法更优的性能表现。
English: The proposed method improves sample efficiency in CVaR optimization by capping trajectory returns instead of discarding them, demonstrating consistent performance gains across multiple environments.
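A small numpy illustration of the contrast the abstract draws between discarding trajectories and capping their returns; the quantile-based cap and the Monte Carlo setup are illustrative, not the paper's estimator.

```python
import numpy as np

def cvar_discard(returns, alpha=0.1):
    """Standard CVaR-PG style: keep only the worst alpha-fraction of trajectories."""
    threshold = np.quantile(returns, alpha)
    kept = returns[returns <= threshold]      # most trajectories contribute nothing
    return kept.mean(), kept.size

def cvar_cap(returns, cap):
    """Reformulation sketched in the abstract: cap returns instead of discarding.

    Every trajectory still contributes a learning signal; trajectories above the
    cap simply stop being rewarded further. With an appropriately chosen cap this
    targets the same CVaR objective.
    """
    capped = np.minimum(returns, cap)
    return capped.mean(), capped.size

returns = np.random.default_rng(0).normal(loc=1.0, scale=1.0, size=1000)
print(cvar_discard(returns, alpha=0.1))                   # uses ~100 of 1000 trajectories
print(cvar_cap(returns, cap=np.quantile(returns, 0.1)))   # uses all 1000
```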
Authors:Florian Vahl, Jörn Griepenburg, Jan Gutsche, Jasper Güldenstein, Jianwei Zhang
Abstract:
This paper introduces SoccerDiffusion, a transformer-based diffusion model designed to learn end-to-end control policies for humanoid robot soccer directly from real-world gameplay recordings. Using data collected from RoboCup competitions, the model predicts joint command trajectories from multi-modal sensor inputs, including vision, proprioception, and game state. We employ a distillation technique to enable real-time inference on embedded platforms that reduces the multi-step diffusion process to a single step. Our results demonstrate the model's ability to replicate complex motion behaviors such as walking, kicking, and fall recovery both in simulation and on physical robots. Although high-level tactical behavior remains limited, this work provides a robust foundation for subsequent reinforcement learning or preference optimization methods. We release the dataset, pretrained models, and code under: https://bit-bots.github.io/SoccerDiffusion
Authors:Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
Abstract:
Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13\% and 10\% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
中文: 本研究通过分析大语言模型的中间推理步骤,质疑最终答案的可靠性,并发现聚合分段子思维的答案能显著提高不同模型和数据集上的准确性。
English: This study questions the reliability of final answers from Large Language Models by analyzing intermediate reasoning steps and finds that aggregating answers from segmented subthoughts significantly improves accuracy across various models and datasets.
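A sketch of the subthought-aggregation idea described above, assuming hypothetical `generate` and `extract_answer` callables (an LLM call and an answer parser supplied by the caller); the list of linguistic cues is illustrative.

```python
import re
from collections import Counter

# Linguistic cues that typically open a new subthought (illustrative list).
CUE = re.compile(r"(?<=\n)(?=(?:Wait|Alternatively|Therefore|Let me|Hmm)\b)")

def mode_over_subthoughts(question, trace, generate, extract_answer):
    """Aggregate answers from continuations started at each subthought boundary.

    `generate(prompt)` and `extract_answer(text)` are hypothetical callables:
    the former continues a reasoning prefix, the latter parses a final answer.
    """
    subthoughts = CUE.split(trace)       # zero-width split keeps the full text
    answers = []
    prefix = ""
    for piece in subthoughts:
        prefix += piece                                    # reasoning up to this subthought
        completion = generate(question + "\n" + prefix)    # continue from this point
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    # The mode of the per-subthought answers is often more reliable than the
    # single answer at the end of the original full trace.
    return Counter(answers).most_common(1)[0][0] if answers else None
```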
Authors:Adam Gudyś, Cezary Maszczyk, Joanna Badura, Adam Grzelak, Marek Sikora, Łukasz Wróbel
Abstract:
Rules offer an invaluable combination of predictive and descriptive capabilities. Our package for rule-based data analysis, RuleKit, has proven its effectiveness in classification, regression, and survival problems. Here we present its second version. New algorithms and optimized implementations of those previously included significantly improved the computational performance of our suite, reducing the analysis time of some data sets by two orders of magnitude. The usability of RuleKit 2 is provided by two new components: a Python package and a browser application with a graphical user interface. The former is compatible with scikit-learn, the most popular data mining library for Python, allowing RuleKit 2 to be straightforwardly integrated into existing data analysis pipelines. RuleKit 2 is available at GitHub under the GNU AGPL 3 license (https://github.com/adaa-polsl/RuleKit)
Chinese: RuleKit 2 版本引入了增强算法和优化实现,显著提升了计算性能,并通过新增的Python包和浏览器图形界面,实现了与现有数据分析流程的无缝集成。
English: RuleKit 2 introduces enhanced algorithms and optimized implementations, drastically improving computational performance and adding Python and browser-based interfaces for seamless integration into data analysis workflows.
Authors:Andrew Fitzgibbon, Stephen Felix
Abstract:
Large-scale numerical computations make increasing use of low-precision (LP) floating point formats and mixed precision arithmetic, which can be enhanced by the technique of stochastic rounding (SR), that is, rounding an intermediate high-precision value up or down randomly as a function of the value's distance to the two rounding candidates. Stochastic rounding requires, in addition to the high-precision input value, a source of random bits. As the provision of high-quality random bits is an additional computational cost, it is of interest to require as few bits as possible while maintaining the desirable properties of SR in a given computation, or computational domain. This paper examines a number of possible implementations of few-bit stochastic rounding (FBSR), and shows how several natural implementations can introduce sometimes significant biases into the rounding process that are not present in infinite-bit, infinite-precision analyses of these implementations. The paper explores the impact of these biases in machine learning examples, and hence exposes another class of configuration parameters of which practitioners should be aware when developing or adopting low-precision floating point. Code is available at http://github.com/graphcore-research/arith25-stochastic-rounding.
中文摘要:本文研究了几种少位随机舍入方法,发现某些实现方式会引入理想情况下不存在的显著偏差,并在机器学习应用中展示了这些偏差的影响。
English Summary: This paper investigates few-bit stochastic rounding methods, revealing that certain implementations can introduce significant biases not present in ideal scenarios, and demonstrates their impact in machine learning applications.
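A small numpy experiment showing the kind of bias the paper studies: truncating the fractional part to k random-comparison bits makes "round up" systematically too rare. The truncation rule shown is one natural (and biased) implementation, not necessarily one of those analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sr_exact(x, n=200_000):
    """Unbiased stochastic rounding of scalar x to an integer (Monte Carlo mean)."""
    frac = x - np.floor(x)
    up = rng.random(n) < frac
    return (np.floor(x) + up).mean()

def sr_few_bit(x, k=2, n=200_000):
    """Naive k-bit stochastic rounding: truncate the fraction to k bits and
    compare against a k-bit uniform integer. Truncation makes rounding up
    systematically too rare, i.e. a downward bias."""
    frac_bits = np.floor((x - np.floor(x)) * 2**k)          # truncated fraction
    up = rng.integers(0, 2**k, size=n) < frac_bits
    return (np.floor(x) + up).mean()

x = 2.3  # the expectation under unbiased SR is exactly 2.3
print(sr_exact(x))     # ~2.300
print(sr_few_bit(x))   # ~2.25 with k=2: the truncation bias the paper warns about
```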
Authors:Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer
Abstract:
We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
中文:ReasonIR-8B是首个专为通用推理任务训练的检索器,通过创新的合成数据生成方法,在推理基准测试中取得最优性能,并显著提升了RAG任务的表现。
English: ReasonIR-8B is the first retriever specifically trained for general reasoning tasks, achieving state-of-the-art performance on reasoning benchmarks and significantly enhancing RAG task results through a novel synthetic data generation pipeline.
Authors:Rilind Sahitaj, Paulius Sasnauskas, Yiğit Yalın, Debmalya Mandal, Goran Radanović
Abstract:
Performative Reinforcement Learning (PRL) refers to a scenario in which the deployed policy changes the reward and transition dynamics of the underlying environment. In this work, we study multi-agent PRL by incorporating performative effects into Markov Potential Games (MPGs). We introduce the notion of a performatively stable equilibrium (PSE) and show that it always exists under a reasonable sensitivity assumption. We then provide convergence results for state-of-the-art algorithms used to solve MPGs. Specifically, we show that independent policy gradient ascent (IPGA) and independent natural policy gradient (INPG) converge to an approximate PSE in the best-iterate sense, with an additional term that accounts for the performative effects. Furthermore, we show that INPG asymptotically converges to a PSE in the last-iterate sense. As the performative effects vanish, we recover the convergence rates from prior work. For a special case of our game, we provide finite-time last-iterate convergence results for a repeated retraining approach, in which agents independently optimize a surrogate objective. We conduct extensive experiments to validate our theoretical findings.
中文: 本研究针对马尔可夫势博弈中的多智能体表演性强化学习提出了表演性稳定均衡(PSE)概念,通过理论证明和实验验证了独立策略梯度方法能够收敛至近似PSE的结论。
English: This work introduces performatively stable equilibrium (PSE) for multi-agent performative reinforcement learning in Markov Potential Games, demonstrating that independent policy gradient methods converge to approximate PSE with theoretical guarantees validated by experiments.
Authors:Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Abstract:
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
Chinese: 单样本可验证奖励强化学习(RLVR)显著提升了大语言模型的数学推理能力,将MATH500基准测试准确率从36.0%提升至73.6%,并展现出跨领域泛化能力和训练饱和后的持续性能提升。
English: One-shot reinforcement learning with verifiable reward (RLVR) significantly enhances large language models' mathematical reasoning, boosting performance on benchmarks like MATH500 from 36.0% to 73.6% and demonstrating cross-domain generalization and post-saturation improvement.
Authors:Elena Martinez, Beatrice Moscoloni, Matteo Salvador, Fanwei Kong, Mathias Peirlinck, Alison Lesley Marsden
Abstract:
Combining physics-based modeling with data-driven methods is critical to enabling the translation of computational methods to clinical use in cardiology. The use of rigorous differential equations combined with machine learning tools allows for model personalization with uncertainty quantification in time frames compatible with clinical practice. However, accurate and efficient surrogate models of cardiac function, built from physics-based numerical simulation, are still mostly geometry-specific and require retraining for different patients and pathological conditions. We propose a novel computational pipeline to embed cardiac anatomies into full-field surrogate models. We generate a dataset of electrophysiology simulations using a complex multi-scale mathematical model coupling partial and ordinary differential equations. We adopt Branched Latent Neural Maps (BLNMs) as an effective scientific machine learning method to encode activation maps extracted from physics-based numerical simulations into a neural network. Leveraging large deformation diffeomorphic metric mappings, we build a biventricular anatomical atlas and parametrize the anatomical variability of a small and challenging cohort of 13 pediatric patients affected by Tetralogy of Fallot. We propose a novel statistical shape modeling based z-score sampling approach to generate a new synthetic cohort of 52 biventricular geometries that are compatible with the original geometrical variability. This synthetic cohort acts as the training set for BLNMs. Our surrogate model demonstrates robustness and great generalization across the complex original patient cohort, achieving an average adimensional mean squared error of 0.0034. The Python implementation of our BLNM model is publicly available under MIT License at https://github.com/StanfordCBCL/BLNM.
中文摘要:本研究提出一种将心脏解剖结构嵌入全场替代模型的计算流程,通过分支潜在神经映射和合成解剖采样,在法洛四联症儿科患者中实现了强大的泛化能力。
English Summary: This study introduces a computational pipeline that integrates cardiac anatomy into full-field surrogate models using Branched Latent Neural Maps, achieving robust generalization across pediatric patients with Tetralogy of Fallot through synthetic anatomical sampling.
Authors:Stefan Kober
Abstract:
Traditional k-means clustering underperforms on non-convex shapes and requires the number of clusters k to be specified in advance. We propose a simple geometric enhancement: after standard k-means, each cluster center is assigned a radius (the distance to its farthest assigned point), and clusters whose radii overlap are merged. This post-processing step loosens the requirement for exact k: as long as k is overestimated (but not excessively), the method can often reconstruct non-convex shapes through meaningful merges. We also show that this approach supports recursive partitioning: clustering can be performed independently on tiled regions of the feature space, then globally merged, making the method scalable and suitable for distributed systems. Implemented as a lightweight post-processing step atop scikit-learn's k-means, the algorithm performs well on benchmark datasets, achieving high accuracy with minimal additional computation.
中文: 该方法通过为聚类中心分配半径并合并重叠簇来增强k均值算法,能够在k值被高估时重建非凸形状,同时通过递归分区实现可扩展的分布式计算。
English: The proposed method enhances k-means clustering by assigning radii to cluster centers and merging overlapping clusters, enabling reconstruction of non-convex shapes with overestimated k values while supporting scalable distributed computation through recursive partitioning.
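A minimal sketch of the described post-processing on top of scikit-learn's k-means: give each center a farthest-point radius and merge clusters whose radius balls overlap. The union-find merging and the toy data are implementation choices not specified in the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_radius_merge(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_

    # Radius of a cluster = distance from its center to its farthest member.
    radii = np.array([
        np.linalg.norm(X[labels == c] - centers[c], axis=1).max() if (labels == c).any() else 0.0
        for c in range(k)
    ])

    # Merge clusters whose radius balls overlap (union-find over cluster ids).
    parent = list(range(k))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(k):
        for j in range(i + 1, k):
            if np.linalg.norm(centers[i] - centers[j]) < radii[i] + radii[j]:
                parent[find(i)] = find(j)

    merged = np.array([find(c) for c in labels])
    _, merged = np.unique(merged, return_inverse=True)   # relabel to 0..m-1
    return merged

# Two well-separated blobs with k deliberately overestimated.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (100, 2)), rng.normal((3, 3), 0.3, (100, 2))])
print(len(np.unique(kmeans_radius_merge(X, k=6))))  # typically 2 after merging
```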
Authors:Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, Sam Thomson
Abstract:
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
中文摘要:提出的MICE方法通过逐层相似性分析和概率分类计算内部置信度,显著提升了工具使用代理的安全性和实用性,在不同风险场景下均优于基线模型的校准精度和工具调用效能。
English Summary: The proposed MICE method enhances tool-using agents' safety and utility by computing internal confidence scores through layer-wise similarity analysis and probabilistic classification, outperforming baselines in calibration and tool-calling effectiveness across diverse scenarios.
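A rough sketch of the MICE-style feature construction: decode each intermediate layer with a logit-lens projection, score its agreement with the final output, and feed the per-layer scores to a small probabilistic classifier. The single agreement feature, the random toy tensors, and the logistic-regression head are stand-ins, not the paper's exact features or classifier.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def mice_features(hidden_states, unembed, final_ids):
    """Per-layer agreement between logit-lens decodings and the final output.

    hidden_states : list of (seq_len, d_model) tensors, one per intermediate layer.
    unembed       : (vocab, d_model) output-embedding matrix used as a logit lens.
    final_ids     : (seq_len,) token ids actually emitted by the model.
    Returns one score per layer (fraction of matching tokens); the real method
    uses richer similarity features, this is a minimal stand-in.
    """
    feats = []
    for h in hidden_states:
        layer_ids = (h @ unembed.T).argmax(dim=-1)          # logit-lens decoding
        feats.append((layer_ids == final_ids).float().mean().item())
    return feats

# Toy usage with random tensors and labels, just to show the pipeline shape.
torch.manual_seed(0)
layers, seq, d, vocab, n_calls = 6, 8, 16, 50, 40
X = np.array([
    mice_features([torch.randn(seq, d) for _ in range(layers)],
                  torch.randn(vocab, d),
                  torch.randint(0, vocab, (seq,)))
    for _ in range(n_calls)
])
y = np.random.default_rng(0).integers(0, 2, n_calls)    # 1 = tool call was correct
clf = LogisticRegression().fit(X, y)                    # learned confidence estimator
print(clf.predict_proba(X[:3])[:, 1])                   # MICE-style confidences
```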
Authors:Alireza Kazemi, Helia Rezvani, Mahsa Baktashmotlagh
Abstract:
Transferability scores aim to quantify how well a model trained on one domain generalizes to a target domain. Despite numerous methods proposed for measuring transferability, their reliability and practical usefulness remain inconclusive, often due to differing experimental setups, datasets, and assumptions. In this paper, we introduce a comprehensive benchmarking framework designed to systematically evaluate transferability scores across diverse settings. Through extensive experiments, we observe variations in how different metrics perform under various scenarios, suggesting that current evaluation practices may not fully capture each method's strengths and limitations. Our findings underscore the value of standardized assessment protocols, paving the way for more reliable transferability measures and better-informed model selection in cross-domain applications. Additionally, we achieved a 3.5\% improvement using our proposed metric for the head-training fine-tuning experimental setup. Our code is available in this repository: https://github.com/alizkzm/pert_robust_platform.
中文: 本文提出了一个全面的基准测试框架,用于系统评估迁移性评分,发现不同指标在多种场景下表现各异,强调标准化评估协议对于提升跨领域模型选择可靠性的重要性。
English: This paper introduces a comprehensive benchmarking framework to systematically evaluate transferability scores, revealing variations in metric performance and advocating for standardized assessment protocols to enhance reliability in cross-domain model selection.
Authors:Zhonghao Li, Kunpeng Zhang, Jinghuai Ou, Shuliang Liu, Xuming Hu
Abstract:
Retrieval-augmented generation (RAG) systems face significant challenges in multi-hop question answering (MHQA), where complex queries require synthesizing information across multiple document chunks. Existing approaches typically rely on iterative LLM-based query rewriting and routing, resulting in high computational costs due to repeated LLM invocations and multi-stage processes. To address these limitations, we propose TreeHop, an embedding-level framework without the need for LLMs in query refinement. TreeHop dynamically updates query embeddings by fusing semantic information from prior queries and retrieved documents, enabling iterative retrieval through embedding-space operations alone. This method replaces the traditional "Retrieve-Rewrite-Vectorize-Retrieve" cycle with a streamlined "Retrieve-Embed-Retrieve" loop, significantly reducing computational overhead. Moreover, a rule-based stop criterion is introduced to further prune redundant retrievals, balancing efficiency and recall rate. Experimental results show that TreeHop rivals advanced RAG methods across three open-domain MHQA datasets, achieving comparable performance with only 5\%-0.4\% of the model parameter size and reducing the query latency by approximately 99\% compared to concurrent approaches. This makes TreeHop a faster and more cost-effective solution for deployment in a range of knowledge-intensive applications. For reproducibility purposes, codes and data are available here: https://github.com/allen-li1231/TreeHop-RAG.
中文: TreeHop是一种高效的嵌入级框架,通过语义融合动态更新查询嵌入,大幅降低计算成本和延迟,同时在多跳问答任务中保持优异性能。
English: TreeHop is an efficient embedding-level framework that streamlines multi-hop question answering by dynamically updating query embeddings through semantic fusion, significantly reducing computational costs and latency while maintaining competitive performance.
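A hedged sketch of a "Retrieve-Embed-Retrieve" loop in the spirit of the abstract: the query embedding is updated from retrieved-chunk embeddings without any LLM call, and a rule-based criterion stops redundant hops. The residual fusion and stop thresholds here are simple stand-ins for TreeHop's learned update.

```python
import numpy as np

def l2norm(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-9)

def treehop_style_retrieve(query_emb, doc_embs, hops=3, top_k=2, sim_stop=0.95):
    """Iterative multi-hop retrieval purely in embedding space (no query rewriting)."""
    q = l2norm(query_emb)
    retrieved = []
    for _ in range(hops):
        scores = doc_embs @ q
        top = np.argsort(-scores)[:top_k]
        new = [i for i in top if i not in retrieved]
        if not new:
            break                                  # rule-based stop: nothing new found
        retrieved.extend(new)
        # Fuse prior query with retrieved content: steer away from covered info.
        q_next = l2norm(q - doc_embs[new].mean(axis=0) * 0.5)
        if q_next @ q > sim_stop:
            break                                  # query barely changed: stop early
        q = q_next
    return retrieved

rng = np.random.default_rng(0)
docs = l2norm(rng.normal(size=(100, 64)))          # pretend chunk embeddings
query = l2norm(rng.normal(size=64))
print(treehop_style_retrieve(query, docs))
```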
Authors:Damien Martins Gomes
Abstract:
First-order optimization methods remain the standard for training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix. Despite the widespread adoption of first-order methods, second-order optimization algorithms often exhibit superior convergence compared to methods like Adam and SGD. However, their practicality in training DNNs is still limited by a significantly higher per-iteration computational cost compared to first-order methods. In this thesis, we present AdaFisher, a novel adaptive second-order optimizer that leverages a diagonal block-Kronecker approximation of the Fisher information matrix to adaptively precondition gradients. AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training DNNs. Despite the traditionally slower speed of second-order optimizers, AdaFisher is effective for tasks such as image classification and language modeling, exhibiting remarkable stability and robustness during hyperparameter tuning. We demonstrate that AdaFisher outperforms state-of-the-art optimizers in both accuracy and convergence speed. The code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
中文: AdaFisher是一种新型自适应二阶优化器,利用Fisher信息矩阵的对角块Kronecker近似来预处理梯度,旨在将二阶方法的优越收敛性与训练深度神经网络的计算效率相结合。
English: AdaFisher is a novel adaptive second-order optimizer that uses a diagonal block-Kronecker approximation of the Fisher information matrix to precondition gradients, aiming to combine the superior convergence of second-order methods with computational efficiency for training deep neural networks.
Authors:Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
Abstract:
Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, our agent RL training shows a recurring failure mode, the Echo Trap, in which reward variance collapses and gradients spike; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find that the shaping of RL rollouts benefits from diverse initial states, medium interaction granularity, and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
中文摘要:本研究提出了用于多轮强化学习训练大语言模型智能体的StarPO框架和RAGEN系统,揭示了训练中的回声陷阱等挑战,强调需要采用稳定化训练方法和多样化奖励信号来促进智能体的有效推理能力。
English Summary: The study introduces StarPO and RAGEN frameworks for training LLM agents through multi-turn reinforcement learning, revealing challenges like Echo Trap and emphasizing the need for stabilized training methods and diverse reward signals to foster effective agent reasoning.
Authors:Nicola Debole, Pietro Barbiero, Francesco Giannini, Andrea Passerini, Stefano Teso, Emanuele Marconato
Abstract:
Concept Bottleneck Models (CBMs) are neural networks designed to conjoin high performance with ante-hoc interpretability. CBMs work by first mapping inputs (e.g., images) to high-level concepts (e.g., visible objects and their properties) and then use these to solve a downstream task (e.g., tagging or scoring an image) in an interpretable manner. Their performance and interpretability, however, hinge on the quality of the concepts they learn. The go-to strategy for ensuring good quality concepts is to leverage expert annotations, which are expensive to collect and seldom available in applications. Researchers have recently addressed this issue by introducing "VLM-CBM" architectures that replace manual annotations with weak supervision from foundation models. It is however unclear what is the impact of doing so on the quality of the learned concepts. To answer this question, we put state-of-the-art VLM-CBMs to the test, analyzing their learned concepts empirically using a selection of significant metrics. Our results show that, depending on the task, VLM supervision can sensibly differ from expert annotations, and that concept accuracy and quality are not strongly correlated. Our code is available at https://github.com/debryu/CQA.
Chinese: 概念瓶颈模型通过将输入映射到概念来增强可解释性,但其效果依赖于概念质量;使用基础模型的弱监督替代专家标注会降低概念准确性,导致概念质量与准确性之间关联不强。
English: Concept Bottleneck Models (CBMs) enhance interpretability by mapping inputs to concepts, but their effectiveness depends on concept quality, which can be compromised when using weak supervision from foundation models instead of expert annotations, leading to discrepancies in concept accuracy and quality.
Authors:Yonghui Zhai, Yang Zhang, Minghao Shang, Lihua Pang, Yaxin Ren
Abstract:
Graph Transformers (GTs) have shown advantages in numerous graph structure tasks, but their self-attention mechanism ignores the generalization bias of graphs. Existing methods mainly compensate for this bias through position encoding, attention bias, or relative distance, yet remain sub-optimal because they consider only the structural perspective of the generalization bias. To address this, this paper proposes Grafourierformer, which combines GTs with an inductive bias carrying frequency-structure information by applying the Graph Fourier Transform to the attention matrix. Specifically, eigenvalues of the graph Laplacian matrix are used to construct an eigenvalue matrix mask, which reflects node positions and structural relationships with neighboring nodes, enabling the model to account for node-level structural characteristics and focus on local graph details. An inverse Fourier transform is then employed to extract high-frequency and low-frequency node features, compute the corresponding low-frequency and high-frequency energies, and construct a node frequency-energy matrix that filters the eigenvalue matrix mask. This allows attention heads to incorporate both graph structural information and node frequency information, adaptively distinguish global trends from local details, and effectively suppress interference from redundant information. Extensive experiments on various benchmarks show that Grafourierformer consistently outperforms GNN- and GT-based models in graph classification and node classification tasks, with ablation experiments further validating the effectiveness and necessity of the method. Codes are available at https://github.com/Arichibald/Grafourierformer.git
中文摘要:本文提出Grafourierformer,通过图傅里叶变换将频率结构归纳偏置创新性融入图Transformer的注意力机制,同时考虑图结构关系与节点频率信息,在图表征和节点分类任务中实现了最优性能。
English Summary: This paper introduces Grafourierformer, a novel Graph Transformer that integrates frequency-structure inductive bias through Graph Fourier Transform to enhance attention mechanisms by combining structural relationships with node frequency information, achieving superior performance in graph and node classification tasks.
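A small numpy sketch of the graph-spectral ingredients the abstract relies on: eigendecomposition of the normalized Laplacian, a graph Fourier transform of node features, and per-node low- and high-frequency energies. The eigenvalue cutoff and the energy definition are illustrative assumptions; the attention-mask construction itself is omitted.

```python
import numpy as np

def node_frequency_energy(adj, x, cutoff=0.5):
    """Split each node's signal into low/high graph-frequency energy (sketch).

    adj : (N, N) adjacency matrix, x : (N, d) node features. The cutoff on
    normalized-Laplacian eigenvalues separating "low" from "high" is illustrative.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-9)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)
    x_hat = eigvecs.T @ x                                    # graph Fourier transform
    low = eigvals < cutoff
    x_low = eigvecs[:, low] @ x_hat[low]                     # low-frequency component
    x_high = x - x_low                                       # remainder = high-frequency part
    return (x_low ** 2).sum(axis=1), (x_high ** 2).sum(axis=1)

# Toy usage: a 6-node path graph with 4-dimensional node features.
N = 6
adj = np.zeros((N, N))
for i in range(N - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
x = np.random.default_rng(0).normal(size=(N, 4))
low_e, high_e = node_frequency_energy(adj, x)
print(low_e.shape, high_e.shape)   # (6,) (6,)
```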
Authors:Yingbin Bai, Sylvie Thiebaux, Felipe Trevizan
Abstract:
Learning-based planners leveraging Graph Neural Networks can learn search guidance applicable to large search spaces, yet their potential to address symmetries remains largely unexplored. In this paper, we introduce a graph representation of planning problems allying learning efficiency with the ability to detect symmetries, along with two pruning methods, action pruning and state pruning, designed to manage symmetries during search. The integration of these techniques into Fast Downward achieves a first-time success over LAMA on the latest IPC learning track dataset. Code is released at: https://github.com/bybeye/Distincter.
中文: 本文提出了一种基于图的规划表示方法,既能高效学习又能检测对称性,结合剪枝技术集成到Fast Downward后,在最新IPC数据集上首次超越了LAMA。
English: This paper introduces a graph-based planning representation that enables efficient learning and symmetry detection, along with pruning methods that, when integrated into Fast Downward, outperform LAMA on the latest IPC dataset.
Authors:Nikolaos Chaidos, Angeliki Dimitriou, Nikolaos Spanos, Athanasios Voulodimos, Giorgos Stamou
Abstract:
Graph Neural Networks (GNNs) have emerged as an efficient alternative to convolutional approaches for vision tasks such as image classification, leveraging patch-based representations instead of raw pixels. These methods construct graphs where image patches serve as nodes, and edges are established based on patch similarity or classification relevance. Despite their efficiency, the explainability of GNN-based vision models remains underexplored, even though graphs are naturally interpretable. In this work, we analyze the semantic consistency of the graphs formed at different layers of GNN-based image classifiers, focusing on how well they preserve object structures and meaningful relationships. A comprehensive analysis is presented by quantifying the extent to which inter-layer graph connections reflect semantic similarity and spatial coherence. Explanations from standard and adversarial settings are also compared to assess whether they reflect the classifiers' robustness. Additionally, we visualize the flow of information across layers through heatmap-based visualization techniques, thereby highlighting the models' explainability. Our findings demonstrate that the decision-making processes of these models can be effectively explained, while also revealing that their reasoning does not necessarily align with human perception, especially in deeper layers.
中文摘要:本研究通过分析图神经网络各层语义一致性和可视化信息流动,评估其在图像分类中的可解释性,发现虽然模型决策可被有效解释,但其深层推理逻辑与人类感知存在差异。
English Summary: This study evaluates the explainability of Graph Neural Networks (GNNs) in image classification by analyzing semantic consistency across layers and visualizing information flow, revealing that while model decisions are interpretable, their reasoning diverges from human perception in deeper layers.
Authors:Biqing Duan, Qing Wang, Di Liu, Wei Zhou, Zhenli He, Shengfa Miao
Abstract:
Incremental learning, which learns new classes over time after the model's deployment, is becoming increasingly crucial, particularly for industrial edge systems, where it is difficult to communicate with a remote server to conduct computation-intensive learning and more classes are expected to be learned after deployment. In this paper, we propose LODAP, a new on-device incremental learning framework for edge systems. The key part of LODAP is a new module, the Efficient Incremental Module (EIM), composed of normal convolutions and lightweight operations. During incremental learning, EIM exploits lightweight operations, called adapters, to effectively and efficiently learn features for new classes, improving the accuracy of incremental learning while reducing model complexity and training overhead. The efficiency of LODAP is further enhanced by a data pruning strategy that significantly reduces the training data, thereby lowering the training overhead. We conducted extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets. Experimental results show that LODAP improves accuracy by up to 4.32\% over existing methods while reducing model complexity by around 50\%. In addition, evaluations on real edge systems demonstrate its applicability for on-device machine learning. The code is available at https://github.com/duanbiqing/LODAP.
Chinese: LODAP是一种设备端增量学习框架,通过高效增量模块和数据剪枝策略,在提高精度的同时显著降低了模型复杂度和训练开销。
English: LODAP is an on-device incremental learning framework that enhances accuracy while reducing model complexity and training overhead through its Efficient Incremental Module and data pruning strategy.
Authors:Haroui Ma, Francesco Quinzan, Theresa Willem, Stefan Bauer
Abstract:
Machine learning (ML) systems for medical imaging have demonstrated remarkable diagnostic capabilities, but their susceptibility to biases poses significant risks, since biases may negatively impact generalization performance. In this paper, we introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. Our method leverages the concept of counterfactual invariance, measuring the extent to which a model's predictions remain unchanged under hypothetical changes to sensitive attributes. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases without requiring direct access to counterfactual data. Through experiments on synthetic datasets and large-scale real-world medical imaging datasets, including CheXpert and MIMIC-CXR, we demonstrate that our approach aligns closely with counterfactual fairness principles and outperforms standard baselines. This work provides a robust tool to ensure that ML diagnostic systems generalize well, e.g., across demographic groups, offering a critical step towards AI safety in healthcare. Code: https://github.com/Neferpitou3871/AI-Alignment-Medical-Imaging.
中文: 本文提出了一种基于反事实不变性的新型统计框架,用于评估和减轻医学影像机器学习模型中的偏见,通过在真实数据集上的实验证明该方法能有效提升模型在不同人口群体间的泛化性能。
English: This paper introduces a novel statistical framework using counterfactual invariance to evaluate and mitigate biases in medical imaging ML models, demonstrating its effectiveness in improving generalization across demographic groups through experiments on real-world datasets.
Authors:Kitsuya Azuma, Takayuki Nishio, Yuichi Kitagawa, Wakako Nakano, Takahito Tanimura
Abstract:
Federated Learning (FL) enables collaborative model training across decentralized clients, enhancing privacy by keeping data local. Yet conventional FL, relying on frequent parameter-sharing, suffers from high communication overhead and limited model heterogeneity. Distillation-based FL approaches address these issues by sharing predictions (soft-labels) instead, but they often involve redundant transmissions across communication rounds, reducing efficiency. We propose SCARLET, a novel framework integrating synchronized soft-label caching and an enhanced Entropy Reduction Aggregation (Enhanced ERA) mechanism. SCARLET minimizes redundant communication by reusing cached soft-labels, achieving up to 50% reduction in communication costs compared to existing methods while maintaining accuracy. Enhanced ERA can be tuned to adapt to non-IID data variations, ensuring robust aggregation and performance in diverse client scenarios. Experimental evaluations demonstrate that SCARLET consistently outperforms state-of-the-art distillation-based FL methods in terms of accuracy and communication efficiency. The implementation of SCARLET is publicly available at https://github.com/kitsuyaazuma/SCARLET.
Chinese: SCARLET是一种新颖的联邦学习框架,通过同步软标签缓存和增强的熵减聚合机制,在保持不同数据分布下高精度的同时,将通信成本降低了高达50%。
English: SCARLET is a novel federated learning framework that reduces communication costs by up to 50% through synchronized soft-label caching and an enhanced entropy reduction aggregation mechanism, while maintaining high accuracy across diverse data distributions.
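An illustrative sketch of client-side soft-label caching: a sample's soft-labels are retransmitted only if they have drifted from the cached copy, otherwise the synchronized cache is reused. The KL-based drift test and its threshold are assumptions, not SCARLET's actual criterion.

```python
import numpy as np

class SoftLabelCache:
    """Client-side cache: resend a sample's soft-labels only if they drifted.

    The KL-based drift test and the threshold are illustrative choices; the
    point is that unchanged predictions are reused from the synchronized cache
    instead of being retransmitted every communication round.
    """
    def __init__(self, threshold=0.05):
        self.cache = {}            # sample_id -> last transmitted soft-label
        self.threshold = threshold

    def to_send(self, sample_id, soft_label):
        cached = self.cache.get(sample_id)
        if cached is not None:
            kl = np.sum(soft_label * (np.log(soft_label + 1e-9) - np.log(cached + 1e-9)))
            if kl < self.threshold:
                return None        # server reuses its synchronized cached copy
        self.cache[sample_id] = soft_label
        return soft_label

cache = SoftLabelCache()
print(cache.to_send(0, np.array([0.7, 0.2, 0.1])) is not None)       # True: first round, must send
print(cache.to_send(0, np.array([0.69, 0.21, 0.1])) is None)         # True: tiny drift, reuse cache
```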
Authors:Yasir Ghunaim, Andrés Villa, Gergo Ignacz, Gyorgy Szekely, Motasem Alfarra, Bernard Ghanem
Abstract:
Advancements in machine learning for molecular property prediction have improved accuracy but at the expense of higher computational cost and longer training times. Recently, the Joint Multi-domain Pre-training (JMP) foundation model has demonstrated strong performance across various downstream tasks with reduced training time over previous models. Despite JMP's advantages, fine-tuning it on molecular datasets ranging from small-scale to large-scale requires considerable time and computational resources. In this work, we investigate strategies to enhance efficiency by reducing model size while preserving performance. To better understand the model's efficiency, we analyze the layer contributions of JMP and find that later interaction blocks provide diminishing returns, suggesting an opportunity for model compression. We explore block reduction strategies by pruning the pre-trained model and evaluating its impact on efficiency and accuracy during fine-tuning. Our analysis reveals that removing two interaction blocks results in a minimal performance drop, reducing the model size by 32% while increasing inference throughput by 1.3x. These results suggest that JMP-L is over-parameterized and that a smaller, more efficient variant can achieve comparable performance with lower computational cost. Our study provides insights for developing lighter, faster, and more scalable foundation models for molecular and materials discovery. The code is publicly available at: https://github.com/Yasir-Ghunaim/efficient-jmp.
Chinese: 研究表明,联合多领域预训练(JMP)模型存在参数冗余,通过剪除两个交互模块可在保持性能基本不变的同时实现模型体积减小32%、推理速度提升1.3倍,为分子发现提供了更高效的基础模型解决方案。
English: This study demonstrates that the Joint Multi-domain Pre-training (JMP) model is over-parameterized and can be compressed by removing two interaction blocks, achieving a 32% size reduction and 1.3x faster inference with minimal performance loss, offering a more efficient foundation model for molecular discovery.
Authors:Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards
Abstract:
Robust tooling and publicly available pre-trained models have helped drive recent advances in mechanistic interpretability for language models. However, similar progress in vision mechanistic interpretability has been hindered by the lack of accessible frameworks and pre-trained weights. We present Prisma (Access the codebase here: https://github.com/Prisma-Multimodal/ViT-Prisma), an open-source framework designed to accelerate vision mechanistic interpretability research, providing a unified toolkit for accessing 75+ vision and video transformers; support for sparse autoencoder (SAE), transcoder, and crosscoder training; a suite of 80+ pre-trained SAE weights; activation caching, circuit analysis tools, and visualization tools; and educational resources. Our analysis reveals surprising findings, including that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs, and that in some instances, SAE reconstructions can decrease model loss. Prisma enables new research directions for understanding vision model internals while lowering barriers to entry in this emerging field.
中文:Prisma是一个开源框架,通过提供统一的工具包来访问视觉变换器、训练工具、预训练权重和分析资源,加速视觉机制可解释性研究,同时揭示了视觉稀疏自编码器相比语言模型具有更低稀疏性等惊人发现。
English: Prisma is an open-source framework that accelerates vision mechanistic interpretability research by providing a unified toolkit for accessing vision transformers, training tools, pre-trained weights, and analytical resources, while revealing surprising findings such as lower sparsity in vision SAEs compared to language models.
Authors:Dylan Bouchard, Mohit Singh Chauhan
Abstract:
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
中文: 本文提出了一种无需外部资源的通用框架,通过将多种不确定性量化技术转化为标准化置信度分数,并采用可调整的集成方法,有效检测大语言模型的幻觉问题,其性能优于现有方法。
English: This paper introduces a versatile zero-resource framework for detecting hallucinations in Large Language Models by adapting uncertainty quantification techniques into standardized confidence scores and proposing a tunable ensemble approach that outperforms existing methods.
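A minimal sketch of a tunable ensemble of response-level confidence scores as described in the abstract; the three stand-in scorers, the logistic-regression weight tuning, and the clipping are illustrative choices, not the UQLM API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_confidence(component_scores, weights):
    """Weighted combination of per-scorer confidences, clipped to [0, 1].

    component_scores : (n_responses, n_scorers) array, each column already in [0, 1]
                       (e.g., a black-box consistency score, a white-box
                       token-probability score, an LLM-as-a-judge score).
    """
    return np.clip(component_scores @ weights, 0.0, 1.0)

# Tune the weights on a small set of responses labeled hallucinated (0) / faithful (1).
rng = np.random.default_rng(0)
scores = rng.uniform(size=(200, 3))                    # stand-in scorer outputs
labels = (scores.mean(axis=1) + rng.normal(0, 0.1, 200) > 0.5).astype(int)
tuner = LogisticRegression().fit(scores, labels)
weights = tuner.coef_[0] / tuner.coef_[0].sum()        # normalized ensemble weights
print(ensemble_confidence(scores[:5], weights))        # response-level confidences
```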
Authors:Jianlong Chen, Chao Li, Yang Yuan, Andrew C Yao
Abstract:
Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce \textbf{Hierarchical Attention}, a regularization method that aligns LLMs' attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05\% on miniF2F and 1.69\% on ProofNet while reducing proof complexity by 23.81\% and 16.50\% respectively. The code is available at https://github.com/Car-pe/HAGBP.
中文: 分层注意力是一种正则化方法,通过将大语言模型的注意力机制与数学推理结构对齐,提高了定理证明的成功率并降低了证明复杂度。
English: Hierarchical Attention is a regularization method that aligns LLMs' attention with mathematical reasoning structures, improving proof success rates and reducing complexity in theorem proving.
Authors:Piotr Migus
Abstract:
Complex-valued neural networks (CVNNs) excel where phase matters, yet their multi-sheeted decision surfaces defy standard explainability and calibration tools. We propose a \emph{Newton-Puiseux} framework that fits a local polynomial surrogate to a high-uncertainty input and analytically decomposes this surrogate into fractional-power series. The resulting Puiseux expansions, dominant Puiseux coefficients, and phase-aligned curvature descriptors deliver closed-form estimates of robustness and over-confidence that gradient- or perturbation-based methods (saliency, LIME, SHAP) cannot provide. On a controlled $\mathbb{C}^2$ helix the surrogate attains RMSE $< 0.09$ while recovering the number of decision sheets; quartic coefficients predict adversarial flip radii within $10^{-3}$. On the real-world MIT-BIH arrhythmia corpus, Puiseux-guided, phase-aware temperature scaling lowers expected calibration error from 0.087 to 0.034, contributing to the advancement of CVNNs. Full code, pre-trained weights, and scripts are at https://github.com/piotrmgs/puiseux-cvnn.
Chinese: 提出的牛顿-普伊瑟框架通过将局部多项式代理分解为分数幂级数,能够对复数神经网络的鲁棒性和过度自信进行闭式估计,在合成和真实数据集上均实现了更好的校准效果和预测精度。
English: The proposed Newton-Puiseux framework enables closed-form estimation of robustness and over-confidence in complex-valued neural networks by decomposing local polynomial surrogates into fractional-power series, achieving improved calibration and predictive accuracy on both synthetic and real-world datasets.
Authors:Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong
Abstract:
Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including distilled R1 model. Furthermore, SPC can guide the test-time search of diverse LLMs and significantly improve their mathematical reasoning performance on MATH500 and AIME2024, surpassing those guided by state-of-the-art process reward models.
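The adversarial game described above reduces to a simple reward-assignment loop per confrontation; `sneaky_generator` and `critic` below are hypothetical stand-ins for the two fine-tuned model copies, and the ±1 rewards follow the win/lose scheme stated in the abstract. The resulting rewards would then feed a standard RL update for each copy.

```python
def play_round(sneaky_generator, critic, problem, solution_prefix):
    """One confrontation: the generator injects a hard-to-detect erroneous step,
    the critic judges it, and the winner gets +1 while the loser gets -1."""
    bad_step = sneaky_generator(problem, solution_prefix)   # deliberately wrong step
    flagged = critic(problem, solution_prefix, bad_step)    # True = critic caught the error
    critic_reward = 1.0 if flagged else -1.0
    generator_reward = -critic_reward
    return generator_reward, critic_reward

# Dummy callables make the sketch executable without real models.
gen = lambda p, s: s + " ... step: 2 + 2 = 5"
crt = lambda p, s, step: "= 5" in step
print(play_round(gen, crt, "add 2 and 2", "start"))
```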
Authors:Yun Qu, Qi Cheems Wang, Yixiu Mao, Yiqin Lv, Xiangyang Ji
Abstract:
Task robust adaptation is a long-standing pursuit in sequential decision-making. Some risk-averse strategies, e.g., the conditional value-at-risk principle, are incorporated in domain randomization or meta reinforcement learning to prioritize difficult tasks in optimization, which demand costly intensive evaluations. The efficiency issue prompts the development of robust active task sampling to train adaptive policies, where risk-predictive models are used to surrogate policy evaluation. This work characterizes the optimization pipeline of robust active task sampling as a Markov decision process, posits theoretical and practical insights, and constitutes robustness concepts in risk-averse scenarios. Importantly, we propose an easy-to-implement method, referred to as Posterior and Diversity Synergized Task Sampling (PDTS), to accommodate fast and robust sequential decision-making. Extensive experiments show that PDTS unlocks the potential of robust active task sampling, significantly improves the zero-shot and few-shot adaptation robustness in challenging tasks, and even accelerates the learning process under certain scenarios. Our project website is at https://thu-rllab.github.io/PDTS_project_page.
Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, Erfan Sadraiye, Kiarash Kiani Feriz, Mahdi Teymouri Nahad, Ali Moghadasi, Abolfazl Eshagh Abianeh, Nizi Nazar, Hamid R. Rabiee, Mahdieh Soleymani Baghshah, Meisam Ahmadi, Ehsaneddin Asgari
Abstract:
Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.
Chinese: 生成式AI通过整合面部、手势和动作合成等领域的进展,正在革新角色动画技术,本综述为研究人员提供了全面指导,涵盖当前技术与未来发展方向。
English: Generative AI is revolutionizing character animation by integrating advancements in facial, gesture, and motion synthesis, offering a comprehensive review to guide researchers through current technologies and future directions.
Authors:Justin Mücke, Ansgar Scherp
Abstract:
Semantic reasoning aims to infer new knowledge from existing knowledge, with OWL ontologies serving as a standardized framework for organizing information. A key challenge in semantic reasoning is verifying ontology consistency. However, state-of-the-art reasoners are computationally expensive, and their efficiency decreases as ontology sizes grow. While classical machine learning models have been explored for consistency checking, they struggle to capture complex relationships within ontologies. Large language models (LLMs) have shown promising results for simple reasoning tasks but perform poorly on structured reasoning. The recently introduced Graph Language Model (GLM) offers a way to simultaneously process graph-structured data and text. This paper proposes GLaMoR (Graph Language Model for Reasoning), a reasoning pipeline that transforms OWL ontologies into graph-structured data and adapts the GLM architecture for consistency checking. We evaluate GLaMoR on ontologies from the NCBO BioPortal repository, converting them into triples suitable for model input. Our results show that the GLM outperforms all baseline models, achieving $95\%$ accuracy while being 20 times faster than classical reasoners.
The code is available at: https://github.com/JustinMuecke/GLaMoR
Chinese Summary: 本文提出GLaMoR推理框架,通过将OWL本体转换为图结构数据并采用图语言模型进行一致性检测,在保持95%准确率的同时,比传统推理器提速20倍。
English Summary: This paper introduces GLaMoR, a reasoning pipeline that converts OWL ontologies into graph-structured data and utilizes a Graph Language Model for efficient consistency checking, achieving 95% accuracy and 20x faster performance compared to traditional reasoners.
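A minimal sketch of the conversion step the abstract mentions (OWL ontology → triples for model input), using rdflib. The ontology file name and the plain string rendering of each triple are assumptions for illustration, not the GLaMoR pipeline itself.

```python
from rdflib import Graph

def ontology_to_triples(path: str):
    """Parse an OWL/RDF ontology and yield (subject, predicate, object) strings
    that can serve as graph-structured input for a downstream model."""
    g = Graph()
    g.parse(path)  # serialization is guessed from the file extension (e.g., RDF/XML)
    for s, p, o in g:
        yield str(s), str(p), str(o)

# Example with a hypothetical file:
# triples = list(ontology_to_triples("pizza.owl"))
```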
Authors:Gal Almog, Ariel Shamir, Ohad Fried
Abstract:
While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at https://github.com/galmog/REED-VAE
中文: REED训练方案通过优化变分自编码器,实现了高质量的多方法迭代图像编辑,即使在多次操作后仍能保持图像质量,支持扩散模型与传统编辑技术的灵活结合。
English: The REED training scheme for VAEs enables high-quality iterative image editing by preserving image integrity across multiple operations, supporting both diffusion-based and conventional techniques without accumulating artifacts.
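The abstract does not spell out the training objective, so the loop below is only an assumed illustration of the re-encode/decode idea: the VAE is rolled out for several encode–decode iterations and a reconstruction penalty is applied at every iterate, so that quality does not drift across repeated round trips between pixel and latent space.

```python
import torch
import torch.nn.functional as F

def reed_style_loss(vae, x: torch.Tensor, n_iters: int = 4) -> torch.Tensor:
    """Assumed sketch, not the authors' exact objective: penalize drift of repeated
    encode -> decode round trips from the original image. `vae` is any module
    exposing encode() and decode()."""
    loss = 0.0
    current = x
    for _ in range(n_iters):
        z = vae.encode(current)
        current = vae.decode(z)
        loss = loss + F.mse_loss(current, x)
    return loss / n_iters
```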
Authors:Debarati Das, Khanh Chi Le, Ritik Sachin Parkar, Karin De Langis, Brendan Madson, Chad M. Berryman, Robin M. Willis, Daniel H. Moses, Brett McDonnell, Daniel Schwarcz, Dongyeop Kang
Abstract:
Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (https://minnesotanlp.github.io/LawFlow-website/).
Authors:Robert Leppich, Michael Stenger, Daniel Grillmeyer, Vanessa Borst, Samuel Kounev
Abstract:
We introduce a temporal feature encoding architecture called Time Series Representation Model (TSRM) for multivariate time series forecasting and imputation. The architecture is structured around CNN-based representation layers, each dedicated to an independent representation learning task and designed to capture diverse temporal patterns, followed by an attention-based feature extraction layer and a merge layer, designed to aggregate extracted features. The architecture is fundamentally based on a configuration that is inspired by a Transformer encoder, with self-attention mechanisms at its core. The TSRM architecture outperforms state-of-the-art approaches on most of the seven established benchmark datasets considered in our empirical evaluation for both forecasting and imputation tasks. At the same time, it significantly reduces complexity in the form of learnable parameters. The source code is available at https://github.com/RobertLeppich/TSRM.
中文摘要:时间序列表示模型(TSRM)采用基于CNN和注意力机制的架构,在多元时间序列预测和填补任务中优于现有最优方法,同时显著降低了模型复杂度。
English Summary: The Time Series Representation Model (TSRM) introduces a CNN-based architecture with attention mechanisms for multivariate time series forecasting and imputation, outperforming state-of-the-art methods while reducing complexity.
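A compact sketch of the layer ordering the abstract describes (CNN-based representation layers capturing diverse temporal patterns, an attention-based feature extraction layer, then a merge layer). The channel sizes, kernel choices, and simple concatenation merge are assumptions, not the published TSRM architecture.

```python
import torch
import torch.nn as nn

class TinyTSRMBlock(nn.Module):
    """Illustrative only: parallel 1D convolutions with different kernel sizes capture
    diverse temporal patterns, self-attention extracts features from each branch,
    and a linear merge layer aggregates them."""
    def __init__(self, n_vars: int, d_model: int = 64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(n_vars, d_model, kernel_size=k, padding=k // 2) for k in (3, 5, 7)
        )
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.merge = nn.Linear(3 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, n_vars, time)
        feats = []
        for conv in self.convs:
            h = conv(x).transpose(1, 2)                     # (batch, time, d_model)
            h, _ = self.attn(h, h, h)
            feats.append(h)
        return self.merge(torch.cat(feats, dim=-1))         # (batch, time, d_model)

out = TinyTSRMBlock(n_vars=7)(torch.randn(8, 7, 96))
print(out.shape)  # torch.Size([8, 96, 64])
```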
Authors:Jong Inn Park, Maanas Taneja, Qianwen Wang, Dongyeop Kang
Abstract:
Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework, grounding videos in various sources, such as text, figures, visual styles, and avatars. Inspired by content creators' workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing, and incorporates an iterative feedback mechanism where video agents simulate user roles to give feedback on generated videos from previous iterations and refine generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content over the refined loop of video generation. Although preliminary results are still not yet matching human creators' quality, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.
Authors:Tung D. Vu, Chung Hoang, Truong-Son Hy
Abstract:
The Design2Code problem, which involves converting digital designs into functional source code, is a significant challenge in software development due to its complexity and time-consuming nature. Traditional approaches often struggle with accurately interpreting the intricate visual details and structural relationships inherent in webpage designs, leading to limitations in automation and efficiency. In this paper, we propose a novel method that leverages multimodal graph representation learning to address these challenges. By integrating both visual and structural information from design sketches, our approach enhances the accuracy and efficiency of code generation, particularly in producing semantically correct and structurally sound HTML code. We present a comprehensive evaluation of our method, demonstrating significant improvements in both accuracy and efficiency over existing techniques and highlighting the potential of multimodal graph learning to revolutionize design-to-code automation. Code available at https://github.com/HySonLab/Design2Code
中文摘要:本文提出一种多模态图表示学习方法,通过整合视觉和结构信息将设计稿精确转换为HTML代码,相比现有技术展现出显著优越的性能。
English Summary: This paper introduces a multimodal graph representation learning method that effectively converts digital designs into accurate HTML code by integrating visual and structural information, demonstrating superior performance over existing techniques.
Authors:Felix Burr, Marcel Hoffmann, Ansgar Scherp
Abstract:
Despite the ample availability of graph data, obtaining vertex labels is a tedious and expensive task. Therefore, it is desirable to learn from a few labeled vertices only. Existing few-shot learners assume a class oracle, which provides labeled vertices for a desired class. However, such an oracle is not available in a real-world setting, i.e., when drawing a vertex for labeling it is unknown to which class the vertex belongs. Few-shot learners are often combined with prototypical networks, while classical semi-supervised vertex classification uses discriminative models, e.g., Graph Convolutional Networks (GCN). In this paper, we train our models by iteratively prompting a human annotator with vertices to annotate. We perform three experiments where we continually relax our assumptions. First, we assume a class oracle, i.e., the human annotator is provided with an equal number of vertices to label for each class. We denote this as "Balanced Sampling". In the subsequent experiment, "Unbalanced Sampling", we replace the class oracle with $k$-medoids clustering and draw vertices to label from the clusters. In the last experiment, "Unknown Number of Classes", we no longer assume that we know the number and distribution of classes. Our results show that prototypical models outperform discriminative models in all experiments when fewer than $20$ samples per class are available. While dropping the assumption of the class oracle for the "Unbalanced Sampling" experiment reduces the performance of the GCN by $9\%$, the prototypical network loses only $1\%$ on average. For the "Unknown Number of Classes" experiment, the average performance for both models decreased further by $1\%$.
Source code: https://github.com/Ximsa/2023-felix-ma
中文摘要:本研究表明,在顶点分类的少样本学习中,原型网络始终优于如GCN等判别模型,尤其在标记数据稀缺时表现更佳,且在放宽类别分布假设时仍保持较强鲁棒性。
English Summary: This study demonstrates that prototypical networks consistently outperform discriminative models like GCNs in few-shot vertex classification, particularly when labeled data is scarce, and maintain robustness even when class distribution assumptions are relaxed.
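A minimal sketch of the "Unbalanced Sampling" idea: cluster vertex embeddings and ask the human annotator to label one representative per cluster. The greedy medoid selection over KMeans clusters below is a simple stand-in for $k$-medoids, and the embeddings are assumed to come from any vertex encoder.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def medoid_queries(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Pick one representative vertex per cluster to send to the human annotator."""
    cluster_ids = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
    queries = []
    for c in range(k):
        idx = np.where(cluster_ids == c)[0]
        total_dist = cdist(embeddings[idx], embeddings[idx]).sum(axis=1)
        queries.append(idx[total_dist.argmin()])   # cluster medoid
    return np.array(queries)

print(medoid_queries(np.random.rand(100, 16), k=5))
```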
Authors:Jialei Song, Xingquan Zuo, Feiyang Wang, Hai Huang, Tianle Zhang
Abstract:
Deep neural networks (DNNs) are highly susceptible to adversarial samples, raising concerns about their reliability in safety-critical tasks. Currently, methods of evaluating adversarial robustness are primarily categorized into attack-based and certified robustness evaluation approaches. The former not only relies on specific attack algorithms but also is highly time-consuming, while the latter, due to its analytical nature, is typically difficult to implement for large and complex models. A few studies evaluate model robustness based on the model's decision boundary, but they suffer from low evaluation accuracy. To address the aforementioned issues, we propose a novel adversarial robustness evaluation metric, Robustness Difference Index (RDI), which is based on model statistical features. RDI draws inspiration from clustering evaluation by analyzing the intra-class and inter-class distances of feature vectors separated by the decision boundary to quantify model robustness. It is attack-independent and has high computational efficiency. Experiments show that RDI demonstrates a stronger correlation with the gold-standard adversarial robustness metric of attack success rate (ASR). The average computation time of RDI is only 1/30 of the evaluation method based on the PGD attack. Our open-source code is available at: https://github.com/BUPTAIOC/RDI.
Chinese: 提出的鲁棒性差异指数(RDI)通过分析决策边界分隔的特征向量类内与类间距离,提供了一种独立于攻击且计算高效的对抗鲁棒性评估方法,与攻击成功率高度相关,计算时间仅为基于PGD攻击评估方法的1/30。
English: The proposed Robustness Difference Index (RDI) offers an attack-independent and computationally efficient method for evaluating adversarial robustness by analyzing intra-class and inter-class feature distances, showing strong correlation with attack success rates and requiring only 1/30 of the time compared to PGD-based evaluation.
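The exact RDI formula is not given in the abstract, so the function below is an assumed illustration of the underlying quantity only: it contrasts average intra-class spread with the average distance to the nearest other class centroid in feature space, with larger separation suggesting a more robust decision boundary.

```python
import numpy as np

def robustness_index(features: np.ndarray, labels: np.ndarray) -> float:
    """Illustrative intra/inter-class distance contrast over penultimate-layer features."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
                     for i, c in enumerate(classes)])
    pairwise = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    np.fill_diagonal(pairwise, np.inf)
    inter = pairwise.min(axis=1).mean()           # distance to nearest other class
    return float((inter - intra) / (inter + intra))

feats = np.vstack([np.random.randn(50, 8) + 3, np.random.randn(50, 8) - 3])
print(robustness_index(feats, np.array([0] * 50 + [1] * 50)))
```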
Authors:Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun
Abstract:
Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework particularly designed to handle audio-visual joint corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, where the student model learns to predict clean targets, generated by the teacher model, with corrupted input frames. Specifically, we suggest a unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities, by predicting clean audio targets with corrupted videos, and clean video targets with corrupted audios. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec.
中文摘要:该研究提出CAV2vec自监督框架,通过训练模型从受损输入中预测纯净目标,有效提升视听语音识别系统在现实环境中应对音频和视觉干扰的鲁棒性。
English Summary: The study introduces CAV2vec, a self-supervised framework that enhances audio-visual speech recognition by training models to predict clean targets from corrupted inputs, improving robustness against real-world audio and visual disruptions.
Authors:Anna Katariina Wisakanto, Joe Rogero, Avyay M. Casheekar, Richard Mallah
Abstract:
Modern general-purpose artificial intelligence (AI) systems present an urgent risk management challenge, as their rapidly evolving capabilities and potential for catastrophic harm outpace our ability to reliably assess their risks. Current methods often rely on selective testing and undocumented assumptions about risk priorities, frequently failing to make a serious attempt at assessing the set of pathways through which AI systems pose direct or indirect risks to society and the biosphere. This paper introduces the probabilistic risk assessment (PRA) for AI framework, adapting established PRA techniques from high-reliability industries (e.g., nuclear power, aerospace) for the new challenges of advanced AI. The framework guides assessors in identifying potential risks, estimating likelihood and severity bands, and explicitly documenting evidence, underlying assumptions, and analyses at appropriate granularities. The framework's implementation tool synthesizes the results into a risk report card with aggregated risk estimates from all assessed risks. It introduces three methodological advances: (1) Aspect-oriented hazard analysis provides systematic hazard coverage guided by a first-principles taxonomy of AI system aspects (e.g. capabilities, domain knowledge, affordances); (2) Risk pathway modeling analyzes causal chains from system aspects to societal impacts using bidirectional analysis and incorporating prospective techniques; and (3) Uncertainty management employs scenario decomposition, reference scales, and explicit tracing protocols to structure credible projections with novelty or limited data. Additionally, the framework harmonizes diverse assessment methods by integrating evidence into comparable, quantified absolute risk estimates for lifecycle decisions. We have implemented this as a workbook tool for AI developers, evaluators, and regulators.
Authors:KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
Abstract:
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse set of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
Chinese: Kimi-Audio是一款开源的音频基础模型,凭借创新的架构和海量数据训练,在音频理解、生成与对话方面表现卓越,并在多项基准测试中达到领先水平。
English: Kimi-Audio is an open-source audio foundation model that excels in understanding, generating, and conversing with audio, achieving state-of-the-art performance across various benchmarks through innovative architecture and extensive data training.
Authors:Marco Turzi, Siamak Mehrkanoon
Abstract:
Weather forecasting is essential for facilitating diverse socio-economic activity and environmental conservation initiatives. Deep learning techniques are increasingly being explored as complementary approaches to Numerical Weather Prediction (NWP) models, offering potential benefits such as reduced complexity and enhanced adaptability in specific applications. This work presents a novel design, Small Shuffled Attention UNet (SSA-UNet), which enhances SmaAt-UNet's architecture by including a shuffle channeling mechanism to optimize performance and diminish complexity. To assess its efficacy, this architecture and its reduced variant are examined and trained on two datasets: a Dutch precipitation dataset from 2016 to 2019, and a French cloud cover dataset containing radar images from 2017 to 2018. Three output configurations of the proposed architecture are evaluated, yielding outputs of 1, 6, and 12 precipitation maps, respectively. To better understand how this model operates and produces its predictions, a gradient-based approach called Grad-CAM is used to analyze the outputs generated. The analysis of heatmaps generated by Grad-CAM facilitated the identification of regions within the input maps that the model considers most informative for generating its predictions. The implementation of SSA-UNet can be found on our GitHub: https://github.com/MarcoTurzi/SSA-UNet
中文: 本研究提出SSA-UNet深度学习模型,通过引入通道混洗机制优化架构以提升天气预报性能,并基于荷兰和法国气象数据集采用Grad-CAM方法进行预测可解释性分析。
English: This study introduces SSA-UNet, a deep learning model that enhances weather forecasting by optimizing architecture with a shuffle channeling mechanism, evaluated on Dutch and French meteorological datasets using Grad-CAM for interpretability.
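The shuffle channeling mechanism mentioned above follows the standard channel-shuffle operation (reshape, transpose, flatten) popularized by ShuffleNet; the group count below is an arbitrary example, and this is not the authors' exact module.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so grouped convolutions can exchange information.
    x: (batch, channels, height, width), with channels divisible by `groups`."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(b, c, h, w))

print(channel_shuffle(torch.randn(2, 8, 16, 16), groups=4).shape)
```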
Authors:Ritesh Goru, Shanay Mehta, Prateek Jain
Abstract:
Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from $O\bigl(N^{3}\bigr)$ to $O\bigl(N^{2}\bigr)$ and maintaining the same memory complexity for a transformer-based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).
Chinese: 该方法通过复制响应令牌和自定义注意力掩码,实现了多轮对话的单次处理,在保持精度的同时将时间复杂度从O(N³)降低至O(N²),且损失与N次处理方法完全相同。
English: The proposed method duplicates response tokens with a custom attention mask to enable single-pass processing of multi-turn conversations, achieving identical losses to the N-pass approach while reducing time complexity from O(N³) to O(N²) and preserving accuracy.
Authors:Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier, Ludovic Denoyer
Abstract:
Imitation Learning (IL) techniques aim to replicate human behaviors in specific tasks. While IL has gained prominence due to its effectiveness and efficiency, traditional methods often focus on datasets collected from experts to produce a single efficient policy. Recently, extensions have been proposed to handle datasets of diverse behaviors by mainly focusing on learning transition-level diverse policies or on performing entropy maximization at the trajectory level. While these methods may lead to diverse behaviors, they may not be sufficient to reproduce the actual diversity of demonstrations or to allow controlled trajectory generation. To overcome these drawbacks, we propose a different method based on two key features: a) Temporal Consistency that ensures consistent behaviors across entire episodes and not just at the transition level as well as b) Controllability obtained by constructing a latent space of behaviors that allows users to selectively activate specific behaviors based on their requirements. We compare our approach to state-of-the-art methods over a diverse set of tasks and environments. Project page: https://mathieu-petitbois.github.io/projects/swr/
Authors:Kaiyuan Tang, Siyuan Yao, Chaoli Wang
Abstract:
In volume visualization, users can interactively explore the three-dimensional data by specifying color and opacity mappings in the transfer function (TF) or adjusting lighting parameters, facilitating meaningful interpretation of the underlying structure. However, rendering large-scale volumes demands powerful GPUs and high-speed memory access for real-time performance. While existing novel view synthesis (NVS) methods offer faster rendering speeds with lower hardware requirements, the visible parts of a reconstructed scene are fixed and constrained by preset TF settings, significantly limiting user exploration. This paper introduces inverse volume rendering via Gaussian splatting (iVR-GS), an innovative NVS method that reduces the rendering cost while enabling scene editing for interactive volume exploration. Specifically, we compose multiple iVR-GS models associated with basic TFs covering disjoint visible parts to make the entire volumetric scene visible. Each basic model contains a collection of 3D editable Gaussians, where each Gaussian is a 3D spatial point that supports real-time scene rendering and editing. We demonstrate the superior reconstruction quality and composability of iVR-GS against other NVS solutions (Plenoxels, CCNeRF, and base 3DGS) on various volume datasets. The code is available at https://github.com/TouKaienn/iVR-GS.
中文摘要:本文提出iVR-GS方法,通过高斯抛洒实现逆向体绘制,在降低渲染成本的同时支持场景实时编辑,为交互式体数据探索提供了高质量的重建效果。
English Summary: This paper introduces iVR-GS, an inverse volume rendering method using Gaussian splatting that enables real-time scene editing and reduces rendering costs while maintaining high reconstruction quality for interactive volume exploration.
Authors:Mert Sonmezer, Seyda Ertekin
Abstract:
Long-term time series forecasting plays a pivotal role in various real-world applications. Despite recent advancements and the success of different architectures, forecasting is often challenging due to non-stationary nature of the real-world data, which frequently exhibit distribution shifts and temporal changes in statistical properties like mean and variance over time. Previous studies suggest that this inherent variability complicates forecasting, limiting the performance of many models by leading to loss of non-stationarity and resulting in over-stationarization (Liu, Wu, Wang and Long, 2022). To address this challenge, we introduce a novel architecture, ChoronoAdaptive Network (CANet), inspired by style-transfer techniques. The core of CANet is the Non-stationary Adaptive Normalization module, seamlessly integrating the Style Blending Gate and Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017). The Style Blending Gate preserves and reintegrates non-stationary characteristics, such as mean and standard deviation, by blending internal and external statistics, preventing over-stationarization while maintaining essential temporal dependencies. Coupled with AdaIN, which dynamically adapts the model to statistical changes, this approach enhances predictive accuracy under non-stationary conditions. CANet also employs multi-resolution patching to handle short-term fluctuations and long-term trends, along with Fourier analysis-based adaptive thresholding to reduce noise. A Stacked Kronecker Product Layer further optimizes the model's efficiency while maintaining high performance. Extensive experiments on real-world datasets validate CANet's superiority over state-of-the-art methods, achieving a 42% reduction in MSE and a 22% reduction in MAE. The source code is publicly available at https://github.com/mertsonmezer/CANet.
中文摘要:本文提出的ChoronoAdaptive网络(CANet)通过非平稳自适应归一化模块解决了时间序列预测中的分布漂移问题,在真实数据集上相比现有方法实现了42%的均方误差降低和22%的平均绝对误差降低。
English Summary: The proposed ChoronoAdaptive Network (CANet) addresses non-stationary time series forecasting challenges through its Non-stationary Adaptive Normalization module, achieving significant performance improvements with 42% lower MSE and 22% lower MAE compared to existing methods.
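A small sketch of the two ingredients named in the abstract above: Adaptive Instance Normalization re-injects statistics into a per-instance-normalized window, and a blending gate mixes internal statistics with external ones so non-stationary characteristics are not washed out. The scalar gate and tensor shapes are assumptions for illustration.

```python
import torch

def adain_blend(x: torch.Tensor, ext_mean: torch.Tensor, ext_std: torch.Tensor,
                gate: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """x: (batch, length, channels) time-series window.
    Normalize per instance, then re-scale with statistics blended from internal and
    external sources, preserving mean/std information instead of discarding it."""
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True) + eps
    x_norm = (x - mean) / std
    blended_mean = gate * mean + (1 - gate) * ext_mean
    blended_std = gate * std + (1 - gate) * ext_std
    return x_norm * blended_std + blended_mean

x = torch.randn(4, 96, 7)
out = adain_blend(x, ext_mean=torch.zeros(1, 1, 7), ext_std=torch.ones(1, 1, 7))
print(out.shape)
```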
Authors:Max Muchen Sun, Allison Pinosky, Todd Murphey
Abstract:
Ergodic coverage effectively generates exploratory behaviors for embodied agents by aligning the spatial distribution of the agent's trajectory with a target distribution, where the difference between these two distributions is measured by the ergodic metric. However, existing ergodic coverage methods are constrained by the limited set of ergodic metrics available for control synthesis, fundamentally limiting their performance. In this work, we propose an alternative approach to ergodic coverage based on flow matching, a technique widely used in generative inference for efficient and scalable sampling. We formally derive the flow matching problem for ergodic coverage and show that it is equivalent to a linear quadratic regulator problem with a closed-form solution. Our formulation enables alternative ergodic metrics from generative inference that overcome the limitations of existing ones. These metrics were previously infeasible for control synthesis but can now be supported with no computational overhead. Specifically, flow matching with the Stein variational gradient flow enables control synthesis directly over the score function of the target distribution, improving robustness to the unnormalized distributions; on the other hand, flow matching with the Sinkhorn divergence flow enables an optimal transport-based ergodic metric, improving coverage performance on non-smooth distributions with irregular supports. We validate the improved performance and competitive computational efficiency of our method through comprehensive numerical benchmarks and across different nonlinear dynamics. We further demonstrate the practicality of our method through a series of drawing and erasing tasks on a Franka robot.
Authors:Anirudhan Badrinath, Alex Yang, Kousik Rajesh, Prabhat Agarwal, Jaewon Yang, Haoyu Chen, Jiajing Xu, Charles Rosenberg
Abstract:
Representation learning, a task of learning latent vectors to represent entities, is a key task in improving search and recommender systems in web applications. Various representation learning methods have been developed, including graph-based approaches for relationships among entities, sequence-based methods for capturing the temporal evolution of user activities, and content-based models for leveraging text and visual content. However, the development of a unifying framework that integrates these diverse techniques to support multiple applications remains a significant challenge.
This paper presents OmniSage, a large-scale representation framework that learns universal representations for a variety of applications at Pinterest. OmniSage integrates graph neural networks with content-based models and user sequence models by employing multiple contrastive learning tasks to effectively process graph data, user sequence data, and content signals. To support the training and inference of OmniSage, we developed an efficient infrastructure capable of supporting Pinterest graphs with billions of nodes. The universal representations generated by OmniSage have significantly enhanced user experiences on Pinterest, leading to an approximate 2.5% increase in sitewide repins (saves) across five applications. This paper highlights the impact of unifying representation learning methods, and we make the model code publicly available at https://github.com/pinterest/atg-research/tree/main/omnisage.
中文: OmniSage是一个统一的表示学习框架,通过对比学习整合图神经网络、基于内容的模型和用户序列模型,在Pinterest应用中显著提升用户参与度,使全站转存量增加约2.5%。
English: OmniSage is a unified representation learning framework that integrates graph neural networks, content-based models, and user sequence models through contrastive learning, significantly improving user engagement on Pinterest with a 2.5% increase in repins across applications.
Authors:Haochen Wang, Zhiwei Shi, Chengxi Zhu, Yafei Qiao, Cheng Zhang, Fan Yang, Pengjie Ren, Lan Lu, Dong Xuan
Abstract:
Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), can produce excellent control policies for challenging agile robot tasks, such as sports robots. However, no existing work has harmonized learning-based policies with model-based methods to reduce training complexity and ensure safety and stability in agile badminton robot control. In this paper, we introduce Hamlet, a novel hybrid control system for agile badminton robots. Specifically, we propose a model-based strategy for chassis locomotion which provides a base for the arm policy. We introduce a physics-informed "IL+RL" training framework for the learning-based arm policy. In this training framework, a model-based strategy with privileged information is used to guide arm policy training during both the IL and RL phases. In addition, we train the critic model during the IL phase to alleviate the performance drop issue when transitioning from IL to RL. We present results on our self-engineered badminton robot, achieving a 94.5% success rate against the serving machine and a 90.7% success rate against human players. Our system can be easily generalized to other agile mobile manipulation tasks such as agile catching and table tennis. Our project website: https://dreamstarring.github.io/HAMLET/.
Authors:Mingchen Jiang, Peng Xu, Xichen Ye, Xiaohui Chen, Yun Yang, Yifan Chen
Abstract:
Distributional data have become increasingly prominent in modern signal processing, highlighting the necessity of computing optimal transport (OT) maps across multiple probability distributions. Nevertheless, recent studies on neural OT methods predominantly focused on the efficient computation of a single map between two distributions. To address this challenge, we introduce a novel approach to learning transport maps for new empirical distributions. Specifically, we employ the transformer architecture to produce embeddings from distributional data of varying length; these embeddings are then fed into a hypernetwork to generate neural OT maps. Various numerical experiments were conducted to validate the embeddings and the generated OT maps. The model implementation and the code are provided on https://github.com/jiangmingchen/HOTET.
中文摘要:本文提出了一种基于Transformer的超网络方法,用于高效学习多个经验分布的最优传输映射,解决了现有方法主要计算两个分布间单一传输映射的局限性。
English summary: This paper introduces a transformer-based hypernetwork approach to efficiently learn optimal transport maps for multiple empirical distributions, addressing the limitation of existing methods that primarily compute single transport maps between two distributions.
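A toy sketch of the hypernetwork idea described above: an embedding of an empirical distribution is mapped to the weights of a small MLP that acts as the transport map. The layer sizes, the flat embedding vector, and the one-hidden-layer map are illustrative assumptions, not the published model.

```python
import torch
import torch.nn as nn

class TinyHyperOT(nn.Module):
    """Distribution embedding -> weights of a one-hidden-layer transport map T(x)."""
    def __init__(self, dim: int = 2, embed: int = 32, hidden: int = 16):
        super().__init__()
        self.dim, self.hidden = dim, hidden
        n_params = hidden * dim + hidden + dim * hidden + dim   # W1, b1, W2, b2
        self.hyper = nn.Sequential(nn.Linear(embed, 64), nn.ReLU(), nn.Linear(64, n_params))

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        """z: (embed,) embedding of the target distribution; x: (n, dim) source samples."""
        p = self.hyper(z)
        d, h = self.dim, self.hidden
        W1, p = p[:h * d].view(h, d), p[h * d:]
        b1, p = p[:h], p[h:]
        W2, b2 = p[:d * h].view(d, h), p[d * h:]
        return torch.relu(x @ W1.T + b1) @ W2.T + b2            # T(x): (n, dim)

model = TinyHyperOT()
print(model(torch.randn(32), torch.randn(100, 2)).shape)  # torch.Size([100, 2])
```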
Authors:Matthijs van der Lende, Jeremias Lino Ferrao, Niclas Müller-Hof
Abstract:
Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at https://github.com/matthjs/xai-gp.
Chinese: 深度高斯过程和深度西格玛点过程在分布内校准表现优异,但在分布偏移下存在脆弱性,而深度集成方法在性能和校准方面均展现出更强的鲁棒性。
English: Deep Gaussian Processes and Deep Sigma Point Processes provide strong in-distribution calibration but show vulnerabilities under distribution shifts, whereas Deep Ensembles demonstrate superior robustness in both performance and calibration.
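Since the comparison above hinges on Expected Calibration Error, here is a standard equal-width-bin ECE computation for a classifier; the bin count is a common default, not a value specified by the authors.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """probs: (n, n_classes) predicted probabilities; labels: (n,) true class indices."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    ece, edges = 0.0, np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # weight each bin's |accuracy - confidence| gap by its share of samples
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return float(ece)

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
print(expected_calibration_error(probs, np.array([0, 1, 1, 1])))
```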
Authors:Óscar Escudero-Arnanz, Antonio G. Marques, Inmaculada Mora-Jiménez, Joaquín Álvarez-Rodríguez, Cristina Soguero-Ruiz
Abstract:
Background and Objectives: Multidrug Resistance (MDR) is a critical global health issue, causing increased hospital stays, healthcare costs, and mortality. This study proposes an interpretable Machine Learning (ML) framework for MDR prediction, aiming for both accurate inference and enhanced explainability.
Methods: Patients are modeled as Multivariate Time Series (MTS), capturing clinical progression and patient-to-patient interactions. Similarity among patients is quantified using MTS-based methods: descriptive statistics, Dynamic Time Warping, and Time Cluster Kernel. These similarity measures serve as inputs for MDR classification via Logistic Regression, Random Forest, and Support Vector Machines, with dimensionality reduction and kernel transformations improving model performance. For explainability, patient similarity networks are constructed from these metrics. Spectral clustering and t-SNE are applied to identify MDR-related subgroups and visualize high-risk clusters, enabling insight into clinically relevant patterns.
Results: The framework was validated on ICU Electronic Health Records from the University Hospital of Fuenlabrada, achieving an AUC of 81%. It outperforms baseline ML and deep learning models by leveraging graph-based patient similarity. The approach identifies key risk factors -- prolonged antibiotic use, invasive procedures, co-infections, and extended ICU stays -- and reveals clinically meaningful clusters. Code and results are available at https://github.com/oscarescuderoarnanz/DM4MTS.
Conclusions: Patient similarity representations combined with graph-based analysis provide accurate MDR prediction and interpretable insights. This method supports early detection, risk factor identification, and patient stratification, highlighting the potential of explainable ML in critical care.
中文: 本研究提出了一种可解释的机器学习框架,通过患者相似性网络和多变量时间序列分析,实现了对多重耐药性的精准预测(AUC达81%),并识别出关键临床风险因素,为重症监护提供了可解释的临床洞见。
English: This study introduces an interpretable machine learning framework that uses patient similarity networks and multivariate time series analysis to accurately predict multidrug resistance, achieving an 81% AUC and identifying key clinical risk factors for enhanced explainability in critical care.
Authors:Shengtao Zhang, Haokai Zhang, Shiqi Lou, Zicheng Wang, Zinan Zeng, Yilin Wang, Minnan Luo
Abstract:
Dynamic node classification is critical for modeling evolving systems like financial transactions and academic collaborations. In such systems, dynamically capturing node information changes is critical for dynamic node classification, which usually requires all labels at every timestamp. However, it is difficult to collect all dynamic labels in real-world scenarios due to high annotation costs and label uncertainty (e.g., ambiguous or delayed labels in fraud detection). In contrast, final timestamp labels are easier to obtain as they rely on complete temporal patterns and are usually maintained as a unique label for each user in many open platforms, without tracking the history data. To bridge this gap, we propose PTCL(Pseudo-label Temporal Curriculum Learning), a pioneering method addressing label-limited dynamic node classification where only final labels are available. PTCL introduces: (1) a temporal decoupling architecture separating the backbone (learning time-aware representations) and decoder (strictly aligned with final labels), which generate pseudo-labels, and (2) a Temporal Curriculum Learning strategy that prioritizes pseudo-labels closer to the final timestamp by assigning them higher weights using an exponentially decaying function. We contribute a new academic dataset (CoOAG), capturing long-range research interest in dynamic graph. Experiments across real-world scenarios demonstrate PTCL's consistent superiority over other methods adapted to this task. Beyond methodology, we propose a unified framework FLiD (Framework for Label-Limited Dynamic Node Classification), consisting of a complete preparation workflow, training pipeline, and evaluation standards, and supporting various models and datasets. The code can be found at https://github.com/3205914485/FLiD.
中文: PTCL是一种仅使用最终时间戳标签进行动态节点分类的创新方法,通过伪标签生成和时间课程学习策略有效解决了标签稀缺问题。
English: PTCL is a novel method for dynamic node classification using only final timestamp labels, employing pseudo-label generation and temporal curriculum learning to overcome limited label availability.
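The exponentially decaying weighting the abstract describes is easy to state concretely: pseudo-labels closer to the final timestamp receive higher weight. The decay rate below is an arbitrary example value.

```python
import numpy as np

def pseudo_label_weights(timestamps: np.ndarray, final_t: float, decay: float = 0.1) -> np.ndarray:
    """Weight pseudo-labels more heavily the closer their timestamp t is to the final
    (ground-truth) timestamp T: w_t = exp(-decay * (T - t))."""
    return np.exp(-decay * (final_t - np.asarray(timestamps, dtype=float)))

print(pseudo_label_weights(np.array([0, 5, 9, 10]), final_t=10))
```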
Authors:Ivan Rossi, Flavio Sartori, Cesare Rollo, Giovanni Birolo, Piero Fariselli, Tiziana Sanavia
Abstract:
Survival analysis often relies on Cox models, assuming both linearity and proportional hazards (PH). This study evaluates machine and deep learning methods that relax these constraints, comparing their performance with penalized Cox models on a benchmark of three synthetic and three real datasets. In total, eight different models were tested, including six non-linear models of which four were also non-PH. Although Cox regression often yielded satisfactory performance, we showed the conditions under which machine and deep learning models can perform better. Indeed, the performance of these methods has often been underestimated due to the improper use of Harrell's concordance index (C-index) instead of more appropriate scores such as Antolini's concordance index, which generalizes C-index in cases where the PH assumption does not hold. In addition, since occasionally high C-index models happen to be badly calibrated, combining Antolini's C-index with Brier's score is useful to assess the overall performance of a survival method. Results on our benchmark data showed that survival prediction should be approached by testing different methods to select the most appropriate one according to sample size, non-linearity and non-PH conditions. To allow an easy reproducibility of these tests on our benchmark data, code and documentation are freely available at https://github.com/compbiomed-unito/survhive.
Chinese: 本研究比较了机器学习、深度学习模型与Cox回归在生存分析中的表现,结果表明在特定条件下,使用Antolini一致性指数和Brier评分等恰当指标评估时,非线性和非比例风险模型能够优于传统方法。
English: This study compares machine and deep learning models with Cox regression for survival analysis, demonstrating that non-linear and non-proportional hazards models can outperform traditional methods under specific conditions when evaluated with appropriate metrics like Antolini's C-index and Brier's score.
Authors:Mingqi Yuan, Qi Wang, Guozheng Ma, Bo Li, Xin Jin, Yunbo Wang, Xiaokang Yang, Wenjun Zeng, Dacheng Tao
Abstract:
Developing lifelong learning agents is crucial for artificial general intelligence. However, deep reinforcement learning (RL) systems often suffer from plasticity loss, where neural networks gradually lose their ability to adapt during training. Despite its significance, this field lacks unified benchmarks and evaluation protocols. We introduce Plasticine, the first open-source framework for benchmarking plasticity optimization in deep RL. Plasticine provides single-file implementations of over 13 mitigation methods, 10 evaluation metrics, and learning scenarios with increasing non-stationarity levels from standard to open-ended environments. This framework enables researchers to systematically quantify plasticity loss, evaluate mitigation strategies, and analyze plasticity dynamics across different contexts. Our documentation, examples, and source code are available at https://github.com/RLE-Foundation/Plasticine.
中文: Plasticine是首个用于深度强化学习中可塑性优化的开源基准框架,它提供了多种缓解方法、评估指标和学习场景,以系统性应对可塑性损失问题。
English: Plasticine is introduced as the first open-source framework to benchmark plasticity optimization in deep RL, offering implementations of mitigation methods, evaluation metrics, and learning scenarios to systematically address plasticity loss.
Authors:Vipin Singh, Tianheng Ling, Teodor Chiaburu, Felix Biessmann
Abstract:
Climate change increases the frequency of extreme rainfall, placing a significant strain on urban infrastructures, especially Combined Sewer Systems (CSS). Overflows from overburdened CSS release untreated wastewater into surface waters, posing environmental and public health risks. Although traditional physics-based models are effective, they are costly to maintain and difficult to adapt to evolving system dynamics. Machine Learning (ML) approaches offer cost-efficient alternatives with greater adaptability. To systematically assess the potential of ML for modeling urban infrastructure systems, we propose a protocol for evaluating Neural Network architectures for CSS time series forecasting with respect to predictive performance, model complexity, and robustness to perturbations. In addition, we assess model performance on peak events and critical fluctuations, as these are the key regimes for urban wastewater management. To investigate the feasibility of lightweight models suitable for IoT deployment, we compare global models, which have access to all information, with local models, which rely solely on nearby sensor readings. Additionally, to explore the security risks posed by network outages or adversarial attacks on urban infrastructure, we introduce error models that assess the resilience of models. Our results demonstrate that while global models achieve higher predictive performance, local models provide sufficient resilience in decentralized scenarios, ensuring robust modeling of urban infrastructure. Furthermore, models with longer native forecast horizons exhibit greater robustness to data perturbations. These findings contribute to the development of interpretable and reliable ML solutions for sustainable urban wastewater management. The implementation is available in our GitHub repository.
中文: 气候变化加剧了极端降雨,给城市合流制污水系统带来压力,本研究提出评估机器学习模型的协议,发现全局模型虽预测更优,但局部模型在分散式场景中更具韧性,有助于可持续污水管理。
English: Climate change exacerbates extreme rainfall, stressing urban sewer systems, and this study proposes a protocol to evaluate machine learning models for forecasting, finding that while global models perform better, local models offer resilience for decentralized, sustainable wastewater management.
Authors:Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
Abstract:
We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at https://github.com/AlbertoFdezHdez/OUI.
中文: 过拟合-欠拟合指示器(OUI)是一种创新工具,无需验证数据即可通过监测训练动态有效识别最佳权重衰减超参数,从而及早发现过拟合或欠拟合问题以提升模型泛化能力。
English: The Overfitting-Underfitting Indicator (OUI) is a novel tool that efficiently identifies optimal Weight Decay hyperparameters by monitoring training dynamics without validation data, enabling early detection of overfitting or underfitting for improved model generalization.
Authors:Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
Abstract:
Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
中文摘要:本研究提出广义邻域注意力(GNA)解决稀疏注意力效率问题,通过优化Blackwell架构实现,在生成模型中最高可获得46%的端到端加速效果。
English Summary: The study introduces Generalized Neighborhood Attention (GNA) to address sparse attention inefficiencies, achieving up to 46% speedup on generative models through optimized Blackwell architecture implementation.
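To make the locality patterns concrete, the sketch below builds boolean attention masks for the sliding-window, strided sliding-window, and blocked patterns that GNA unifies, and reports how much of the O(n^2) score matrix each one skips. It is a plain NumPy illustration of the masking geometry only, under assumed window and block sizes; it is not the CUTLASS/NATTEN Blackwell kernels described above, and GNA's exact strided definition may differ.

```python
import numpy as np

def sliding_window_mask(n, window, stride=1):
    """True where query i may attend key j.

    window: half-width of the neighborhood around each query.
    stride: strided sliding window groups queries into bins of size
            `stride` that share one neighborhood center (stride=1 is the
            classic sliding window).
    """
    centers = (np.arange(n) // stride) * stride + stride // 2
    keys = np.arange(n)
    return np.abs(centers[:, None] - keys[None, :]) <= window

def blocked_mask(n, block):
    """True where query i and key j fall into the same block of size `block`."""
    blk = np.arange(n) // block
    return blk[:, None] == blk[None, :]

def sparsity(mask):
    """Fraction of attention scores that are skipped."""
    return 1.0 - mask.mean()

if __name__ == "__main__":
    n = 32
    for name, m in [("sliding", sliding_window_mask(n, window=4)),
                    ("strided", sliding_window_mask(n, window=4, stride=4)),
                    ("blocked", blocked_mask(n, block=8))]:
        print(f"{name:8s} sparsity = {sparsity(m):.2f}")
```

In practice the achievable speedup depends on how well such masks map onto hardware-friendly blocks, which is exactly what the paper's simulator models when it bounds realistic speedups for a given setting.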
Authors:Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Abstract:
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
中文: ThinkPRM是一种生成式长思维链验证器,仅需1%的过程标注即可在多个基准测试中超越现有方法,以极少的监督实现高效验证计算扩展。
English: ThinkPRM is a generative, long chain-of-thought verifier that achieves superior performance across multiple benchmarks using only 1% of process labels, effectively scaling test-time verification with minimal supervision.
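As a rough illustration of how a verbalized, generative verifier plugs into best-of-N selection, the sketch below samples N candidate solutions, asks a verifier for a verification chain-of-thought that ends in a score, and keeps the highest-scoring candidate. The `generate_solution` and `verify_with_cot` callables and the "Score:" convention are hypothetical placeholders, not ThinkPRM's prompts or scoring scheme.

```python
import re
from typing import Callable, Tuple

def best_of_n(
    question: str,
    generate_solution: Callable[[str], str],      # hypothetical: samples one solution
    verify_with_cot: Callable[[str, str], str],   # hypothetical: returns a verification CoT
    n: int = 8,
) -> Tuple[str, float]:
    """Pick the candidate whose verification CoT reports the highest score.

    We assume the verifier ends its chain-of-thought with a line like
    'Score: 0.87'; real verbalized PRMs may instead emit per-step
    correct/incorrect judgments that get aggregated.
    """
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate_solution(question)
        cot = verify_with_cot(question, candidate)
        m = re.search(r"Score:\s*([01](?:\.\d+)?)", cot)
        score = float(m.group(1)) if m else 0.0
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    import random
    gen = lambda q: f"answer={random.randint(0, 9)}"
    ver = lambda q, c: f"Checking {c} step by step... Score: {random.random():.2f}"
    print(best_of_n("2+2?", gen, ver, n=4))
```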
Authors:Gerardus Croonen, Andreas Trondl, Julia Simon, Daniel Steininger
Abstract:
While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aid in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high-quality annotated dataset and two-stage method for the detection, semantic segmentation and mass estimation of post-harvest and post-storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine-grained semantic segmentation regarding damages, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50-95 of 98.8 for sugar-beet detection and an mIoU of 64.0 for the best-performing segmentation model.
Chinese Summary: 本研究提出了一种新颖的标注数据集和两阶段方法,用于在单目RGB图像中检测、语义分割和估算收获后储存的甜菜质量,以应对因微生物和多余植被导致的储存期间糖分损失问题。
English Summary: This study introduces a new annotated dataset and a two-stage method for detecting, segmenting, and estimating the mass of sugar beets in images to address sugar loss during storage caused by microorganisms and vegetation.
Authors:Ceren Yildirim, Kamer Kaya, Sinan Yildirim, Erkay Savas
Abstract:
We propose a new framework for Bayesian estimation of differential privacy, incorporating evidence from multiple membership inference attacks (MIA). Bayesian estimation is carried out via a Markov chain Monte Carlo (MCMC) algorithm, named MCMC-DP-Est, which provides an estimate of the full posterior distribution of the privacy parameter (rather than just credible intervals). Critically, the proposed method does not assume that privacy auditing is performed with the most powerful attack on the worst-case (dataset, challenge point) pair, which is typically unrealistic. Instead, MCMC-DP-Est jointly estimates the strengths of the MIAs used and the privacy of the training algorithm, yielding a more cautious privacy analysis. We also present an economical way to generate measurements of the performance of an MIA that is to be used by the MCMC method to estimate privacy. We demonstrate the methods with numerical examples on both artificial and real data.
Chinese: 作者提出了一种贝叶斯框架,通过结合多种成员推理攻击的证据来估计差分隐私,采用名为MCMC-DP-Est的马尔可夫链蒙特卡罗算法,在不依赖不切实际的最坏情况假设下,获得隐私参数的完整后验分布。
English: The authors introduce a Bayesian framework for estimating differential privacy using multiple membership inference attacks, employing an MCMC algorithm called MCMC-DP-Est to derive the full posterior distribution of privacy parameters without relying on unrealistic worst-case assumptions.
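The abstract does not spell out the likelihood linking MIA outcomes to the privacy parameter, so the sketch below shows only a generic random-walk Metropolis skeleton of the kind a method like MCMC-DP-Est could run once that log-posterior is supplied; the `log_posterior` used here is a made-up placeholder, not the paper's model.

```python
import numpy as np

def metropolis(log_posterior, x0, n_samples=5000, step=0.2, seed=0):
    """Random-walk Metropolis over a scalar parameter (e.g., epsilon > 0)."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_posterior(x0)
    samples = []
    for _ in range(n_samples):
        prop = x + step * rng.normal()
        lp_prop = log_posterior(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject step
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

if __name__ == "__main__":
    # Placeholder posterior: exponential prior on epsilon times a fake
    # Gaussian "evidence" term; the real evidence comes from MIA outcomes.
    def log_posterior(eps):
        if eps <= 0:
            return -np.inf
        return -eps - 0.5 * ((eps - 2.0) / 0.5) ** 2

    draws = metropolis(log_posterior, x0=1.0)
    print("posterior mean of epsilon ~", draws[1000:].mean().round(2))
```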
Authors:William Corrias, Fabio De Gaspari, Dorjan Hitaj, Luigi V. Mancini
Abstract:
Recent advances in generative models have led to their application in password guessing, with the aim of replicating the complexity, structure, and patterns of human-created passwords. Despite their potential, inconsistencies and inadequate evaluation methodologies in prior research have hindered meaningful comparisons and a comprehensive, unbiased understanding of their capabilities. This paper introduces MAYA, a unified, customizable, plug-and-play benchmarking framework designed to facilitate the systematic characterization and benchmarking of generative password-guessing models in the context of trawling attacks. Using MAYA, we conduct a comprehensive assessment of six state-of-the-art approaches, which we re-implemented and adapted to ensure standardization. Our evaluation spans eight real-world password datasets and covers an exhaustive set of advanced testing scenarios, totaling over 15,000 compute hours. Our findings indicate that these models effectively capture different aspects of human password distribution and exhibit strong generalization capabilities. However, their effectiveness varies significantly with long and complex passwords. Our evaluation shows that sequential models consistently outperform other generative architectures and traditional password-guessing tools, demonstrating unique capabilities in generating accurate and complex guesses. Moreover, the diverse password distributions learned by the models enable a multi-model attack that outperforms the best individual model. By releasing MAYA, we aim to foster further research, providing the community with a new tool to consistently and reliably benchmark generative password-guessing models. Our framework is publicly available at https://github.com/williamcorrias/MAYA-Password-Benchmarking.
中文: 本文提出了MAYA这一统一基准测试框架,用于评估生成式密码猜测模型,通过广泛测试发现序列模型优于其他方法且多模型攻击效果最佳,该框架已公开以推动标准化研究。
English: This paper introduces MAYA, a unified benchmarking framework for evaluating generative password-guessing models, revealing through extensive testing that sequential models outperform other methods and that multi-model attacks are most effective, with the framework publicly released to standardize future research.
Authors:Seungyoon Choi, Sein Kim, Hongseok Kang, Wonjoong Kim, Chanyoung Park
Abstract:
Traditional user modeling (UM) approaches have primarily focused on designing models for a single specific task, but they face limitations in generalization and adaptability across various tasks. Recognizing these challenges, recent studies have shifted towards continual learning (CL)-based universal user representation learning aiming to develop a single model capable of handling multiple tasks. Despite advancements, existing methods are in fact evaluated under an unrealistic scenario that does not consider the passage of time as tasks progress, which overlooks newly emerged items that may change the item distribution of previous tasks. In this paper, we introduce a practical evaluation scenario on which CL-based universal user representation learning approaches should be evaluated, which takes into account the passage of time as tasks progress. Then, we propose a novel framework Dynamic Time-aware continual user representation learner, named DITTO, designed to alleviate catastrophic forgetting despite continuous shifts in item distribution, while also allowing the knowledge acquired from previous tasks to adapt to the current shifted item distribution. Through our extensive experiments, we demonstrate the superiority of DITTO over state-of-the-art methods under a practical evaluation scenario. Our source code is available at https://github.com/seungyoon-Choi/DITTO_official.
Chinese: 本文提出了一种实用的评估场景和名为DITTO的新型持续学习通用用户表示框架,通过动态适应时间引起的物品分布变化并减轻灾难性遗忘,有效解决了现有方法的局限性。
English: This paper introduces a practical evaluation scenario and a novel framework called DITTO for continual learning-based universal user representation, addressing limitations in existing methods by dynamically adapting to time-induced shifts in item distribution while mitigating catastrophic forgetting.
Authors:Ye Tian, Yanqiu Yu, Jianguo Sun, Yanbin Wang
Abstract:
Malicious URLs persistently threaten the cybersecurity ecosystem by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews exhibit four critical gaps: 1) their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) they fail to incorporate pivotal LLM/Transformer-based defenses; 3) no open-source implementations are collected to facilitate benchmarking; 4) insufficient dataset coverage. This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g., Transformers, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (URL, HTML, Visual, etc.). This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze: 1) publicly available datasets (2016-2024), and 2) open-source implementations from published works (2013-2025). Then, we outline essential design principles and architectural frameworks for product-level implementations. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for the ongoing curation of datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.
中文摘要:本文对恶意URL检测技术进行全面综述,提出基于模态的新分类法,并通过整理开源实现和数据集来解决现有研究的不足,为建立标准化基准提供支持。
English Summary: This paper provides a comprehensive review of malicious URL detection technologies, introducing a novel modality-based taxonomy and addressing critical gaps in existing research by curating open-source implementations and datasets to establish standardized benchmarking.
Authors:Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti
Abstract:
In practical settings, differentially private Federated learning (DP-FL) is the dominant method for training models from private, on-device client data. Recent work has suggested that DP-FL may be enhanced or outperformed by methods that use DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering based on public information and/or iterative private client feedback. Our key insight is that the private client feedback collected by prior DP synthetic data methods (Hou et al., 2024; Xie et al., 2024) can be viewed as an RL (reinforcement learning) reward. Our algorithm, Policy Optimization for Private Data (POPri) harnesses client feedback using policy optimization algorithms such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate high-quality DP synthetic data. To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri substantially improves the utility of DP synthetic data relative to prior work on LargeFedBench datasets and an existing benchmark from Xie et al. (2024). POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri.
中文: POPri算法通过策略优化提升联邦学习中差分隐私合成数据的质量,在LargeFedBench等基准测试中将非隐私环境下的效用差距缩小高达58%。
English: The POPri algorithm leverages policy optimization to enhance differentially private synthetic data generation in federated learning, significantly narrowing the utility gap with non-private settings by up to 58% on benchmarks like LargeFedBench.
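The key move in POPri is to treat privately collected client feedback as a reward for preference optimization. A minimal sketch of that data plumbing, assuming per-candidate reward scores are already available: pair the highest- and lowest-scored generation for each prompt so an off-the-shelf DPO trainer can consume them. Field names are hypothetical, and the privacy machinery and training loop are omitted.

```python
from typing import Dict, List

def build_dpo_pairs(generations: Dict[str, List[str]],
                    rewards: Dict[str, List[float]]) -> List[dict]:
    """Turn reward-scored generations into (chosen, rejected) pairs.

    generations: prompt -> list of candidate texts sampled from the LLM
    rewards:     prompt -> matching list of scores aggregated from
                 (differentially private) client feedback
    """
    pairs = []
    for prompt, cands in generations.items():
        scored = sorted(zip(rewards[prompt], cands), key=lambda t: t[0])
        if len(scored) < 2:
            continue
        pairs.append({
            "prompt": prompt,
            "chosen": scored[-1][1],    # highest client-feedback reward
            "rejected": scored[0][1],   # lowest client-feedback reward
        })
    return pairs

if __name__ == "__main__":
    gens = {"Write a support reply:": ["Sure, here ...", "No.", "Happy to help ..."]}
    rews = {"Write a support reply:": [0.62, 0.05, 0.81]}
    print(build_dpo_pairs(gens, rews))
```

A standard DPO fine-tuning pass over such pairs would then play the role of the policy-optimization stage; POPri's contribution lies in how the feedback is gathered under differential privacy and folded into this loop, which the sketch does not reproduce.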
Authors:André Longon
Abstract:
An important capacity in visual object recognition is invariance to image-altering variables which leave the identity of objects unchanged, such as lighting, rotation, and scale. How do neural networks achieve this? Prior mechanistic interpretability research has illuminated some invariance-building circuitry in InceptionV1, but the results are limited and networks with different architectures have remained largely unexplored. This work investigates ResNet18 with a particular focus on its residual stream, an architectural component which InceptionV1 lacks. We observe that many convolutional channels in intermediate blocks exhibit scale invariant properties, computed by the element-wise residual summation of scale equivariant representations: the block input's smaller-scale copy with the block pre-sum output's larger-scale copy. Through subsequent ablation experiments, we attempt to causally link these neural properties with scale-robust object recognition behavior. Our tentative findings suggest how the residual stream computes scale invariance and its possible role in behavior. Code is available at: https://github.com/cest-andre/residual-stream-interp
中文: 本研究揭示了ResNet18通过残差流整合多尺度等变表征来实现尺度不变性的机制,这种机制可能支撑了模型跨尺度物体识别的稳健表现。
English: This study explores how ResNet18's residual stream achieves scale invariance by combining scale-equivariant representations, potentially enabling robust object recognition across different sizes.
Authors:Obed Korshie Dzikunu, Amirhossein Toosi, Shadab Ahamed, Sara Harsini, Francois Benard, Xiaoxiao Li, Arman Rahmim
Abstract:
This study performs a comprehensive evaluation of quantitative measurements as extracted from automated deep-learning-based segmentation methods, beyond traditional Dice Similarity Coefficient assessments, focusing on six quantitative metrics, namely SUVmax, SUVmean, total lesion activity (TLA), tumor volume (TMTV), lesion count, and lesion spread. We analyzed 380 prostate-specific membrane antigen (PSMA) targeted [18F]DCFPyL PET/CT scans of patients with biochemical recurrence of prostate cancer, training deep neural networks, U-Net, Attention U-Net and SegResNet with four loss functions: Dice Loss, Dice Cross Entropy, Dice Focal Loss, and our proposed L1 weighted Dice Focal Loss (L1DFL). Evaluations indicated that Attention U-Net paired with L1DFL achieved the strongest correlation with the ground truth (concordance correlation = 0.90-0.99 for SUVmax and TLA), whereas models employing the Dice Loss and the other two compound losses, particularly with SegResNet, underperformed. Equivalence testing (TOST, alpha = 0.05, Delta = 20%) confirmed high performance for SUV metrics, lesion count and TLA, with L1DFL yielding the best performance. By contrast, tumor volume and lesion spread exhibited greater variability. Bland-Altman, Coverage Probability, and Total Deviation Index analyses further highlighted that our proposed L1DFL minimizes variability in quantification of the ground truth clinical measures. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
中文: 本研究评估了深度学习分割方法在前列腺癌PET/CT扫描中的应用,发现采用L1DFL损失函数的Attention U-Net模型与临床测量值具有最佳相关性,同时显著降低了定量评估的变异性。
English: This study evaluates deep-learning segmentation methods for prostate cancer PET/CT scans, finding that Attention U-Net with the proposed L1DFL loss function achieves superior correlation with clinical measurements while reducing variability in quantitative assessments.
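For readers unfamiliar with the six quantitative metrics, the snippet below computes them from an SUV volume and a binary lesion mask using standard definitions (TMTV as total segmented volume, TLA as SUVmean times TMTV, lesion spread approximated here as the largest distance between lesion centroids); the study's exact operational definitions may differ.

```python
import numpy as np
from scipy import ndimage

def pet_metrics(suv: np.ndarray, mask: np.ndarray, voxel_volume_ml: float) -> dict:
    """Lesion-level quantitative metrics from an SUV volume and a binary mask."""
    mask = mask.astype(bool)
    if not mask.any():
        return {"SUVmax": 0.0, "SUVmean": 0.0, "TMTV_ml": 0.0,
                "TLA": 0.0, "lesion_count": 0, "lesion_spread": 0.0}

    labeled, n_lesions = ndimage.label(mask)          # connected components
    tmtv = mask.sum() * voxel_volume_ml               # total tumor volume
    suv_mean = float(suv[mask].mean())
    metrics = {
        "SUVmax": float(suv[mask].max()),
        "SUVmean": suv_mean,
        "TMTV_ml": float(tmtv),
        "TLA": suv_mean * float(tmtv),                # total lesion activity
        "lesion_count": int(n_lesions),
    }
    # Crude "spread": largest pairwise distance between lesion centroids (in voxels).
    centroids = np.array(ndimage.center_of_mass(mask, labeled, range(1, n_lesions + 1)))
    if n_lesions > 1:
        d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
        metrics["lesion_spread"] = float(d.max())
    else:
        metrics["lesion_spread"] = 0.0
    return metrics

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    suv = rng.gamma(2.0, 1.5, size=(32, 32, 32))
    mask = np.zeros_like(suv, dtype=bool)
    mask[5:9, 5:9, 5:9] = True
    mask[20:23, 20:23, 20:23] = True
    print(pet_metrics(suv, mask, voxel_volume_ml=0.064))
```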
Authors:Zexi Fan, Yan Sun, Shihao Yang, Yiping Lu
Abstract:
High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Simulation-Calibrated Scientific Machine Learning (SCaSML), a physics-informed framework that dynamically refines and debiases SciML predictions during inference by enforcing physical laws. SCaSML leverages newly derived physical laws that quantify systematic errors and employs Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas to dynamically correct the prediction. Both numerical and theoretical analyses confirm enhanced convergence rates via compute-optimal inference methods. Our numerical experiments demonstrate that SCaSML reduces errors by 20-50% compared to the base surrogate model, establishing it as the first algorithm to refine approximated solutions to high-dimensional PDEs during inference. Code for SCaSML is available at https://github.com/Francis-Fan-create/SCaSML.
中文:提出的SCaSML框架通过强制物理定律在推理过程中动态优化和校正科学机器学习预测,相比基础模型将高维偏微分方程求解误差降低了20-50%。
English: The proposed SCaSML framework dynamically refines and debiases scientific machine learning predictions during inference by enforcing physical laws, achieving 20-50% error reduction in solving high-dimensional PDEs compared to base models.
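To give a concrete feel for the Monte Carlo ingredient, the sketch below uses the Feynman-Kac representation of the heat equation u_t + (1/2)Δu = 0 with terminal condition u(T, x) = g(x), namely u(t, x) = E[g(x + W_{T-t})], to produce an unbiased estimate that a surrogate prediction can be checked or corrected against. This is a toy stand-in, not SCaSML's calibrated estimator or its Elworthy-Bismut-Li machinery.

```python
import numpy as np

def feynman_kac_heat(g, x, t, T, n_paths=100_000, seed=0):
    """Monte Carlo estimate of u(t, x) for u_t + 0.5*Laplacian(u) = 0, u(T, .) = g.

    Feynman-Kac: u(t, x) = E[ g(x + W_{T-t}) ] with W a standard Brownian motion.
    """
    rng = np.random.default_rng(seed)
    d = len(x)
    w = rng.normal(scale=np.sqrt(T - t), size=(n_paths, d))
    return g(np.asarray(x) + w).mean()

if __name__ == "__main__":
    # Terminal condition g(x) = ||x||^2 gives the closed form u(t, x) = ||x||^2 + d*(T - t),
    # so we can check the estimator and use it to flag a biased surrogate prediction.
    g = lambda y: (y ** 2).sum(axis=-1)
    x, t, T, d = np.array([0.5, -0.2, 1.0]), 0.0, 1.0, 3
    exact = (x ** 2).sum() + d * (T - t)
    mc = feynman_kac_heat(g, x, t, T)
    surrogate = exact * 1.1                     # pretend the SciML model is 10% off
    print(f"exact={exact:.3f}  monte_carlo={mc:.3f}  surrogate={surrogate:.3f}")
```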
Authors:Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang
Abstract:
LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of multi-agent RL (MARL) methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-POMDP, a new POMDP formulation that aligns with LaMAS optimization in real-world applications, along with a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and open challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
中文: 本文提出多智能体强化微调(MARFT)新范式,通过设计Flex-POMDP框架和开源实现,解决了多智能体强化学习在基于大语言模型的多智能体系统中应用的核心挑战,为开发自适应智能体系统提供了技术路线。
English: This article introduces Multi-Agent Reinforcement Fine-Tuning (MARFT), a novel paradigm that addresses the challenges of applying multi-agent reinforcement learning to LLM-based multi-agent systems by proposing a flexible POMDP framework and providing open-source implementation to advance adaptive agentic solutions.
Authors:Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, Han-Jia Ye
Abstract:
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data -- features, samples, and objectives -- and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding. More information can be found in the following repository: https://github.com/LAMDA-Tabular/Tabular-Survey.
中文摘要:表格数据是机器学习中最常用的数据类型之一,本文系统梳理了其表示学习方法,将模型分为专用、可迁移和通用三类,并探讨了相关应用与扩展方向。
English Summary: Tabular data is widely used in machine learning, and this survey systematically explores representation learning methods for it, categorizing models into specialized, transferable, and general types while discussing their applications and extensions.
Authors:Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
Abstract:
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the maj@n metric, it consistently surpasses the maj@n upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
中文: 本文提出了一种名为测试时强化学习(TTRL)的新方法,通过利用多数投票等测试时扩展技术进行奖励估计,使大语言模型能够在无标注测试数据上借助强化学习实现自我进化。
English: This paper introduces Test-Time Reinforcement Learning (TTRL), a novel method that enables large language models to self-improve using reinforcement learning on unlabeled test data by leveraging test-time scaling techniques like majority voting for reward estimation.
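A minimal sketch of the reward signal TTRL builds on: sample several rollouts per unlabeled question, take the majority-voted answer as a pseudo-label, and reward each rollout by agreement with it. The `extract_answer` helper is a hypothetical placeholder, and the actual RL update (e.g., a PPO/GRPO-style step) is omitted.

```python
from collections import Counter
from typing import List

def extract_answer(completion: str) -> str:
    """Hypothetical helper: pull the final answer out of a completion."""
    return completion.strip().split()[-1]

def majority_vote_rewards(completions: List[str]) -> List[float]:
    """Reward = 1 if a rollout's answer matches the majority answer, else 0."""
    answers = [extract_answer(c) for c in completions]
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

if __name__ == "__main__":
    rollouts = [
        "Let's compute... the answer is 42",
        "After simplifying we get 42",
        "I think it is 41",
        "Clearly 42",
    ]
    print(majority_vote_rewards(rollouts))   # [1.0, 1.0, 0.0, 1.0]
```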
Authors:Alycia Carey, Xintao Wu
Abstract:
Client-level fairness metrics for federated learning are used to ensure that all clients in a federation either: a) have similar final performance on their local data distributions (i.e., client parity), or b) obtain final performance on their local data distributions relative to their contribution to the federated learning process (i.e., contribution fairness). While a handful of works that propose either client-parity or contribution-based fairness metrics ground their definitions and decisions in social theories of equality -- such as distributive justice -- most works arbitrarily choose what notion of fairness to align with which makes it difficult for practitioners to choose which fairness metric aligns best with their fairness ethics. In this work, we propose UDJ-FL (Uncertainty-based Distributive Justice for Federated Learning), a flexible federated learning framework that can achieve multiple distributive justice-based client-level fairness metrics. Namely, by utilizing techniques inspired by fair resource allocation, in conjunction with performing aleatoric uncertainty-based client weighing, our UDJ-FL framework is able to achieve egalitarian, utilitarian, Rawls' difference principle, or desert-based client-level fairness. We empirically show the ability of UDJ-FL to achieve all four defined distributive justice-based client-level fairness metrics in addition to providing fairness equivalent to (or surpassing) other popular fair federated learning works. Further, we provide justification for why aleatoric uncertainty weighing is necessary to the construction of our UDJ-FL framework as well as derive theoretical guarantees for the generalization bounds of UDJ-FL. Our code is publicly available at https://github.com/alycia-noel/UDJ-FL.
中文: 本研究提出的UDJ-FL框架通过基于偶然不确定性的客户端加权和公平资源分配技术,能够实现基于分配正义的多种客户端级公平指标,包括平等主义、功利主义、罗尔斯差异原则和应得公平,其公平性表现达到或超越了现有联邦学习方法。
English: The proposed UDJ-FL framework enables federated learning systems to achieve multiple client-level fairness metrics based on distributive justice principles, including egalitarian, utilitarian, Rawlsian, and desert-based fairness, while demonstrating empirical performance that matches or exceeds existing approaches.
Authors:Lotfi Abdelkrim Mecharbat, Ibrahim Almakky, Martin Takac, Mohammad Yaqub
Abstract:
Deep learning (DL) has achieved remarkable progress in the field of medical imaging. However, adapting DL models to medical tasks remains a significant challenge, primarily due to two key factors: (1) architecture selection, as different tasks necessitate specialized model designs, and (2) weight initialization, which directly impacts the convergence speed and final performance of the models. Although transfer learning from ImageNet is a widely adopted strategy, its effectiveness is constrained by the substantial differences between natural and medical images. To address these challenges, we introduce Medical Neural Network Search (MedNNS), the first Neural Network Search framework for medical imaging applications. MedNNS jointly optimizes architecture selection and weight initialization by constructing a meta-space that encodes datasets and models based on how well they perform together. We build this space using a Supernetwork-based approach, expanding the model zoo size by 51x over previous state-of-the-art (SOTA) methods. Moreover, we introduce rank loss and Fréchet Inception Distance (FID) loss into the construction of the space to capture inter-model and inter-dataset relationships, thereby achieving more accurate alignment in the meta-space. Experimental results across multiple datasets demonstrate that MedNNS significantly outperforms both ImageNet pre-trained DL models and SOTA Neural Architecture Search (NAS) methods, achieving an average accuracy improvement of 1.7% across datasets while converging substantially faster. The code and the processed meta-space are available at https://github.com/BioMedIA-MBZUAI/MedNNS.
中文: 深度学习在医学影像领域取得显著进展,但模型适应仍具挑战;MedNNS通过构建元空间联合优化架构与权重初始化,以更高精度和更快收敛速度超越现有方法。
English: Deep learning has advanced medical imaging but faces challenges in model adaptation, which MedNNS addresses by jointly optimizing architecture and weight initialization through a meta-space, outperforming existing methods with higher accuracy and faster convergence.
Authors:Diego de Oliveira Hitzges, Suman Ghosh, Guillermo Gallego
Abstract:
Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high-speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event-data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: https://github.com/tub-rip/DERD-Net
Chinese: 本文提出了一种基于事件相机的新型深度学习框架,通过3D卷积和循环结构处理局部视差空间图像,在单目和立体设置中均实现最先进性能,同时保持恒定计算复杂度。
English: This paper introduces a novel deep learning framework for event-based depth estimation that processes local disparity space images with 3D convolutions and recurrent structures, achieving state-of-the-art performance in both monocular and stereo setups while maintaining constant computational complexity.
Authors:Yannic Neuhaus, Matthias Hein
Abstract:
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE.
中文摘要:本研究评估了MSCOCO数据集中的标签错误对POPE物体幻觉基准的影响,通过重新标注数据创建RePOPE后发现,模型性能排名发生显著变化,凸显了标签质量的重要性。
English Summary: This study evaluates how label errors in the MSCOCO dataset affect the POPE object hallucination benchmark, revealing that re-annotating the data (RePOPE) significantly alters model performance rankings and underscores the importance of label quality.
Authors:Anjiang Wei, Huanmi Tan, Tarun Suresh, Daniel Mendoza, Thiago S. F. X. Teixeira, Ke Wang, Caroline Trippel, Alex Aiken
Abstract:
Recent advances in Large Language Models (LLMs) have sparked growing interest in applying them to Electronic Design Automation (EDA) tasks, particularly Register Transfer Level (RTL) code generation. While several RTL datasets have been introduced, most focus on syntactic validity rather than functional validation with tests, leading to training examples that compile but may not implement the intended behavior. We present VERICODER, a model for RTL code generation fine-tuned on a dataset validated for functional correctness. This fine-tuning dataset is constructed using a novel methodology that combines unit test generation with feedback-directed refinement. Given a natural language specification and an initial RTL design, we prompt a teacher model (GPT-4o-mini) to generate unit tests and iteratively revise the RTL design based on its simulation results using the generated tests. If necessary, the teacher model also updates the tests to ensure they comply with the natural language specification. As a result of this process, every example in our dataset is functionally validated, consisting of a natural language description, an RTL implementation, and passing tests. Fine-tuned on this dataset of 125,777 examples, VERICODER achieves state-of-the-art metrics in functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4%, respectively. An ablation study further shows that models trained on our functionally validated dataset outperform those trained on functionally non-validated datasets, underscoring the importance of high-quality datasets in RTL code generation. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/VeriCoder
Chinese: VERICODER是一个基于功能验证数据集微调的RTL代码生成模型,通过结合单元测试生成和反馈导向优化的新方法,在功能正确性方面达到了最先进的性能。
English: VERICODER is a model fine-tuned on a functionally validated RTL code generation dataset, achieving state-of-the-art performance in functional correctness through a novel methodology combining unit test generation and feedback-directed refinement.
Authors:Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Abstract:
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
中文摘要:CameraBench是一个用于评估和改进摄像机运动理解的大规模数据集和基准,包含专家标注的视频和运动基元分类法,揭示了现有模型的局限性,并通过微调实现了性能提升。
English Summary: CameraBench is a comprehensive dataset and benchmark for evaluating camera motion understanding, featuring expert-annotated videos and a taxonomy of motion primitives that reveals the limitations of current models and enables improved performance through fine-tuning.
Authors:Calvin Luo, Zilai Zeng, Yilun Du, Chen Sun
Abstract:
Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models intimately understand alignment with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. On the other hand, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also considering their independent data and resource considerations. We successfully demonstrate across robotic environments that adapting powerful video models with small scales of example data can successfully facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but also exhibits robustness to the quality of adaptation data, successfully solving novel tasks even when only suboptimal in-domain demonstrations are available.
Authors:Qifan Yan, Andrew Liu, Shiqi He, Mathias Lécuyer, Ivan Beschastnikh
Abstract:
Federated learning (FL) is a machine learning paradigm that facilitates massively distributed model training with end-user data on edge devices directed by a central server. However, the large number of heterogeneous clients in FL deployments leads to a communication bottleneck between the server and the clients. This bottleneck is made worse by straggling clients, any one of which will further slow down training. To tackle these challenges, researchers have proposed techniques like client sampling and update compression. These techniques work well in isolation but combine poorly in the downstream, server-to-client direction. This is because unselected clients have outdated local model states and need to synchronize these states with the server first.
We introduce FedFetch, a strategy to mitigate the download time overhead caused by combining client sampling and compression techniques. FedFetch achieves this with an efficient prefetch schedule for clients to prefetch model states multiple rounds before a stated training round. We empirically show that adding FedFetch to communication efficient FL techniques reduces end-to-end training time by 1.26$\times$ and download time by 4.49$\times$ across compression techniques with heterogeneous client settings. Our implementation is available at https://github.com/DistributedML/FedFetch
中文: 联邦学习面临异构客户端导致的通信瓶颈,结合客户端采样和压缩技术会加剧此问题,但FedFetch通过预取模型状态减少了下载时间,将训练时间缩短1.26倍,下载时间降低4.49倍。
English: Federated learning faces communication bottlenecks from heterogeneous clients, which are worsened by combining client sampling and compression, but FedFetch reduces download time by prefetching model states, cutting training time by 1.26x and download time by 4.49x.
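The core idea is that if the server can pre-sample which clients will participate in round t, those clients can start downloading and synchronizing model state several rounds earlier, hiding the extra download cost that sampling plus compression otherwise introduces. A toy schedule builder under that assumption (pre-sampled participation, fixed lookahead) is sketched below; it is not FedFetch's actual prefetch scheduling algorithm.

```python
import random
from collections import defaultdict
from typing import Dict, Set

def build_prefetch_schedule(
    participants_per_round: Dict[int, Set[str]],
    lookahead: int = 3,
) -> Dict[int, Set[str]]:
    """Map round r -> clients that should start prefetching model state at r.

    A client selected for round t starts fetching at max(0, t - lookahead),
    so its (possibly compressed, slightly stale) download overlaps with earlier rounds.
    """
    schedule: Dict[int, Set[str]] = defaultdict(set)
    for t, clients in participants_per_round.items():
        schedule[max(0, t - lookahead)] |= clients
    return dict(schedule)

if __name__ == "__main__":
    random.seed(0)
    pool = [f"client{i}" for i in range(20)]
    participants = {t: set(random.sample(pool, 4)) for t in range(5, 10)}
    for r, clients in sorted(build_prefetch_schedule(participants).items()):
        print(f"start prefetch at round {r}: {sorted(clients)}")
```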
Authors:Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna
Abstract:
Recent reasoning models through test-time scaling have demonstrated that long chain-of-thoughts can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V$^*$ Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.
Authors:Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, Fei-Yue Wang
Abstract:
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. Code and models are available at https://github.com/CJReinforce/PURE.
中文: 过程奖励模型(PRM)在强化微调中因传统求和形式信用分配易导致奖励破解,但提出的PURE方法采用最小值形式信用分配有效缓解此问题,显著提升推理性能和模型准确率。
English: Process reward models (PRMs) face reward hacking issues in reinforcement fine-tuning due to the conventional summation-form credit assignment, but the proposed PURE method with min-form credit assignment effectively mitigates this problem, achieving superior reasoning performance and model accuracy.
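The contrast between the two credit-assignment schemes is easy to state in code: given per-step process rewards r_1..r_T, the canonical scheme assigns step t the gamma-discounted sum of future rewards, while the min-form assigns it the minimum of future rewards, which bounds the value range and keeps one highly rewarded step from masking an earlier flawed one. A small sketch mirroring the formulation stated in the abstract:

```python
from typing import List

def sum_form_values(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Canonical credit assignment: V_t = sum_{k >= t} gamma^(k-t) * r_k."""
    values, running = [0.0] * len(rewards), 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        values[t] = running
    return values

def min_form_values(rewards: List[float]) -> List[float]:
    """Min-form credit assignment: V_t = min_{k >= t} r_k."""
    values, running = [0.0] * len(rewards), float("inf")
    for t in range(len(rewards) - 1, -1, -1):
        running = min(running, rewards[t])
        values[t] = running
    return values

if __name__ == "__main__":
    # One flawed step (0.1) surrounded by steps a PRM scores highly.
    step_rewards = [0.9, 0.1, 0.95, 0.99]
    print("sum-form:", [round(v, 2) for v in sum_form_values(step_rewards)])
    print("min-form:", [round(v, 2) for v in min_form_values(step_rewards)])
```

The sum-form values stay large even before the flawed step, whereas the min-form values collapse to 0.1 there, which is the bounding effect the paper credits with reducing reward hacking.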
Authors:Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Abstract:
We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity
中文摘要:本研究设计了一套最小算法任务来评估语言模型的创造性局限,发现多标记方法在生成多样性输出上优于单标记学习,且输入层噪声注入在平衡随机性与连贯性方面不亚于温度采样。
English Summary: This study introduces minimal algorithmic tasks to evaluate the creative limitations of language models, demonstrating that multi-token approaches outperform next-token learning in generating diverse outputs and that input-layer noise injection rivals temperature sampling for balancing randomness and coherence.
Authors:Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig
Abstract:
C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.
Chinese: CRUST-Bench是一个包含100个C语言仓库及对应安全Rust接口与测试用例的数据集,用于评估C到Rust的转译系统,结果表明现有方法在生成安全地道的Rust代码方面仍面临挑战,最优模型仅能完成15项任务。
English: CRUST-Bench is a dataset of 100 C repositories with safe Rust interfaces and test cases, designed to evaluate C-to-Rust transpilation systems, revealing that current methods struggle with generating safe, idiomatic Rust code as even the best model solved only 15 tasks.
Authors:Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty
Abstract:
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judge empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
中文: JETTS基准测试表明,尽管LLM评判模型在重排序任务中与结果奖励模型表现相当,但在束搜索中不如过程奖励模型,且其自然语言评析目前无法有效指导生成模型改进回答。
English: The JETTS benchmark reveals that while LLM-judges are competitive with outcome reward models in reranking tasks, they underperform process reward models in beam search and their natural language critiques currently fail to effectively guide generators toward improved responses.
Authors:Sarah Alnegheimish, Zelin He, Matthew Reimherr, Akash Chandrayan, Abhinav Pradhan, Luca D'Angelo
Abstract:
With the widespread availability of sensor data across industrial and operational systems, we frequently encounter heterogeneous time series from multiple systems. Anomaly detection is crucial for such systems to facilitate predictive maintenance. However, most existing anomaly detection methods are designed for either univariate or single-system multivariate data, making them insufficient for these complex scenarios. To address this, we introduce M$^2$AD, a framework for unsupervised anomaly detection in multivariate time series data from multiple systems. M$^2$AD employs deep models to capture expected behavior under normal conditions, using the residuals as indicators of potential anomalies. These residuals are then aggregated into a global anomaly score through a Gaussian Mixture Model and Gamma calibration. We theoretically demonstrate that this framework can effectively address heterogeneity and dependencies across sensors and systems. Empirically, M$^2$AD outperforms existing methods in extensive evaluations by 21% on average, and its effectiveness is demonstrated on a large-scale real-world case study on 130 assets in Amazon Fulfillment Centers. Our code and results are available at https://github.com/sarahmish/M2AD.
中文: 本文提出M²AD无监督异常检测框架,通过深度学习建模正常行为并利用高斯混合模型聚合残差,有效处理多系统多元时间序列的异质性问题,在实验中比现有方法平均性能提升21%。
English: This paper introduces M²AD, an unsupervised anomaly detection framework that addresses heterogeneity in multivariate time series from multiple systems by modeling normal behavior with deep learning and aggregating residuals through Gaussian Mixture Models, achieving 21% better performance than existing methods.
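A stripped-down version of the scoring pipeline described above: collect residuals between observations and the deep model's predictions under normal conditions, fit a Gaussian Mixture Model to them, and score new residuals by negative log-likelihood (the Gamma calibration that turns these scores into a global alarm threshold is omitted). The sketch assumes scikit-learn and precomputed residuals; it is not the M$^2$AD implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_residual_scorer(normal_residuals: np.ndarray, n_components: int = 3):
    """Fit a GMM on residuals gathered from normal operation.

    normal_residuals: (n_samples, n_sensors) array of |observed - predicted|.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=0)
    gmm.fit(normal_residuals)
    return gmm

def anomaly_scores(gmm: GaussianMixture, residuals: np.ndarray) -> np.ndarray:
    """Higher score = less likely under the normal-behavior mixture."""
    return -gmm.score_samples(residuals)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = np.abs(rng.normal(0, 0.1, size=(2000, 4)))      # small residuals
    gmm = fit_residual_scorer(normal)
    test = np.vstack([np.abs(rng.normal(0, 0.1, size=(5, 4))),
                      np.abs(rng.normal(1.5, 0.3, size=(2, 4)))])  # last 2 anomalous
    print(np.round(anomaly_scores(gmm, test), 1))
```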
Authors:Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan
Abstract:
We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at https://ml-dragon.github.io/web.
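One concrete way to realize the distribution-to-distribution rewards DRAGON can optimize is the Fréchet distance between Gaussians fitted to embeddings of generated samples and of an exemplar set (the statistic underlying FAD/FID), with the reward taken as its negative. The sketch below computes that statistic from precomputed embeddings; it is not DRAGON's training loop, and the choice of encoder (e.g., CLAP) is left abstract.

```python
import numpy as np
from scipy import linalg

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: (n, d) arrays of embeddings (e.g., generated vs. exemplar audio).
    """
    mu_x, mu_y = x.mean(0), y.mean(0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = linalg.sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):            # drop tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

def distribution_reward(generated_emb: np.ndarray, exemplar_emb: np.ndarray) -> float:
    """Higher is better: negative Frechet distance to the exemplar distribution."""
    return -frechet_distance(generated_emb, exemplar_emb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    exemplars = rng.normal(0.0, 1.0, size=(512, 16))
    close = rng.normal(0.1, 1.0, size=(256, 16))
    far = rng.normal(2.0, 1.0, size=(256, 16))
    print("reward(close):", round(distribution_reward(close, exemplars), 2))
    print("reward(far):  ", round(distribution_reward(far, exemplars), 2))
```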
Authors:Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples
Abstract:
Parameter-efficient transfer learning (PETL) methods adapt large artificial neural networks to downstream tasks without fine-tuning the entire model. However, existing additive methods, such as adapters, sometimes struggle to capture distributional shifts in intermediate feature embeddings. We propose a novel histogram-based parameter-efficient tuning (HPT) technique that captures the statistics of the target domain and modulates the embeddings. Experimental results on three downstream passive sonar datasets (ShipsEar, DeepShip, VTUAD) demonstrate that HPT outperforms conventional adapters. Notably, HPT achieves 91.8% vs. 89.8% accuracy on VTUAD. Furthermore, HPT trains faster and yields feature representations closer to those of fully fine-tuned models. Overall, HPT balances parameter savings and performance, providing a distribution-aware alternative to existing adapters and shows a promising direction for scalable transfer learning in resource-constrained environments. The code is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/HLAST_DeepShip_ParameterEfficient.
中文: 提出的基于直方图的参数高效调优(HPT)方法能有效捕捉目标域统计特征来调节嵌入表示,在被动声纳数据集上以更少参数实现了比传统适配器更高的准确率和训练效率。
English: The proposed Histogram-based Parameter-efficient Tuning (HPT) method effectively captures target domain statistics to modulate embeddings, outperforming conventional adapters in accuracy and training efficiency on passive sonar datasets while maintaining parameter efficiency.
Authors:Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
Abstract:
In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
中文:EasyEdit2是一个即插即用的框架,通过测试时干预让用户无需深厚技术知识即可轻松控制大型语言模型的行为,仅需一个示例即可精确调整模型响应。
English: EasyEdit2 is a plug-and-play framework that enables users to easily control Large Language Model behaviors through test-time interventions, requiring minimal technical knowledge and allowing precise adjustments with just a single example.
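For intuition about what a steering-vector generator and applier do, here is a generic activation-steering sketch: compute a steering vector as the difference of hidden activations between a positive and a negative exemplar at one layer, then add a scaled copy of it during the forward pass via a hook. It is written against a toy PyTorch module so it runs standalone; EasyEdit2's actual modules, supported steering methods, and API differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "language model": an embedding plus two MLP blocks standing in for layers.
class ToyLM(nn.Module):
    def __init__(self, vocab=100, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.block1 = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.emb(ids).mean(dim=1)        # crude pooling, enough for a demo
        h = self.block1(h)
        h = self.block2(h)
        return self.head(h)

def hidden_at_block1(model, ids):
    """Grab the activation right after block1 for a prompt."""
    captured = {}
    hook = model.block1.register_forward_hook(lambda m, i, o: captured.update(h=o.detach()))
    model(ids)
    hook.remove()
    return captured["h"]

model = ToyLM()
pos = torch.randint(0, 100, (1, 8))          # e.g., a "desired behavior" exemplar
neg = torch.randint(0, 100, (1, 8))          # e.g., an "undesired behavior" exemplar
steer = hidden_at_block1(model, pos) - hidden_at_block1(model, neg)

def apply_steering(model, vector, alpha=4.0):
    """Add alpha * vector to block1's output on every forward pass."""
    return model.block1.register_forward_hook(lambda m, i, o: o + alpha * vector)

prompt = torch.randint(0, 100, (1, 8))
baseline = model(prompt)
handle = apply_steering(model, steer)
steered = model(prompt)
handle.remove()                               # model parameters were never modified
print("logit shift:", (steered - baseline).abs().mean().item())
```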
Authors:Louis Bradshaw, Simon Colton
Abstract:
We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.
中文: 我们通过多阶段处理流程,包括自动抓取、评分和分割音频,创建了一个包含超过一百万MIDI文件的大型数据集,并提供了详细的技术分析和元数据信息。
English: We present a large-scale dataset of over one million MIDI files transcribed from piano audio recordings using a multi-stage pipeline that includes automated crawling, scoring, and segmentation, along with detailed analysis and metadata.
Authors:Chris Dongjoo Kim, Jihwan Moon, Sangwoo Moon, Heeseung Yun, Sihaeng Lee, Aniruddha Kembhavi, Soonyoung Lee, Gunhee Kim, Sangho Lee, Christopher Clark
Abstract:
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real-time, offers a promising solution to these issues while also allowing swift adaptations in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose the Relevance and Specificity-based online filtering framework (ReSpec), which selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real-time, eliminating the need for extensive storage and compute. Evaluated on the large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at https://github.com/cdjkim/ReSpec.
Chinese: ReSpec框架通过基于模态对齐、任务相关性、特异性和处理效率实时筛选视频文本数据,显著提升了在线学习效率,在仅使用少量数据和计算的情况下,于零样本视频检索任务中实现了最优性能。
English: The ReSpec framework enhances online learning efficiency by filtering video-text data in real-time based on modality alignment, task relevance, specificity, and processing efficiency, achieving top performance on zero-shot retrieval tasks with minimal data and computation.
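A hedged sketch of the four filtering criteria listed in the ReSpec abstract, applied to one incoming video-text pair. The cosine-similarity proxies, thresholds, and the use of the text embedding for relevance and specificity are assumptions; the paper's relevance score is probabilistic and its exact formulation differs.

```python
import numpy as np

def respec_filter(video_emb, text_emb, task_refs, root_emb,
                  align_thr=0.25, rel_thr=0.3, spec_thr=0.5):
    """Return True if the incoming pair passes illustrative versions of the
    ReSpec criteria: modality alignment, task relevance, and specificity.
    Efficiency comes from the fact that only embedding comparisons are used."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # (i) modality alignment between the video and text embeddings
    if cos(video_emb, text_emb) < align_thr:
        return False
    # (ii) relevance: similarity to reference embeddings of the target task
    relevance = max(cos(text_emb, r) for r in task_refs)
    # (iii) specificity: distance to a "root" embedding of least-specific data
    specificity = np.linalg.norm(text_emb - root_emb)
    return relevance >= rel_thr and specificity >= spec_thr
```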
Authors:Ryu Tadokoro, Tsukasa Takagi, Shin-ichi Maeda
Abstract:
In semantic segmentation, model accuracy depends heavily on high-quality annotations. However, in many practical scenarios such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data includes label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. Bayesian inference requires computing the posterior distribution of label errors, which becomes intractable when spatial correlations are present. We represent the correlation of label errors between adjacent pixels through a Gaussian distribution whose covariance is structured by a Kac-Murdock-Szegö (KMS) matrix, solving the computational challenges. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet-research/Bayesian_SpatialCorr.
Chinese Summary: 该研究提出了一种基于概率模型的贝叶斯估计方法,用于处理语义分割中空间相关的标注误差,在医学影像和遥感等任务中显著提升了模型性能。
English Summary: The study introduces a Bayesian estimation method using a probabilistic model that accounts for spatially correlated label errors in semantic segmentation, significantly enhancing model performance across tasks like medical imaging and remote sensing.
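The abstract's key computational device is a covariance structured by a Kac-Murdock-Szegö (KMS) matrix. A minimal one-dimensional construction is shown below; the paper applies the idea to spatially correlated label errors on image grids, and this snippet is only a reference for the matrix itself.

```python
import numpy as np

def kms_covariance(n, rho):
    """Kac-Murdock-Szego (KMS) matrix K[i, j] = rho**|i - j|, i.e. the
    covariance of an AR(1) process along one dimension."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

K = kms_covariance(5, 0.7)
# A useful property: the KMS matrix has a closed-form tridiagonal inverse,
# which is what keeps inference with this structured covariance tractable.
```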
Authors:Bowei Zhang, Lei Ke, Adam W. Harley, Katerina Fragkiadaki
Abstract:
We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons. To handle the irregular structure of 3D point distributions, we propose a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism - a 3D-aware contextualization strategy that builds informative, spatially coherent feature neighborhoods to support precise trajectory estimation. Our 3D-centric formulation significantly improves performance over existing 3D point tracking methods and even surpasses state-of-the-art 2D pixel trackers in accuracy when reliable depth is available. The model supports inference in both camera-centric (unstabilized) and world-centric (stabilized) coordinates, with experiments showing that compensating for camera motion leads to substantial gains in tracking robustness. By replacing the conventional 2D square correlation windows used in prior 2D and 3D trackers with a spatially grounded 3D attention mechanism, TAPIP3D achieves strong and consistent results across multiple 3D point tracking benchmarks. Project Page: https://tapip3d.github.io
中文: TAPIP3D提出了一种新颖的3D点跟踪方法,通过深度和相机运动信息将视频特征稳定至三维空间,并采用3D邻域注意力机制实现长期鲁棒跟踪,在深度数据可靠时其性能超越现有3D与2D跟踪器。
English: TAPIP3D introduces a novel 3D point tracking method that stabilizes video features in 3D space using depth and camera motion, employing a 3D N2N attention mechanism to achieve robust long-term tracking and surpass both 3D and 2D trackers in performance when depth data is reliable.
Authors:Yeoreum Lee, Jinwook Jung, Sungyong Baik
Abstract:
Large-scale deep learning models with a pretraining-finetuning paradigm have led to a surge of numerous task-specific models fine-tuned from a common pre-trained model. Recently, several research efforts have been made on merging these large models into a single multi-task model, particularly with simple arithmetic on parameters. Such merging methodology faces a central challenge: interference between model parameters fine-tuned on different tasks. A few recent works have focused on designing new fine-tuning schemes that lead to small parameter interference, but at the cost of the performance of each task-specific fine-tuned model, thereby limiting that of the merged model. To improve the performance of a merged model, we note that a fine-tuning scheme should aim for (1) smaller parameter interference and (2) better performance of each fine-tuned model on the corresponding task. In this work, we aim to design a new fine-tuning objective function to work towards these two goals. In the course of this process, we find this objective function to be strikingly similar to the sharpness-aware minimization (SAM) objective, which aims to achieve generalization by finding flat minima. Drawing upon our observation, we propose to fine-tune pre-trained models via sharpness-aware minimization. The experimental and theoretical results showcase the effectiveness and orthogonality of our proposed approach, improving performance upon various merging and fine-tuning methods. Our code is available at https://github.com/baiklab/SAFT-Merge.
Chinese Summary: 本研究提出了一种利用锐度感知最小化的新微调方法,通过减少参数干扰和提升各任务模型的性能,有效增强了合并多任务模型的整体表现。
English Summary: This research introduces a new fine-tuning method using sharpness-aware minimization to enhance the performance of merged multi-task models by reducing parameter interference and improving task-specific model efficacy.
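Since the abstract proposes fine-tuning via sharpness-aware minimization before merging, the standard two-pass SAM update is sketched below for reference. The `rho` value, the plain base optimizer, and the loss interface are illustrative; this is not the authors' SAFT-Merge code.

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One sharpness-aware minimization (SAM) step: perturb the weights toward
    the locally worst-case direction, then update with the gradient taken there."""
    inputs, targets = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Pass 1: gradient at the current weights w.
    loss_fn(model(inputs), targets).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12

    # Climb to the worst-case nearby weights w + e, with e = rho * g / ||g||.
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)
    model.zero_grad()

    # Pass 2: gradient at w + e, then restore w and take the actual step.
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()
```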
Authors:Binjie Guo, Hanyu Zheng, Guowei Su, Ru Zhang, Haohan Jiang, Xurong Lin, Hongyan Wei, Aisheng Mo, Jie Li, Zhiyuan Qian, Zhuhao Zhang, Xiaoyuan Cheng
Abstract:
Recent years have witnessed significant progress in reinforcement learning, especially with Zero-like paradigms, which have greatly boosted the generalization and reasoning abilities of large-scale language models. Nevertheless, existing frameworks are often plagued by high implementation complexity and poor reproducibility. To tackle these challenges, we present AlphaZero-Edu, a lightweight, education-focused implementation built upon the mathematical framework of AlphaZero. It boasts a modular architecture that disentangles key components, enabling transparent visualization of the algorithmic processes. Additionally, it is optimized for resource-efficient training on a single NVIDIA RTX 3090 GPU and features highly parallelized self-play data generation, achieving a 3.2-fold speedup with 8 processes. In Gomoku matches, the framework has demonstrated exceptional performance, achieving a consistently high win rate against human opponents. AlphaZero-Edu has been open-sourced at https://github.com/StarLight1212/AlphaZero_Edu, providing an accessible and practical benchmark for both academic research and industrial applications.
Chinese: AlphaZero-Edu 是一个轻量级模块化框架,通过简化实现并在有限硬件上实现高效训练,提升了强化学习在教育领域的应用,在五子棋等游戏中表现出色。
English: AlphaZero-Edu is a lightweight, modular framework that enhances reinforcement learning for education by simplifying implementation and enabling efficient training on limited hardware, achieving high performance in games like Gomoku.
Authors:Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang
Abstract:
Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag
中文摘要:NoWag框架通过向量量化和剪枝技术高效压缩Llama-2与Llama-3等大语言模型,其性能超越现有方法并揭示了不同压缩范式间的内在关联。
English Summary: The NoWag framework effectively compresses large language models like Llama-2 and Llama-3 through vector quantization and pruning, outperforming existing methods and revealing commonalities in compression techniques.
Authors:Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna
Abstract:
Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
Chinese: 提出的FLOW方法和FlexCiM架构通过实现分层稀疏度优化和灵活硬件部署,克服了大语言模型中固定N:M稀疏性的限制,在准确率、延迟和能耗方面均取得显著提升。
English: The proposed FLOW method and FlexCiM architecture overcome the limitations of fixed N:M sparsity in large language models by enabling layer-wise sparsity optimization and flexible hardware deployment, achieving significant improvements in accuracy, latency, and energy efficiency.
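A minimal sketch of N:M structured sparsity, the representation that FLOW selects layer-wise: within each group of M consecutive weights, only N are kept. The plain magnitude criterion below is illustrative; FLOW's outlier-density-aware choice of (N, M) per layer is more involved.

```python
import torch

def nm_prune_mask(weight, n=2, m=4):
    """Build an N:M sparsity mask for a 2D weight: in every group of m
    consecutive weights along the input dimension, keep the n largest
    magnitudes (magnitude criterion only, for illustration)."""
    out_dim, in_dim = weight.shape
    assert in_dim % m == 0, "input dimension must be divisible by m"
    groups = weight.abs().reshape(out_dim, in_dim // m, m)
    topk = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(out_dim, in_dim)

# Usage: sparse_weight = weight * nm_prune_mask(weight, n=2, m=4)
```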
Authors:Ze Zhao, Bin Lu, Xiaoying Gan, Gu Tang, Luoyi Fu, Xinbing Wang
Abstract:
Reasoning over Knowledge Graphs (KGs) plays a pivotal role in knowledge graph completion or question answering systems, providing richer and more accurate triples and attributes. As numerical attributes become increasingly essential in characterizing entities and relations in KGs, the ability to reason over these attributes has gained significant importance. Existing graph-based methods, such as Graph Neural Networks (GNNs) and Knowledge Graph Embeddings (KGEs), primarily focus on aggregating homogeneous local neighbors and implicitly embedding diverse triples. However, these approaches often fail to fully leverage the potential of logical paths within the graph, limiting their effectiveness in exploiting the reasoning process. To address these limitations, we propose ChainsFormer, a novel chain-based framework designed to support numerical reasoning. ChainsFormer not only explicitly constructs logical chains but also expands the reasoning depth to multiple hops. Specifically, we introduce Relation-Attribute Chains (RA-Chains), a specialized type of logic chain, to model sequential reasoning patterns. ChainsFormer captures the step-by-step nature of multi-hop reasoning along RA-Chains by employing sequential in-context learning. To mitigate the impact of noisy chains, we propose a hyperbolic affinity scoring mechanism that selects relevant logic chains in a variable-resolution space. Furthermore, ChainsFormer incorporates an attention-based numerical reasoner to identify critical reasoning paths, enhancing both reasoning accuracy and transparency. Experimental results demonstrate that ChainsFormer significantly outperforms state-of-the-art methods, achieving up to a 20.0% improvement in performance. The implementations are available at https://github.com/zhaodazhuang2333/ChainsFormer.
Chinese: ChainsFormer是一种新颖的基于链的框架,通过显式构建逻辑链、扩展推理深度,并采用机制减少噪声和识别关键路径,显著提升了知识图谱上的数值推理能力,性能比现有最优方法提高了高达20.0%。
English: ChainsFormer is a novel chain-based framework that enhances numerical reasoning over Knowledge Graphs by explicitly constructing logical chains, expanding reasoning depth, and employing mechanisms to mitigate noise and identify critical paths, achieving up to a 20.0% performance improvement over state-of-the-art methods.
Authors:Jindong Li, Yongguang Li, Yali Fu, Jiahong Liu, Yixin Liu, Menglin Yang, Irwin King
Abstract:
As machine learning evolves, domain generalization (DG) and domain adaptation (DA) have become crucial for enhancing model robustness across diverse environments. Contrastive Language-Image Pretraining (CLIP) plays a significant role in these tasks, offering powerful zero-shot capabilities that allow models to perform effectively in unseen domains. However, there remains a significant gap in the literature, as no comprehensive survey currently exists that systematically explores the applications of CLIP in DG and DA, highlighting the necessity for this review. This survey presents a comprehensive review of CLIP's applications in DG and DA. In DG, we categorize methods into optimizing prompt learning for task alignment and leveraging CLIP as a backbone for effective feature extraction, both enhancing model adaptability. For DA, we examine both source-available methods utilizing labeled source data and source-free approaches primarily based on target domain data, emphasizing knowledge transfer mechanisms and strategies for improved performance across diverse contexts. Key challenges, including overfitting, domain diversity, and computational efficiency, are addressed, alongside future research opportunities to advance robustness and efficiency in practical applications. By synthesizing existing literature and pinpointing critical gaps, this survey provides valuable insights for researchers and practitioners, proposing directions for effectively leveraging CLIP to enhance methodologies in domain generalization and adaptation. Ultimately, this work aims to foster innovation and collaboration in the quest for more resilient machine learning models that can perform reliably across diverse real-world scenarios. A more up-to-date version of the papers is maintained at: https://github.com/jindongli-Ai/Survey_on_CLIP-Powered_Domain_Generalization_and_Adaptation.
中文: 本综述系统梳理了CLIP在领域泛化与自适应中的应用方法,通过分类讨论优化策略并应对过拟合、领域差异等关键挑战,旨在提升模型在多样化场景中的鲁棒性能。
English: This survey comprehensively reviews CLIP's applications in domain generalization and adaptation, categorizing methods and addressing challenges like overfitting and domain diversity to enhance model robustness across diverse environments.
Authors:Wenxin Zhang, Cuicui Luo
Abstract:
Time series anomaly detection is crucial for maintaining stable systems. Existing methods face two main challenges. First, it is difficult to directly model the dependencies of diverse and complex patterns within the sequences. Second, many methods that optimize parameters using mean squared error struggle with noise in the time series, leading to performance deterioration. To address these challenges, we propose a transformer-based framework built on decomposition (TransDe) for multivariate time series anomaly detection. The key idea is to combine the strengths of time series decomposition and transformers to effectively learn the complex patterns in normal time series data. A multi-scale patch-based transformer architecture is proposed to exploit the representative dependencies of each decomposed component of the time series. Furthermore, a contrastive learning paradigm based on patch operations is proposed, which leverages KL divergence to align the positive pairs, namely the pure representations of normal patterns between different patch-level views. A novel asynchronous loss function with a stop-gradient strategy is further introduced to enhance the performance of TransDe effectively, avoiding time-consuming and labor-intensive computation in the optimization process. Extensive experiments on five public datasets are conducted and TransDe shows superiority compared with twelve baselines in terms of F1 score. Our code is available at https://github.com/shaieesss/TransDe.
中文: 提出的TransDe框架结合时间序列分解与Transformer架构,通过多尺度补丁转换器和对比学习范式有效解决复杂模式依赖建模和噪声敏感问题,在多元时间序列异常检测中表现出优越性能。
English: The proposed TransDe framework combines time series decomposition with transformers and contrastive learning to effectively detect anomalies in multivariate time series by addressing pattern dependency modeling and noise sensitivity issues.
Authors:Wenxin Zhang, Jingxing Zhong, Guangzhen Yao, Renda Han, Xiaojian Lin, Zeyu Zhang, Cuicui Luo
Abstract:
Fraudulent activities have significantly increased across various domains, such as e-commerce, online review platforms, and social networks, making fraud detection a critical task. Spatial Graph Neural Networks (GNNs) have been successfully applied to fraud detection tasks due to their strong inductive learning capabilities. However, existing spatial GNN-based methods often enhance the graph structure by excluding heterophilic neighbors during message passing to align with the homophilic bias of GNNs. Unfortunately, this approach can disrupt the original graph topology and increase uncertainty in predictions. To address these limitations, this paper proposes a novel framework, Dual-channel Heterophilic Message Passing (DHMP), for fraud detection. DHMP leverages a heterophily separation module to divide the graph into homophilic and heterophilic subgraphs, mitigating the low-pass inductive bias of traditional GNNs. It then applies shared weights to capture signals at different frequencies independently and incorporates a customized sampling strategy for training. This allows nodes to adaptively balance the contributions of various signals based on their labels. Extensive experiments on three real-world datasets demonstrate that DHMP outperforms existing methods, highlighting the importance of separating signals with different frequencies for improved fraud detection. The code is available at https://github.com/shaieesss/DHMP.
中文: 本文提出双通道异质信息传递(DHMP)框架,通过分离图中的同质与异质信号来克服传统空间图神经网络的局限,在真实数据集上实现了更优的欺诈检测性能。
English: This paper introduces the Dual-channel Heterophilic Message Passing (DHMP) framework, which improves fraud detection by separating homophilic and heterophilic signals in graphs to overcome the limitations of traditional spatial GNNs, achieving superior performance on real-world datasets.
Authors:Wenxin Zhang, Xiaojian Lin, Wenjun Yu, Guangzhen Yao, Jingxiang Zhong, Yu Li, Renda Han, Songcheng Xu, Hao Shi, Cuicui Luo
Abstract:
Time series anomaly detection holds notable importance for risk identification and fault detection across diverse application domains. Unsupervised learning methods have become popular because they do not require labels. However, due to the challenges posed by the multiplicity of abnormal patterns, the sparsity of anomalies, and the growth of data scale and complexity, these methods often fail to capture robust and representative dependencies within the time series for identifying anomalies. To enhance the ability of models to capture normal patterns in time series and to avoid the degradation of modeling ability caused by dependence on high-quality prior knowledge, we propose a differencing-based contrastive representation learning framework for time series anomaly detection (DConAD). Specifically, DConAD generates differential data to provide additional information about the time series and utilizes a transformer-based architecture to capture spatiotemporal dependencies, which enhances the robustness of unbiased representation learning. Furthermore, DConAD implements a novel KL divergence-based contrastive learning paradigm that only uses positive samples to avoid deviation from reconstruction, and deploys a stop-gradient strategy to ensure convergence. Extensive experiments on five public datasets show the superiority and effectiveness of DConAD compared with nine baselines. The code is available at https://github.com/shaieesss/DConAD.
中文: 提出的DConAD框架通过生成差分数据来捕捉稳健的时空依赖关系,并采用仅使用正样本的对比学习范式,有效提升了时间序列异常检测性能,在多个数据集上展现出优越性。
English: The proposed DConAD framework enhances time series anomaly detection by generating differential data to capture robust spatiotemporal dependencies and implementing a contrastive learning paradigm using only positive samples, demonstrating superior performance across multiple datasets.
Authors:Zekai Chen, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang
Abstract:
As a new distributed graph learning paradigm, Federated Graph Learning (FGL) facilitates collaborative model training across local systems while preserving data privacy.
We review existing FGL approaches and categorize their optimization mechanisms into:
(1) Server-Client (S-C), where clients upload local model parameters for server-side aggregation and global updates;
(2) Client-Client (C-C), which allows clients to exchange information directly and to customize their local training processes.
We reveal that C-C shows superior potential due to its refined communication structure.
However, existing C-C methods broadcast redundant node representations, incurring high communication costs and privacy risks at the node level. To this end, we propose FedC4, which combines graph Condensation with C-C Collaboration optimization. Specifically, FedC4 employs a graph condensation technique to refine the knowledge of each client's graph into a few synthetic embeddings instead of transmitting node-level knowledge. Moreover, FedC4 introduces three novel modules that allow the source client to send distinct node representations tailored to the target client's graph properties. Experiments on eight public real-world datasets show that FedC4 outperforms state-of-the-art baselines in both task performance and communication cost. Our code is now available at https://github.com/Ereshkigal1/FedC4.
中文摘要:联邦图学习(FGL)在保护数据隐私的同时实现协同训练,而提出的FedC4方法通过图压缩技术优化客户端间协作,有效降低通信成本与隐私风险,在性能和效率上均优于现有方法。
English Summary: Federated Graph Learning (FGL) enables collaborative training while preserving data privacy, and the proposed FedC4 method enhances Client-Client optimization by using graph condensation to reduce communication costs and privacy risks, outperforming existing methods in both performance and efficiency.
Authors:Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
Abstract:
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
中文: 本研究提出了一种新颖的特征上采样方法,通过基于坐标的交叉注意力变换器和自蒸馏伪地面实况特征来提高视觉基础模型的分辨率,在像素级任务中显著优于现有技术。
English: This study introduces a novel feature upsampling method using a coordinate-based cross-attention transformer and self-distilled pseudo-groundtruth features to enhance the resolution of vision foundation models, significantly outperforming existing techniques in pixel-level tasks.
Authors:Mehmet Yamaç, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj
Abstract:
Multilayer perceptrons (MLP), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications. The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at https://github.com/mehmetyamac/MTSNet.
中文摘要:本文提出的多尺度张量求和分解作为一种新型神经网络层,在减少参数的同时,在多种计算机视觉任务中展现出优于传统全连接层和卷积网络的性能。
English summary: This paper introduces Multiscale Tensor Summation (MTS) Factorization, a novel neural network layer that reduces parameters while outperforming traditional MLPs and CNNs across various computer vision tasks.
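The MTS summands are built from Tucker-decomposition-like mode products. A generic, textbook mode-n product is sketched below (not the released MTSNet layer), followed by a single three-mode summand; the tensor shapes and factor matrices are illustrative.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Tucker-style mode-n product: contract `matrix` (J x I_mode) with the
    given mode of `tensor`, returning a tensor whose `mode`-th size is J."""
    moved = np.moveaxis(tensor, mode, 0)                 # bring the mode to front
    shape = moved.shape
    flat = moved.reshape(shape[0], -1)                   # unfold along that mode
    out = (matrix @ flat).reshape((matrix.shape[0],) + shape[1:])
    return np.moveaxis(out, 0, mode)

# One MTS-style summand applies a factor matrix per mode; the layer output is
# a sum of such terms computed at multiple scales.
X = np.random.randn(8, 8, 3)
U0, U1, U2 = np.random.randn(8, 8), np.random.randn(8, 8), np.random.randn(3, 3)
Y = mode_product(mode_product(mode_product(X, U0, 0), U1, 1), U2, 2)
```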
Authors:Chao Yang, Xiannan Huang, Shuhan Qiu, Yan Cheng
Abstract:
Accurate short-term traffic demand prediction is critical for the operation of traffic systems. Besides point estimation, the confidence interval of the prediction is also of great importance. Many models for traffic operations, such as shared bike rebalancing and taxi dispatching, take into account the uncertainty of future demand and require confidence intervals as the input. However, existing methods for confidence interval modeling rely on strict assumptions, such as unchanging traffic patterns and correct model specifications, to guarantee enough coverage. Therefore, the confidence intervals provided could be invalid, especially in a changing traffic environment. To fill this gap, we propose an efficient method, CONTINA (Conformal Traffic Intervals with Adaptation), to provide interval predictions that can adapt to external changes. By collecting the interval errors during deployment, the method can adjust the interval in the next step by widening it if the errors are too large or shortening it otherwise. Furthermore, we theoretically prove that the coverage of the confidence intervals provided by our method converges to the target coverage level. Experiments across four real-world datasets and prediction models demonstrate that the proposed method can provide valid confidence intervals with shorter lengths. Our method can help traffic management personnel develop a more reasonable and robust operation plan in practice. We release the code, model, and dataset on GitHub: https://github.com/xiannanhuang/CONTINA/.
中文: 所提出的CONTINA方法能够自适应调整交通需求预测区间,在动态环境中确保置信度覆盖的有效性,通过真实数据集验证表明其能以更短的区间长度提供更可靠的预测结果。
English: The proposed method CONTINA adaptively adjusts traffic demand prediction intervals to ensure valid confidence coverage in changing environments, outperforming existing approaches by providing shorter yet more reliable intervals across real-world datasets.
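A sketch of the widen/shrink rule described in the CONTINA abstract, written in the style of adaptive conformal inference: the interval is widened after a miss and shortened otherwise, so empirical coverage tracks the target level. The initial half-width `q0`, step size `gamma`, and symmetric-interval form are illustrative assumptions rather than the paper's exact algorithm.

```python
def adaptive_intervals(point_preds, actuals, q0=10.0, alpha=0.1, gamma=0.5):
    """Produce prediction intervals online, adapting the half-width q after
    each observation: widen on miscoverage, shrink on coverage."""
    q = q0
    intervals = []
    for pred, y in zip(point_preds, actuals):
        intervals.append((pred - q, pred + q))
        miss = 1.0 if (y < pred - q or y > pred + q) else 0.0
        q += gamma * (miss - alpha)      # miscoverage above alpha -> widen
        q = max(q, 0.0)
    return intervals
```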
Authors:Zhongxi Qiu, Zhang Zhang, Yan Hu, Heng Li, Jiang Liu
Abstract:
This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. The code is available at our repository: https://github.com/Qsingle/open-medical-r1.
中文摘要:本研究探讨了医疗领域中验证奖励强化学习的最佳数据选择策略,发现经过筛选的训练数据通常能提升模型性能,而自筛选样本在医疗领域表现优异但会降低跨领域鲁棒性。
English Summary: This study examines optimal data selection methods for Reinforcement Learning with Verified Rewards in medical applications, finding that filtered training data generally enhances performance while self-filtered samples excel in medical domains but reduce cross-domain robustness.
Authors:Zhanglin Wu, Tengfei Song, Ning Xie, Mengli Zhu, Weidong Zhang, Shuang Wu, Pengfei Li, Chong Li, Junhao Zhu, Hao Yang, Shiliang Sun
Abstract:
The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, while the evaluation of their ability to understand long texts with complex layout design is highly significant but largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.
中文: MOTBench是一个专门评估大型视觉语言模型的框架,用于测试其准确识别和翻译具有复杂布局及文化特定元素的菜单的能力,为未来发展提供指导。
English: MOTBench is a specialized evaluation framework for large vision-language models that assesses their ability to accurately recognize and translate complex menu layouts with culturally specific elements, providing insights for future development.
Authors:Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao
Abstract:
App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with longer steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first visual world model designed to generate future App observations as images. To address the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcome of different action options. Experiments show ViMo's ability to generate visually plausible and functionally effective GUIs that enable App agents to make more informed decisions.
中文摘要:ViMo是一种创新的视觉世界模型,通过分离图形与文本生成未来应用界面图像,使应用代理能通过更优规划做出更明智决策。
English Summary: ViMo is a novel visual world model that generates future app interface images by separating graphics and text, enabling app agents to make better decisions through improved planning.
Authors:Deyu Cao, Samin Aref
Abstract:
The growing use of large language models has raised environmental and economic concerns about their intensity of resource usage during inference. Serving these models to each user requires substantial energy and water for cooling. Model compression techniques like quantization can shrink large language models and make them more resource efficient at the cost of potential performance degradation. Quantization methods compress model size through replacing their high-precision parameters by quantized values of lower precision. Among existing methods, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ's level. First, we look into combining existing quantization-aware training techniques with ApiQ's partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining is unlikely to be feasible through partial training. (2) This gain may depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. This publicly available method relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces the ApiQ's accuracy degradation by 10.85% and 7.54% respectively. A Python implementation of the proposed quantization method is publicly available on GitHub https://github.com/TokuyuSou/ULB-SAPR.
中文: 本文提出了一种新型超低位量化方法,通过显著性感知正则化在无需完全重新训练的情况下,将大型语言模型的资源效率提升的同时,相比ApiQ方法减少了超过7%的精度损失。
English: A novel ultra-low-bit quantization method with saliency-aware regularization is proposed to enhance resource efficiency of large language models while reducing accuracy degradation by over 7% compared to ApiQ, without requiring full retraining.
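A hedged sketch of a saliency-aware regularization term of the kind the abstract describes: quantization error is penalized more heavily on parameters judged impactful. The squared-error form, the activation-based saliency proxy, and `lam` are assumptions, not the paper's exact formulation.

```python
import torch

def saliency_regularizer(weight, quantized_weight, saliency, lam=1e-3):
    """Weighted quantization-error penalty: larger saliency -> stronger pull
    of the quantized weight toward the original value.
    weight, quantized_weight: (out, in); saliency broadcasts over rows."""
    err = (weight - quantized_weight) ** 2
    return lam * (saliency * err).sum()

# Example saliency proxy from calibration activations X of shape (tokens, in):
#   saliency = X.abs().mean(dim=0)   # per-input-channel importance (assumed)
```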
Authors:Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry
Abstract:
Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2 .
中文摘要:提出的AT2方法通过将注意力权重作为特征进行学习,实现了语言模型中词元影响的高效归因,其性能与耗时的消融方法相当,同时显著提升了计算效率。
English Summary: The proposed AT2 method efficiently attributes token influence in language models by learning to use attention weights as features, achieving performance comparable to costly ablation-based approaches while significantly improving computational efficiency.
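A hedged sketch of the AT2 idea of treating per-head attention weights as features and learning, from a modest number of measured ablation effects, how to combine them into attribution scores. The ridge-regression learner and the feature layout are illustrative choices, not necessarily what the paper uses.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_attention_attributor(attention_features, ablation_effects):
    """attention_features: (n_examples, n_heads) attention mass a generation
    places on a source token, one row per (generation, token) example.
    ablation_effects: (n_examples,) measured effect of ablating that token."""
    return Ridge(alpha=1.0).fit(attention_features, ablation_effects)

def attribute(model, attention_features):
    """Score source tokens from attention alone, with no further ablations."""
    return model.predict(np.asarray(attention_features))
```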
Authors:Paul K. Mandal, Cole Leo, Connor Hurley
Abstract:
Open-source intelligence provides a stream of unstructured textual data that can inform assessments of territorial control. We present CONTACT, a framework for territorial control prediction using large language models (LLMs) and minimal supervision. We evaluate two approaches: SetFit, an embedding-based few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a multilingual generative LLM. Our model is trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. We show that the BLOOMZ-based model outperforms the SetFit baseline, and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned using few-shot methods can reduce annotation burdens and support structured inference from open-ended OSINT streams. Our code is available at https://github.com/PaulKMandal/CONTACT/.
中文摘要:CONTACT框架利用大型语言模型和最少监督从开源情报中预测领土控制情况,研究表明基于提示调优的BLOOMZ模型在低资源环境下优于小样本分类器,并能有效降低标注需求。
English Summary: The CONTACT framework utilizes large language models with minimal supervision to predict territorial control from open-source intelligence, demonstrating that prompt-tuned BLOOMZ outperforms few-shot classifiers in low-resource settings while reducing annotation needs.
Authors:Samuel Wertz, Arnaud Vandaele, Nicolas Gillis
Abstract:
The Hadamard decomposition is a powerful technique for data analysis and matrix compression, which decomposes a given matrix into the element-wise product of two or more low-rank matrices. In this paper, we develop an efficient algorithm to solve this problem, leveraging an alternating optimization approach that decomposes the global non-convex problem into a series of convex sub-problems. To improve performance, we explore advanced initialization strategies inspired by the singular value decomposition (SVD) and incorporate acceleration techniques by introducing momentum-based updates. Beyond optimizing the two-matrix case, we also extend the Hadamard decomposition framework to support more than two low-rank matrices, enabling approximations with higher effective ranks while preserving computational efficiency. Finally, we conduct extensive experiments to compare our method with the existing gradient descent-based approaches for the Hadamard decomposition and with traditional low-rank approximation techniques. The results highlight the effectiveness of our proposed method across diverse datasets.
Chinese: 本文提出了一种高效的交替优化算法用于哈达玛分解,通过SVD启发的初始化和动量加速技术将其扩展至多矩阵分解,实验证明该方法在不同数据集上优于现有技术。
English: This paper introduces an efficient alternating optimization algorithm for Hadamard decomposition, extending it to multiple matrices with SVD-inspired initialization and momentum acceleration, demonstrating superior performance over existing methods in experiments.
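A minimal alternating least-squares sketch for the two-factor Hadamard decomposition M ≈ (W1 H1) ∘ (W2 H2); each block update is a convex least-squares problem, echoing the alternating scheme in the abstract. The row-wise lstsq solves, random initialization, and fixed iteration count are illustrative and omit the paper's SVD-inspired initialization and momentum acceleration.

```python
import numpy as np

def hadamard_decomposition(M, r1, r2, n_iter=100, seed=0):
    """Approximate M by the elementwise product of two low-rank matrices,
    (W1 @ H1) * (W2 @ H2), via block coordinate descent."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    W1, H1 = rng.standard_normal((m, r1)), rng.standard_normal((r1, n))
    W2, H2 = rng.standard_normal((m, r2)), rng.standard_normal((r2, n))

    def solve_left(Mat, H, P):
        """argmin_W || Mat - (W @ H) * P ||_F, solved row by row (convex)."""
        W = np.zeros((Mat.shape[0], H.shape[0]))
        for i in range(Mat.shape[0]):
            A = (H * P[i]).T                      # (cols, rank) design matrix
            W[i] = np.linalg.lstsq(A, Mat[i], rcond=None)[0]
        return W

    for _ in range(n_iter):
        W1 = solve_left(M, H1, W2 @ H2)
        H1 = solve_left(M.T, W1.T, (W2 @ H2).T).T   # same solve, transposed
        W2 = solve_left(M, H2, W1 @ H1)
        H2 = solve_left(M.T, W2.T, (W1 @ H1).T).T
    return W1, H1, W2, H2
```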
Authors:Yipeng Sun, Linda-Sophie Schneider, Mingxuan Gu, Siyuan Mei, Chengze Ye, Fabian Wagner, Siming Bayer, Andreas Maier
Abstract:
Effective denoising is crucial in low-dose CT to enhance subtle structures and low-contrast lesions while preventing diagnostic errors. Supervised methods struggle with limited paired datasets, and self-supervised approaches often require multiple noisy images and rely on deep networks like U-Net, offering little insight into the denoising mechanism. To address these challenges, we propose an interpretable self-supervised single-image denoising framework -- Filter2Noise (F2N). Our approach introduces an Attention-Guided Bilateral Filter that adapts to each noisy input through a lightweight module predicting spatially varying filter parameters, which can be visualized and adjusted post-training for user-controlled denoising in specific regions of interest. To enable single-image training, we introduce a novel downsampling shuffle strategy with a new self-supervised loss function that extends the concept of Noise2Noise to a single image and addresses spatially correlated noise. On the Mayo Clinic 2016 low-dose CT dataset, F2N outperforms the leading self-supervised single-image method (ZS-N2N) by 4.59 dB PSNR while improving transparency, user control, and parametric efficiency. These features provide key advantages for medical applications that require precise and interpretable noise reduction. Our code is available at https://github.com/sypsyp97/Filter2Noise.git.
中文摘要:提出的Filter2Noise框架通过引入可调节参数的注意力引导双边滤波器,实现了可解释的单图像自监督CT去噪,在性能和透明度方面均优于现有方法。
English Summary: The proposed Filter2Noise framework introduces an interpretable self-supervised method for single-image CT denoising using an attention-guided bilateral filter with adjustable parameters, achieving superior performance and transparency compared to existing approaches.
Authors:Juliette Bertrand, Sophie Giffard-Roisin, James Hollingsworth, Julien Mairal
Abstract:
Dense ground displacement measurements are crucial for geological studies but are impractical to collect directly. Traditionally, displacement fields are estimated using patch matching on optical satellite images from different acquisition times. While deep learning-based optical flow models are promising, their adoption in ground deformation analysis is hindered by challenges such as the absence of real ground truth, the need for sub-pixel precision, and temporal variations due to geological or anthropogenic changes. In particular, we identify that deep learning models relying on explicit correlation layers struggle at estimating small displacements in real-world conditions. Instead, we propose a model that employs iterative refinements with explicit warping layers and a correlation-independent backbone, enabling sub-pixel precision. Additionally, a non-convex variant of Total Variation regularization preserves fault-line sharpness while maintaining smoothness elsewhere. Our model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and generalizes well to challenging real-world scenarios captured by both medium- and high-resolution sensors. Project page: https://jbertrand89.github.io/microflow/.
Authors:Saksham Rastogi, Pratyush Maini, Danish Pruthi
Abstract:
Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership, i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and utility of the original data. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
中文摘要:STAMP框架通过生成带独特水印的多个改写版本,并比较公开与私有版本间的模型似然度,使数据创建者能够检测其内容是否被用于大型语言模型的预训练数据中。
English Summary: The STAMP framework enables data creators to detect if their content was used in training large language models by watermarking multiple rephrased versions and comparing model likelihoods between public and private copies.
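A sketch of the paired statistical test at the core of STAMP: if the released (watermarked) rephrasings were in the pretraining data, the model should assign them systematically higher likelihoods than the held-back private rephrasings of the same content. The paired t-test with a one-sided p-value is an illustrative choice of test, not necessarily the one used in the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

def membership_test(loglik_public, loglik_private):
    """Paired, one-sided test of whether the public rephrasings receive higher
    model log-likelihoods than their private counterparts."""
    loglik_public = np.asarray(loglik_public)
    loglik_private = np.asarray(loglik_private)
    stat, p_two_sided = ttest_rel(loglik_public, loglik_private)
    # One-sided p-value for the alternative "public > private".
    p = p_two_sided / 2 if stat > 0 else 1 - p_two_sided / 2
    return stat, p
```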
Authors:Jiang-Xin Shi, Tong Wei, Yu-Feng Li
Abstract:
The fine-tuning paradigm has emerged as a prominent approach for addressing long-tail learning tasks in the era of foundation models. However, the impact of fine-tuning strategies on long-tail learning performance remains unexplored. In this work, we disclose that existing paradigms exhibit a profound misuse of fine-tuning methods, leaving significant room for improvement in both efficiency and accuracy. Specifically, we reveal that heavy fine-tuning (fine-tuning a large proportion of model parameters) can lead to non-negligible performance deterioration on tail classes, whereas lightweight fine-tuning demonstrates superior effectiveness. Through comprehensive theoretical and empirical validation, we identify this phenomenon as stemming from inconsistent class conditional distributions induced by heavy fine-tuning. Building on this insight, we propose LIFT+, an innovative lightweight fine-tuning framework to optimize consistent class conditions. Furthermore, LIFT+ incorporates semantic-aware initialization, minimalist data augmentation, and test-time ensembling to enhance adaptation and generalization of foundation models. Our framework provides an efficient and accurate pipeline that facilitates fast convergence and model compactness. Extensive experiments demonstrate that LIFT+ significantly reduces both training epochs (from $\sim$100 to $\leq$15) and learned parameters (less than 1%), while surpassing state-of-the-art approaches by a considerable margin. The source code is available at https://github.com/shijxcs/LIFT-plus.
中文: 该研究揭示过度微调会损害尾部类别性能,并提出LIFT+轻量框架,通过大幅减少训练轮次和参数,显著提升模型效率与精度。
English: The study reveals that heavy fine-tuning impairs performance on tail classes, while proposing LIFT+, a lightweight framework that enhances efficiency and accuracy by reducing training epochs and parameters significantly.
Authors:Zichao Yue, Chenhui Deng, Zhiru Zhang
Abstract:
Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15$\times$ over the PP-GNN baselines, with speedup of up to 2 orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at https://github.com/cornell-zhang/preprop-gnn.
中文: 本文系统分析了预传播图神经网络的理论优势,发现数据加载和输入扩展是实际瓶颈,并提出优化方案使训练吞吐量比基线平均提升15倍,在大型图基准上比基于采样的方法快达两个数量级。
English: This paper characterizes pre-propagation GNNs (PP-GNNs) as theoretically addressing neighbor explosion but identifies data loading and input expansion as practical bottlenecks, proposing optimizations that achieve an average 15× throughput improvement over PP-GNN baselines and orders-of-magnitude speedup versus sampling-based GNNs.
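For reference, the pre-propagation idea the paper characterizes, in an SGC/SIGN-style sketch: feature propagation over the graph is done once offline, so training reduces to mini-batched MLP updates with no neighbor sampling. The hop concatenation is one common variant and the code is not the authors' optimized pipeline.

```python
import torch

def precompute_propagated_features(adj_norm, x, num_hops=3):
    """adj_norm: normalized adjacency as a torch sparse tensor; x: (N, d) node
    features. Returns (N, (num_hops + 1) * d) with [X, AX, A^2 X, ...]."""
    feats = [x]
    for _ in range(num_hops):
        feats.append(torch.sparse.mm(adj_norm, feats[-1]))   # A^k X, offline
    return torch.cat(feats, dim=1)

# Training then only reads fixed rows of this matrix, which is why the data
# loading path becomes the throughput bottleneck the abstract identifies.
```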
Authors:Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou
Abstract:
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
Chinese: 本文提出的领域影响感知数据采样(DIDS)方法通过梯度聚类确保领域内一致性,并利用费舍尔信息矩阵度量领域影响,在实验中实现了3.4%的性能提升,同时保持训练效率。
English: The paper introduces Domain Impact-aware Data Sampling (DIDS), which optimizes domain sampling for large language models by using gradient clustering for intra-domain consistency and a Fisher Information Matrix metric to measure domain impact, achieving a 3.4% performance boost in experiments.
Authors:Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
Abstract:
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models. https://github.com/facebookresearch/perception_models
中文: 本文提出完全开放可复现的感知语言模型框架,避免使用专有模型蒸馏,发布280万人工标注视频数据以弥补关键数据缺口,并建立PLM-VideoBench评估体系来推动透明化的视频理解研究。
English: This paper introduces a fully open and reproducible Perception Language Model framework that avoids proprietary model distillation, releases 2.8M human-labeled video annotations to address data gaps, and establishes PLM-VideoBench for comprehensive video understanding evaluation.
Authors:Evan Casey, Tianyu Zhang, Shu Ishida, William P. McCarthy, John Roger Thompson, Amir Khasahmadi, Joseph George Lambourne, Pradeep Kumar Jayaraman, Karl D. D. Willis
Abstract:
We adapt alignment techniques from reasoning LLMs to the task of generating engineering sketch constraints found in computer-aided design (CAD) models. Engineering sketches consist of geometric primitives (e.g. points, lines) connected by constraints (e.g. perpendicular, tangent) that define the relationships between them. For a design to be easily editable, the constraints must effectively capture design intent, ensuring the geometry updates predictably when parameters change. Although current approaches can generate CAD designs, an open challenge remains to align model outputs with design intent; we label this problem 'design alignment'. A critical first step towards aligning generative CAD models is to generate constraints which fully-constrain all geometric primitives, without over-constraining or distorting sketch geometry. Using alignment techniques to train an existing constraint generation model with feedback from a constraint solver, we are able to fully-constrain 93% of sketches compared to 34% when using a naive supervised fine-tuning (SFT) baseline and only 8.9% without SFT. Our approach can be applied to any existing constraint generation model and sets the stage for further research bridging alignment strategies between the language and design domains. Additional results can be found at https://autodeskailab.github.io/aligning-constraint-generation/.
Authors:Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie
Abstract:
We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains human-comparable performance, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.
Authors:Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
Abstract:
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
中文摘要:VistaDPO是一种新颖的视频层次时空直接偏好优化框架,通过实例、时序和感知三个层级优化文本-视频偏好对齐,并构建了包含7200对问答数据的新数据集,有效解决了大型视频模型中的视频-语言错位和幻觉问题,显著提升了多项基准测试性能。
English Summary: VistaDPO is a novel framework that addresses video-language misalignment and hallucination in Large Video Models by optimizing text-video preference alignment across instance, temporal, and perceptive levels, using a newly constructed 7.2K QA dataset to significantly enhance model performance on multiple benchmarks.
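VistaDPO's hierarchical alignment builds on direct preference optimization; the sketch below shows only the standard DPO objective over (chosen, rejected) responses, which is the building block applied at the instance, temporal, and perceptive levels. The video conditioning and hierarchical weighting are omitted, and the toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on (chosen, rejected) response log-probs under the
    policy and a frozen reference model; VistaDPO applies preference terms like
    this at instance, temporal, and perceptive levels (omitted here)."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy summed log-probabilities of a chosen and a rejected answer for 4 examples.
logp_c = torch.tensor([-12.0, -9.5, -15.1, -8.0], requires_grad=True)
logp_r = torch.tensor([-13.2, -11.0, -14.0, -10.5], requires_grad=True)
ref_c = torch.tensor([-12.5, -10.0, -14.8, -8.4])
ref_r = torch.tensor([-12.9, -10.8, -14.2, -10.1])
loss = dpo_loss(logp_c, logp_r, ref_c, ref_r)
loss.backward()
print(float(loss))
```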
Authors:Kumar Manas, Christian Schlauch, Adrian Paschke, Christian Wirth, Nadja Klein
Abstract:
Deep learning-based trajectory prediction models have demonstrated promising capabilities in capturing complex interactions. However, their out-of-distribution generalization remains a significant challenge, particularly due to imbalanced data and insufficient data volume and diversity to ensure robustness and calibration. To address this, we propose SHIFT (Spectral Heteroscedastic Informed Forecasting for Trajectories), a novel framework that uniquely combines well-calibrated uncertainty modeling with informative priors derived through automated rule extraction. SHIFT reformulates trajectory prediction as a classification task and employs heteroscedastic spectral-normalized Gaussian processes to effectively disentangle epistemic and aleatoric uncertainties. We learn informative priors from training labels, which are automatically generated from natural language driving rules, such as stop rules and drivability constraints, using a retrieval-augmented generation framework powered by a large language model. Extensive evaluations over the nuScenes dataset, including challenging low-data and cross-location scenarios, demonstrate that SHIFT outperforms state-of-the-art methods, achieving substantial gains in uncertainty calibration and displacement metrics. In particular, our model excels in complex scenarios, such as intersections, where uncertainty is inherently higher. Project page: https://kumarmanas.github.io/SHIFT/.
Authors:Ruizhe Chen, Dongyu Xue, Xiangxin Zhou, Zaixiang Zheng, Xiangxiang Zeng, Quanquan Gu
Abstract:
Proteins typically exist in complexes, interacting with other proteins or biomolecules to perform their specific biological roles. Single-chain protein modeling has been explored extensively, with advancements seen in models like the ESM series and AlphaFold2. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it achieves enhanced performance through supervised fine-tuning (SFT) while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results. We released our code at https://github.com/bytedance/apm.
中文: 我们推出了APM模型,专为多链蛋白质建模设计,通过整合原子级数据精确模拟相互作用并设计功能性蛋白质复合物,在多项任务中取得了领先成果。
English: We introduce APM, a model designed for multi-chain protein modeling that integrates atom-level data to accurately model interactions and design functional protein complexes, achieving state-of-the-art results in various tasks.
Authors:Linkang Du, Zheng Zhu, Min Chen, Zhou Su, Shouling Ji, Peng Cheng, Jiming Chen, Zhikun Zhang
Abstract:
Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist's work and fine-tuning the model, leading to concerns about artworks' copyright infringement. To tackle these issues, previous studies either add visually imperceptible perturbation to the artwork to change its underlying styles (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has been published online, i.e., modification to the original artwork or model retraining is not feasible, these strategies might not be viable.
To this end, we propose a novel method for data-use auditing in text-to-image generation models. The general idea of ArtistAuditor is to identify if a suspicious model has been finetuned using the artworks of specific artists by analyzing the features related to the style. Concretely, ArtistAuditor employs a style extractor to obtain the multi-granularity style representations and treats artworks as samplings of an artist's style. Then, ArtistAuditor queries a trained discriminator to gain the auditing decisions. The experimental results on six combinations of models and datasets show that ArtistAuditor can achieve high AUC values (> 0.937). By studying ArtistAuditor's transferability and core modules, we provide valuable insights into the practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases on the online platform Scenario. ArtistAuditor is open-sourced at https://github.com/Jozenn/ArtistAuditor.
中文:ArtistAuditor是一种新颖的数据使用审计方法,通过分析多粒度风格表征来识别文本到图像模型是否使用了特定艺术家的作品进行微调,在实验和实际应用中均展现出高效能。
English: ArtistAuditor is a novel data-use auditing method that identifies if a text-to-image model has been fine-tuned using specific artists' works by analyzing multi-granularity style representations, achieving high effectiveness in experiments and real-world applications.
Authors:Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko
Abstract:
Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance because of high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, that are essential for efficient low-precision computations. In this paper, we introduce Tilus, a domain-specific language designed for General-Purpose GPU (GPGPU) computing that supports low-precision data types with arbitrary bit widths from 1 to 8 while maintaining GPU programmability. Tilus features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. Tilus programs are compiled into highly efficient GPU programs through automatic vectorization and instruction selection. Extensive experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels. Compared to existing compilers such as Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, Tilus achieves performance improvements of: $1.75\times$, $2.61\times$, $1.29\times$ and $1.03\times$, respectively. We open-source Tilus at https://github.com/NVIDIA/tilus.
中文:Tilus是一种面向通用GPU计算的领域特定语言,支持1至8位任意位宽的低精度计算,通过优化GPU编程显著超越了现有解决方案的性能。
English: Tilus is a domain-specific language for GPGPU computing that enables efficient low-precision computation with arbitrary bit widths from 1 to 8, outperforming existing solutions through optimized GPU programming.
Authors:Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson
Abstract:
Neural architecture search (NAS) faces a challenge in balancing the exploration of expressive, broad search spaces that enable architectural innovation with the need for efficient evaluation of architectures to effectively search such spaces. We investigate surrogate model training for improving search in highly expressive NAS search spaces based on context-free grammars. We show that i) surrogate models trained either using zero-cost-proxy metrics and neural graph features (GRAF) or by fine-tuning an off-the-shelf LM have high predictive power for the performance of architectures both within and across datasets, ii) these surrogates can be used to filter out bad architectures when searching on novel datasets, thereby significantly speeding up search and achieving better final performances, and iii) the surrogates can be further used directly as the search objective for huge speed-ups.
Authors:Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
Abstract:
Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, utilizing that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
Chinese: 本文提出一种算法,将卷积神经网络中的多义通道分解为多个单一概念通道,通过基于不同激活模式重构权重来提高模型可解释性。
English: This paper introduces an algorithm to disentangle polysemantic channels in CNNs into distinct concept-specific channels, enhancing interpretability by restructuring weights based on unique activation patterns.
Authors:Leyang Li, Shilin Lu, Yan Ren, Adams Wai-Kin Kong
Abstract:
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
Chinese Summary: 本文提出ANT微调框架,通过自动引导去噪轨迹避免生成有害内容,在单概念与多概念消除任务中均达到最优性能,且不损害图像生成质量。
English Summary: The paper introduces ANT, a finetuning framework that automatically guides denoising trajectories to prevent harmful content generation in text-to-image models, achieving state-of-the-art performance in both single and multi-concept erasure without compromising image quality.
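The key insight stated in the abstract, keeping ordinary classifier-free guidance in the early, high-noise steps and reversing the condition direction in mid-to-late steps, can be illustrated with a toy guidance function. The switch point, guidance scale, and tensor shapes below are assumptions for illustration; this is not the paper's trajectory-aware training objective.

```python
import torch

def guided_noise(eps_uncond, eps_cond, t, T, w=7.5, switch=0.5):
    """Classifier-free guidance with the condition direction reversed in the
    mid-to-late denoising stages (t/T below `switch`), as a toy illustration of
    ANT's insight. Early, high-noise steps keep normal guidance so the sample's
    overall structure is preserved."""
    direction = eps_cond - eps_uncond
    if t / T > switch:                 # early steps (t counts down from T): usual CFG
        return eps_uncond + w * direction
    return eps_uncond - w * direction  # later steps: steer away from the concept

T = 50
eps_uncond = torch.randn(1, 4, 64, 64)
eps_cond = torch.randn(1, 4, 64, 64)
early = guided_noise(eps_uncond, eps_cond, t=45, T=T)
late = guided_noise(eps_uncond, eps_cond, t=10, T=T)
print(early.shape, late.shape)
```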
Authors:Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Zhengyuan Dong, Joao Monteiro, Bang Liu, Qiuzhuang Sun, Tianshu Yu
Abstract:
This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we observe distinct impacts of serialization and prompting strategies between open-source and closed-source models, encouraging the development of tailored approaches. Motivated by the findings, we also propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities. This flexible and extendable benchmark not only deepens our understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research in LLM-based graph reasoning. The code and datasets are available at https://github.com/GAI-Community/GraphOmni.
Chinese: GraphOmni作为评估大语言模型图推理能力的综合基准,揭示了影响性能的关键因素,并提出自适应框架以提升模型表现。
English: GraphOmni is a comprehensive benchmark for evaluating LLMs' graph reasoning abilities, revealing key performance factors and proposing an adaptive framework to enhance model capabilities.
Authors:Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa
Abstract:
Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs on large datasets or degraded zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This helps the model both retain past knowledge and learn the new knowledge needed to align features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.
中文: CLIP-Refine是一种后预训练方法,通过随机特征对齐和混合对比蒸馏技术,有效缩小CLIP模型中的模态差距,在提升零样本性能的同时避免性能下降。
English: CLIP-Refine is a post-pre-training method that reduces the modality gap in CLIP models by aligning image and text features through random feature alignment and hybrid contrastive-distillation, improving zero-shot performance without degradation.
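A minimal sketch of the two techniques as described above: RaFA pulls image and text features toward reference vectors drawn from a shared prior, and HyCD mixes ground-truth pairings with the frozen pre-trained CLIP's predictions into soft targets. The unit-norm Gaussian prior and the mixing weight alpha are assumptions, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def rafa_loss(img_feat, txt_feat):
    """Random Feature Alignment sketch: both modalities are pulled toward the
    same randomly drawn reference vectors from a shared prior, shrinking the
    modality gap. The unit-norm Gaussian prior here is an assumption."""
    ref = F.normalize(torch.randn_like(img_feat), dim=-1)   # shared references
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    return ((img - ref).pow(2).sum(-1) + (txt - ref).pow(2).sum(-1)).mean()

def hybrid_soft_targets(frozen_clip_logits, alpha=0.5):
    """HyCD-style targets sketch: mix the ground-truth pairing (identity matrix)
    with the frozen pre-trained CLIP's softmax to keep old knowledge while
    learning the alignment."""
    n = frozen_clip_logits.shape[0]
    hard = torch.eye(n, device=frozen_clip_logits.device)
    soft = frozen_clip_logits.softmax(dim=-1)
    return alpha * hard + (1 - alpha) * soft

img_feat = torch.randn(8, 512, requires_grad=True)
txt_feat = torch.randn(8, 512, requires_grad=True)
loss = rafa_loss(img_feat, txt_feat)
loss.backward()
print(float(loss))
```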
Authors:Pengtao Dang, Tingbo Guo, Melissa Fishel, Guang Lin, Wenzhuo Wu, Sha Cao, Chi Zhang
Abstract:
A physics-informed neural network (PINN) models the dynamics of a system by integrating the governing physical laws into the architecture of a neural network. By enforcing physical laws as constraints, PINN overcomes challenges with data scarcity and potentially high dimensionality. Existing PINN frameworks rely on fully observed time-course data, the acquisition of which could be prohibitive for many systems. In this study, we developed a new PINN learning paradigm, namely Constrained Learning, that enables the approximation of first-order derivatives or motions using non-time course or partially observed data. Computational principles and a general mathematical formulation of Constrained Learning were developed. We further introduced MPOCtrL (Message Passing Optimization-based Constrained Learning), an optimization approach tailored for the Constrained Learning framework that strives to balance the fitting of physical models and observed data. Its code is available at https://github.com/ptdang1001/MPOCtrL. Experiments on synthetic and real-world data demonstrated that MPOCtrL can effectively detect the nonlinear dependency between observed data and the underlying physical properties of the system. In particular, on the task of metabolic flux analysis, MPOCtrL outperforms all existing data-driven flux estimators.
中文: 本研究提出了约束学习这一新的物理信息神经网络范式,能够利用非时间序列数据近似系统动力学,并开发了MPOCtrL优化方法,在保持物理模型与观测数据平衡方面表现优异,在代谢通量分析等任务中超越了现有方法。
English: This study introduces Constrained Learning, a new physics-informed neural network paradigm that uses non-time course data to approximate system dynamics, along with an optimization method called MPOCtrL that effectively balances physical models with observed data, outperforming existing approaches in tasks like metabolic flux analysis.
Authors:Kewen Peng, Hao Zhuo, Yicheng Yang, Tim Menzies
Abstract:
Discrimination-aware classification aims to make accurate predictions while satisfying fairness constraints. Traditional decision tree learners typically optimize for information gain in the target attribute alone, which can result in models that unfairly discriminate against protected social groups (e.g., gender, ethnicity). Motivated by these shortcomings, we propose GroupCART, a tree-based ensemble optimizer that avoids bias during model construction by optimizing not only for decreased entropy in the target attribute but also for increased entropy in protected attributes. Our experiments show that GroupCART achieves fairer models without data transformation and with minimal performance degradation. Furthermore, the method supports customizable weighting, offering a smooth and flexible trade-off between predictive performance and fairness based on user requirements. These results demonstrate that algorithmic bias in decision tree models can be mitigated through multi-task, fairness-aware learning. All code and datasets used in this study are available at: https://github.com/anonymous12138/groupCART.
中文:GroupCART是一种基于树集成的公平性优化方法,通过在模型构建中同时优化目标属性和保护属性的熵,无需数据转换即可实现公平分类,且支持根据需求灵活调整预测性能与公平性的平衡。
English: GroupCART is a fairness-aware ensemble method that optimizes for both target prediction accuracy and fairness by balancing entropy in target and protected attributes, achieving equitable models with minimal performance loss and customizable trade-offs.
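A toy split-scoring function in the spirit of GroupCART's criterion: reward entropy reduction in the target attribute while penalizing entropy reduction in the protected attribute, with a user-tunable weight. The additive form and the weight are assumptions; the actual method builds a tree-based ensemble on top of such a criterion.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def group_split_score(y, protected, mask, w_fair=1.0):
    """Score a candidate split (boolean mask = 'goes left').

    Higher is better: we reward information gain on the target y and penalize
    information loss on the protected attribute, following GroupCART's idea of
    decreasing target entropy while keeping protected-attribute entropy high in
    the children. The weight w_fair is the user-tunable trade-off."""
    n = len(y)
    left, right = mask, ~mask
    if left.sum() == 0 or right.sum() == 0:
        return -np.inf
    def weighted_child_entropy(labels):
        return (left.sum() / n) * entropy(labels[left]) + \
               (right.sum() / n) * entropy(labels[right])
    gain_target = entropy(y) - weighted_child_entropy(y)                      # want large
    loss_protected = entropy(protected) - weighted_child_entropy(protected)   # want small
    return gain_target - w_fair * loss_protected

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x > 0).astype(int)                     # target correlates with x
protected = rng.integers(0, 2, size=200)    # protected attribute is independent of x
print(group_split_score(y, protected, x > 0.0))           # informative, fair split
print(group_split_score(y, protected, protected == 1))    # unfair split scores lower
```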
Authors:Kaira M. Samuel, Faez Ahmed
Abstract:
Engineering problems that apply machine learning often involve computationally intensive methods but rely on limited datasets. As engineering data evolves with new designs and constraints, models must incorporate new knowledge over time. However, high computational costs make retraining models from scratch infeasible. Continual learning (CL) offers a promising solution by enabling models to learn from sequential data while mitigating catastrophic forgetting, where a model forgets previously learned mappings. This work introduces CL to engineering design by benchmarking several CL methods on representative regression tasks. We apply these strategies to five engineering datasets and construct nine new engineering CL benchmarks to evaluate their ability to address forgetting and improve generalization. Preliminary results show that applying existing CL methods to these tasks improves performance over naive baselines. In particular, the Replay strategy achieved performance comparable to retraining in several benchmarks while reducing training time by nearly half, demonstrating its potential for real-world engineering workflows. The code and datasets used in this work will be available at: https://github.com/kmsamuel/cl-for-engineering-release.
中文: 本研究将持续学习引入工程设计,通过在回归任务上对多种方法进行基准测试,结果表明Replay策略在减少近一半训练时间的同时,实现了与完全重新训练相当的性能,相关代码和数据集已公开。
English: This study introduces continual learning to engineering design by benchmarking various methods on regression tasks, showing that the Replay strategy achieves performance close to full retraining while cutting training time by nearly half, with code and datasets made publicly available.
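Since the Replay strategy is highlighted as the most practical, here is a minimal replay-buffer sketch for sequentially arriving regression tasks. The reservoir-sampling buffer, its capacity, and the replay batch size are generic assumptions rather than the benchmark's exact configuration.

```python
import random
import torch
import torch.nn as nn

class ReplayBuffer:
    """Reservoir-style buffer that keeps a bounded sample of past (x, y) pairs."""
    def __init__(self, capacity=1000):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:                                   # reservoir sampling
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, k):
        batch = random.sample(self.data, min(k, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def train_task(model, opt, loader, buffer, replay_k=32):
    """Train on the current task while interleaving replayed past examples,
    mitigating catastrophic forgetting on earlier regression tasks."""
    loss_fn = nn.MSELoss()
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        if buffer.data:                          # mix in replayed examples
            rx, ry = buffer.sample(replay_k)
            loss = loss + loss_fn(model(rx), ry)
        loss.backward()
        opt.step()
        for xi, yi in zip(x, y):                 # store current examples
            buffer.add(xi.detach(), yi.detach())
```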
Authors:Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein
Abstract:
Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
中文: 默认混合专家模型通过使用默认输出为路由器提供密集梯度更新,从而显著提升训练稳定性和性能,且无需额外计算开销。
English: Default MoE enhances training stability and performance by providing dense gradient updates to the router through default outputs, without adding significant computational cost.
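A minimal sketch of the Default MoE idea as described in the abstract: experts that the TopK router did not activate contribute an exponential moving average of their past outputs, so the router receives a gradient from every expert while computation stays sparse. Layer sizes, the EMA decay, and the per-expert loop are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoESketch(nn.Module):
    """Toy MoE layer: non-activated experts contribute a running-average
    'default' output so the router gets a dense gradient, while only the
    TopK-selected experts are actually computed."""

    def __init__(self, d_model=64, n_experts=8, top_k=2, ema_decay=0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.top_k, self.ema_decay = top_k, ema_decay
        # running average of each expert's output (updated without gradients)
        self.register_buffer("default_out", torch.zeros(n_experts, d_model))

    def forward(self, x):                                     # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)              # (T, E)
        _, topk_idx = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = (topk_idx == e).any(dim=-1)              # tokens sent to expert e
            # start every token from this expert's EMA "default" output
            expert_out = self.default_out[e].expand_as(x).clone()
            if routed.any():
                y = expert(x[routed])                         # sparse computation
                expert_out[routed] = y
                with torch.no_grad():                         # update the EMA default
                    self.default_out[e].mul_(self.ema_decay).add_(
                        (1.0 - self.ema_decay) * y.mean(dim=0))
            # every token contributes gate * output, so the router receives a
            # dense gradient even for experts it did not activate
            out = out + gate[:, e:e + 1] * expert_out
        return out

layer = DefaultMoESketch()
print(layer(torch.randn(16, 64)).shape)   # torch.Size([16, 64])
```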
Authors:Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), an adapter architecture which modifies the LoRA framework to only adapt weights for the tokens in the sequence \emph{after} the aLoRA is invoked. This change crucially allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the cache. This enables building what we call \emph{intrinsics}, i.e. specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while achieving significant inference benefits. The codebase is at https://github.com/IBM/activated-lora.
中文: 激活式低秩适应(aLoRA)改进了LoRA框架,仅对调用后的序列令牌进行权重调整,无需重新计算先前的键值缓存即可即时激活,从而显著提升了专用模型操作的推理效率。
English: Activated LoRA (aLoRA) enhances the LoRA framework by adapting weights only for tokens after its invocation, enabling instant activation without recomputing the prior KV cache and improving inference efficiency for specialized model operations.
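A toy linear layer illustrating the aLoRA mechanism: the low-rank update is applied only to positions at or after the adapter's invocation point, so earlier positions match the base model exactly and their KV cache can be reused. Shapes, rank, and the invocation-index interface are assumptions for illustration, not IBM's codebase.

```python
import torch
import torch.nn as nn

class ActivatedLoRALinearSketch(nn.Module):
    """Toy linear layer: the low-rank delta is applied only to token positions
    at or after the adapter's invocation point, so earlier positions (and the
    KV cache computed from them) are identical to the base model's."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x, invoke_pos):              # x: (batch, seq, in_features)
        out = self.base(x)                         # base path for every position
        delta = (x @ self.A @ self.B) * self.scale
        seq = x.shape[1]
        # mask is 1 only for positions >= invoke_pos (the "activated" suffix)
        mask = (torch.arange(seq, device=x.device) >= invoke_pos).float()
        return out + delta * mask.view(1, seq, 1)

base = nn.Linear(32, 32)
layer = ActivatedLoRALinearSketch(base)
x = torch.randn(2, 10, 32)
y = layer(x, invoke_pos=6)
# positions 0..5 are identical to the base model's outputs
assert torch.allclose(y[:, :6], base(x)[:, :6])
print(y.shape)
```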
Authors:Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Lijun Wu, Conghui He
Abstract:
While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles to achieve the iterative refinement and quality control typically provided by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.
中文摘要:GRA框架通过模拟同行评审流程,使多个专业化的小型语言模型协作生成高质量数据,在保持高效可持续的同时,实现了与大型模型相当的性能表现。
English Summary: The GRA framework enables multiple specialized small language models to collaboratively generate high-quality data through a peer-review-inspired process, achieving performance comparable to large models while being more efficient and sustainable.
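A schematic sketch of the Generator/Reviewer/Adjudicator loop described above. The call_small_llm function is a hypothetical placeholder for whatever chat interface the small models are served through, and the prompts and accept/reject protocol are illustrative assumptions.

```python
from dataclasses import dataclass

def call_small_llm(role_prompt: str, content: str) -> str:
    """Hypothetical placeholder for a small-LLM chat call (e.g., a locally
    served 7B model). Replace with a real client before running."""
    raise NotImplementedError

@dataclass
class Candidate:
    sample: str
    review: str = ""
    accepted: bool = False

def gra_round(task_description: str, n_candidates: int = 4) -> list[Candidate]:
    """One peer-review-style synthesis round: Generators propose samples,
    a Reviewer critiques each, and an Adjudicator resolves the final verdict."""
    candidates = [
        Candidate(call_small_llm("You are a data Generator.", task_description))
        for _ in range(n_candidates)
    ]
    for c in candidates:
        c.review = call_small_llm(
            "You are a Reviewer. Critique quality and diversity.", c.sample)
        verdict = call_small_llm(
            "You are an Adjudicator. Answer ACCEPT or REJECT.",
            f"SAMPLE:\n{c.sample}\n\nREVIEW:\n{c.review}")
        c.accepted = verdict.strip().upper().startswith("ACCEPT")
    return [c for c in candidates if c.accepted]
```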
Authors:Liam Schoneveld, Zhe Chen, Davide Davoli, Jiapeng Tang, Saimon Terazawa, Ko Nishino, Matthias Nießner
Abstract:
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
Authors:Aditya Prakash, Benjamin Lundell, Dmitry Andreychuk, David Forsyth, Saurabh Gupta, Harpreet Sawhney
Abstract:
We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.
Authors:Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer
Abstract:
Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-sourced and closed-sourced models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly (69.6% vs. 75.2% for Gemini 1.5 Pro). Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
中文: FLIP数据集作为评估AI推理能力的基准,通过视觉序列任务揭示现有模型与人类表现差距显著,而图像描述和集成方法能有效提升其准确性。
English: The FLIP dataset is introduced as a benchmark to evaluate AI reasoning through visual sequence tasks, revealing that current models significantly lag behind human performance and benefit from captioning and ensemble methods.
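The 15-model ensemble result reduces, in its simplest form, to majority voting over per-challenge answers. The sketch below shows that aggregation step only; the "left"/"right" answer encoding and the tie-break rule are assumptions.

```python
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Combine per-model answers for one FLIP challenge by simple majority;
    ties fall back to the first model's answer (an arbitrary assumption)."""
    counts = Counter(predictions)
    top, top_n = counts.most_common(1)[0]
    tied = [p for p, n in counts.items() if n == top_n]
    return predictions[0] if len(tied) > 1 else top

# Each inner list: one model's answers over three challenges ("left"/"right"
# = which of the two image orderings is the coherent one).
per_model = [
    ["left", "right", "left"],
    ["left", "left", "left"],
    ["right", "right", "left"],
]
ensemble = [majority_vote(list(answers)) for answers in zip(*per_model)]
print(ensemble)   # ['left', 'right', 'left']
```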
Authors:Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover
Abstract:
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO, the first integration of policy gradient methods to masked dLLMs. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM. Our code is released at https://dllm-reasoning.github.io/.
Authors:Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R. Richter, Vladlen Koltun
Abstract:
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion
中文: 本文提出一种单目三维多人姿态跟踪系统,通过从输入图像直接更新姿态实现拥挤场景中的时序一致性,在保持顶尖精度的同时实现了更快、更稳定的在线跟踪效果。
English: This paper presents a monocular 3D multi-person pose tracking system that achieves temporal coherence in crowded scenes through direct pose updates from input images, matching state-of-the-art accuracy while enabling faster and more robust online tracking.
Authors:Yuan Luo, Rudolf Hoffmann, Yan Xia, Olaf Wysocki, Benedikt Schwab, Thomas H. Kolbe, Daniel Cremers
Abstract:
Semantic 3D city models are easily accessible worldwide, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via an SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available at https://gpp-communication.github.io/RADLER.
中文: 本文提出了包含同步雷达图像对和语义3D城市模型的新数据集RadarCity,并开发了RADLER神经网络,通过结合对比自监督学习与语义3D先验知识来增强雷达目标检测性能,较现有方法实现了显著提升。
English: This paper introduces RadarCity, a novel dataset with synchronized radar-image pairs and semantic 3D city models, and proposes RADLER, a neural network that enhances radar object detection by integrating contrastive self-supervised learning with semantic 3D priors, achieving significant performance improvements over existing methods.
Authors:Miaosen Luo, Yuncheng Jiang, Sijie Mai
Abstract:
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture. Our code is released on https://github.com/LuoMSen/KAN-MCP.
中文: KAN-MCP框架通过结合可解释的柯尔莫哥洛夫-阿诺德网络实现跨模态交互透明分析,并采用具有降噪降维功能的多模态清洁帕累托框架增强鲁棒性,有效解决了多模态情感分析中的可解释性与模态不平衡问题。
English: The KAN-MCP framework addresses interpretability and modality imbalance in Multimodal Sentiment Analysis by integrating Kolmogorov-Arnold Networks for transparent cross-modal interaction analysis and the Multimodal Clean Pareto framework with denoising and dimensionality reduction to enhance robustness.
Authors:Yizhuo Wu, Francesco Fioranelli, Chang Gao
Abstract:
Radar-based human activity recognition (HAR) has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as the Vision Transformer (ViT) and State-Space Models (SSMs), offer improved modeling capabilities and have made efforts toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: https://github.com/lab-emi/AIRHAR.
中文: RadMamba提出了一种参数高效的Mamba状态空间模型,用于雷达人体活动识别,在多个数据集上以极低的计算复杂度实现了顶尖的分类精度。
English: RadMamba introduces a parameter-efficient Mamba State-Space Model for radar-based human activity recognition, achieving top-tier accuracy with drastically reduced computational complexity across multiple datasets.
Authors:Heesoo Jung, Hogun Park
Abstract:
Self-supervised learning (SSL) in graphs has garnered significant attention, particularly in employing Graph Neural Networks (GNNs) with pretext tasks initially designed for other domains, such as contrastive learning and feature reconstruction. However, it remains uncertain whether these methods effectively reflect essential graph properties, namely how similar a node's representation is to those of its neighbors. We observe that existing methods occupy opposite ends of a spectrum driven by graph embedding smoothness, with each end corresponding to outperformance on specific downstream tasks. Decomposing the SSL objective into three terms via an information-theoretic framework with a neighbor representation variable reveals that this polarization stems from an imbalance among the terms, a balance that existing methods may not effectively maintain. Further insights suggest that balancing between the extremes can lead to improved performance across a wider range of downstream tasks. Our framework, BSG (Balancing Smoothness in Graph SSL), introduces novel loss functions designed to supplement the representation quality in graph-based SSL by balancing the derived three terms: neighbor loss, minimal loss, and divergence loss. We present a theoretical analysis of the effects of these loss functions, highlighting their significance from both the SSL and graph smoothness perspectives. Extensive experiments on multiple real-world datasets across node classification and link prediction consistently demonstrate that BSG achieves state-of-the-art performance, outperforming existing methods. Our implementation code is available at https://github.com/steve30572/BSG.
中文摘要:BSG框架通过平衡三个损失函数解决了图自监督学习中的不平衡问题,在多种下游任务中实现了最先进的性能。
English Summary: The BSG framework addresses the imbalance in self-supervised graph learning by introducing three balanced loss functions, achieving state-of-the-art performance across various downstream tasks.
Authors:Pascal Schlachter, Jonathan Fuss, Bin Yang
Abstract:
A domain (distribution) shift between training and test data often hinders the real-world performance of deep neural networks, necessitating unsupervised domain adaptation (UDA) to bridge this gap. Online source-free UDA has emerged as a solution for practical scenarios where access to source data is restricted and target data is received as a continuous stream. However, the open-world nature of many real-world applications additionally introduces category shifts meaning that the source and target label spaces may differ. Online source-free universal domain adaptation (SF-UniDA) addresses this challenge. Existing methods mainly rely on self-training with pseudo-labels, yet the relationship between pseudo-labeling and adaptation outcomes has not been studied yet. To bridge this gap, we conduct a systematic analysis through controlled experiments with simulated pseudo-labeling, offering valuable insights into pseudo-labeling for online SF-UniDA. Our findings reveal a substantial gap between the current state-of-the-art and the upper bound of adaptation achieved with perfect pseudo-labeling. Moreover, we show that a contrastive loss enables effective adaptation even with moderate pseudo-label accuracy, while a cross-entropy (CE) loss, though less robust to pseudo-label errors, achieves superior results when pseudo-labeling approaches perfection. Lastly, our findings indicate that pseudo-label accuracy is in general more crucial than quantity, suggesting that prioritizing fewer but high-confidence pseudo-labels is beneficial. Overall, our study highlights the critical role of pseudo-labeling in (online) SF-UniDA and provides actionable insights to drive future advancements in the field. Our code is available at https://github.com/pascalschlachter/PLAnalysis.
中文: 本研究系统分析了在线无源通用域自适应中的伪标签技术,揭示了当前方法与理想性能间的显著差距,并证明对比损失在中等精度伪标签下表现稳健,而交叉熵损失在接近完美伪标签时更优,强调伪标签质量比数量更为关键。
English: This study systematically analyzes pseudo-labeling in online source-free universal domain adaptation, revealing a significant performance gap from ideal conditions and demonstrating that contrastive loss adapts well with moderate pseudo-label accuracy while cross-entropy loss excels with near-perfect labels, emphasizing the importance of prioritizing pseudo-label quality over quantity.
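The controlled experiments rely on simulating pseudo-labels at a chosen accuracy. A minimal corruption routine of that kind is sketched below; drawing wrong labels uniformly from the remaining classes is an assumption about the noise model, not necessarily the paper's protocol.

```python
import numpy as np

def simulate_pseudo_labels(true_labels, accuracy, n_classes, seed=0):
    """Return pseudo-labels that match the ground truth with probability
    `accuracy`; wrong labels are drawn uniformly from the other classes.
    This mimics controlled pseudo-labeling used to probe adaptation behavior."""
    rng = np.random.default_rng(seed)
    labels = true_labels.copy()
    flip = rng.random(len(labels)) > accuracy
    for i in np.where(flip)[0]:
        wrong = [c for c in range(n_classes) if c != true_labels[i]]
        labels[i] = rng.choice(wrong)
    return labels

y = np.random.default_rng(1).integers(0, 10, size=5000)
for acc in (0.5, 0.7, 0.9):
    pl = simulate_pseudo_labels(y, acc, n_classes=10)
    print(acc, float((pl == y).mean()))   # realized accuracy tracks the target
```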
Authors:Linjuan Fan, Di Wen, Kunyu Peng, Kailun Yang, Jiaming Zhang, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiamin Wu, Xudong Han, Rainer Stiefelhagen
Abstract:
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
Chinese: 本文提出了首个用于驾驶员活动识别的标签噪声学习方法,通过基于聚类的表示学习和无需超参数的样本选择策略,在Drive&Act数据集上实现了优于其他方法的性能。
English: This paper introduces the first label noise learning method for driver activity recognition, utilizing clustering-based representation learning and a hyperparameter-free sample selection strategy to achieve superior performance on the Drive&Act dataset.
Authors:Kishan Gurumurthy, Himanshu Pal, Charu Sharma
Abstract:
Graph Neural Network (GNN) research is rapidly advancing due to GNNs' capacity to learn distributed representations from graph-structured data. However, centralizing large volumes of real-world graph data for GNN training is often impractical due to privacy concerns, regulatory restrictions, and commercial competition. Federated learning (FL), a distributed learning paradigm, offers a solution by preserving data privacy with collaborative model training. Despite progress in training huge vision and language models, federated learning for GNNs remains underexplored. To address this challenge, we present a novel method for federated learning on GNNs based on spectral GNNs equipped with neural ordinary differential equations (ODE) for better information capture, showing promising results across both homophilic and heterophilic graphs. Our approach effectively handles non-Independent and Identically Distributed (non-IID) data, while also achieving performance comparable to existing methods that only operate on IID data. It is designed to be privacy-preserving and bandwidth-optimized, making it suitable for real-world applications such as social network analysis, recommendation systems, and fraud detection, which often involve complex, non-IID, and heterophilic graph structures. Our results in the area of federated learning on non-IID heterophilic graphs demonstrate significant improvements, while also achieving better performance on homophilic graphs. This work highlights the potential of federated learning in diverse and challenging graph settings. Open-source code available on GitHub (https://github.com/SpringWiz11/Fed-GNODEFormer).
中文: 本文提出了一种基于谱图神经网络和神经常微分方程的新型联邦学习方法,能够有效处理非独立同分布和异配性图数据,在保持隐私和带宽优化的同时,在各类图结构上均展现出优越性能。
English: This paper introduces a novel federated learning method for Graph Neural Networks using spectral GNNs with neural ODEs, which effectively handles non-IID and heterophilic graphs while maintaining privacy and bandwidth efficiency, demonstrating superior performance across various graph types.
Authors:Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang
Abstract:
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
Chinese: 本文提出GrabS,一种两阶段无监督方法,通过从数据集中学习以对象为中心的先验知识,并利用具身智能体查询发现对象,在无需人工标注的情况下实现了超越现有方法的3D分割性能。
English: This paper introduces GrabS, a two-stage unsupervised method that learns object-centric priors from datasets and uses an embodied agent to discover objects, achieving superior 3D segmentation performance over existing approaches without human labels.
Authors:Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
Abstract:
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3--46.2x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7--14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code is available at https://github.com/LeanModels/DFloat11.
中文: 本文提出DFloat11无损压缩框架,通过动态长度编码和定制GPU解压内核,将大型AI模型体积减少30%同时保持输出结果完全一致。
English: This paper introduces DFloat11, a lossless compression framework that reduces large AI model sizes by 30% while maintaining bit-for-bit identical outputs through dynamic-length encoding and custom GPU decompression kernels.
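DFloat11's compression is motivated by the low entropy of the BFloat16 exponent field. The sketch below estimates that entropy empirically for a toy weight tensor, giving a rough sense of how much a frequency-based code could save per weight; it is a back-of-the-envelope illustration, not the paper's GPU decompression pipeline.

```python
import numpy as np

def bfloat16_exponent_entropy(weights: np.ndarray) -> float:
    """Estimate the empirical entropy (bits/value) of the 8-bit exponent field
    of a bfloat16 weight tensor. Low entropy here is what makes frequency-based
    (entropy) coding of the exponent pay off."""
    w = weights.astype(np.float32)
    bits = w.view(np.uint32)                            # bfloat16 = top 16 bits of float32
    exponent = ((bits >> 23) & 0xFF).astype(np.int64)   # 8 exponent bits
    counts = np.bincount(exponent, minlength=256).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy "weights": roughly Gaussian, like trained LLM weights, concentrate
# their exponents on a handful of values.
w = np.random.normal(0, 0.02, size=1_000_000).astype(np.float32)
h = bfloat16_exponent_entropy(w)
# bfloat16 spends 8 bits on the exponent; entropy coding could shrink it to ~h
print(f"exponent entropy: {h:.2f} bits (vs. 8 stored bits)")
print(f"rough per-weight size: {1 + h + 7:.1f} bits (sign + coded exponent + mantissa)")
```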
Authors:Tianjian Yang, Wei Vivian Li
Abstract:
Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA.
Chinese: 提出的广义概率典型相关分析(GPCCA)是一种无监督方法,能有效整合多模态数据、处理缺失值,并提供稳健的低维嵌入,从而在多种应用中提升聚类和分析效果。
English: The proposed Generalized Probabilistic Canonical Correlation Analysis (GPCCA) is an unsupervised method that effectively integrates multi-modal data, handles missing values, and provides robust low-dimensional embeddings for improved clustering and analysis across various applications.
Authors:Haokun Liu, Sicong Huang, Jingyu Hu, Yangqiaoyu Zhou, Chenhao Tan
Abstract:
There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
Authors:Mansoor Hayat, Supavadee Aramvith, Subrata Bhattacharjee, Nouman Ahmad
Abstract:
Accurate segmentation of abdominal adipose tissue, including subcutaneous (SAT) and visceral adipose tissue (VAT), along with liver segmentation, is essential for understanding body composition and associated health risks such as type 2 diabetes and cardiovascular disease. This study proposes Attention GhostUNet++, a novel deep learning model incorporating Channel, Spatial, and Depth Attention mechanisms into the Ghost UNet++ bottleneck for automated, precise segmentation. Evaluated on the AATTCT-IDS and LiTS datasets, the model achieved Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for liver segmentation, surpassing baseline models. Despite minor limitations in boundary detail segmentation, the proposed model significantly enhances feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis. The implementation of the proposed Attention GhostUNet++ model is available at: https://github.com/MansoorHayat777/Attention-GhostUNetPlusPlus.
中文摘要:本研究提出Attention GhostUNet++深度学习模型,通过集成多种注意力机制实现腹部脂肪组织和肝脏的精准分割,在保持高计算效率的同时显著超越了基准模型的性能表现。
English Summary: The study introduces Attention GhostUNet++, a deep learning model integrating multiple attention mechanisms for precise abdominal adipose and liver tissue segmentation, achieving superior performance over baseline models with enhanced computational efficiency.
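As a reference for the reported segmentation scores, a minimal sketch of the standard Dice coefficient on binary masks (not the authors' evaluation code):

```python
# Minimal sketch of the Dice coefficient used to report segmentation quality
# (standard definition, not the authors' evaluation code).
import numpy as np


def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


pred = np.random.rand(256, 256) > 0.5
target = np.random.rand(256, 256) > 0.5
print(f"Dice: {dice_coefficient(pred, target):.4f}")
```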
Authors:Cheng-Yen Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
Abstract:
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes a substantial loss of fidelity in fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, outperforming 3B baselines and performing on par with specialized folding models. Project page and code: https://bytedance.github.io/dplm/dplm-2.1/.
Authors:Matthew Thomas Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, Jakob Nicolaus Foerster
Abstract:
Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single, comprehensive hyperparameter space, enabling algorithm development in a shared hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms - TD3-AWR (model-free) and MoBRAC (model-based) - which substantially outperform established baselines. Our implementation is publicly available at https://github.com/EmptyJackson/unifloral.
中文摘要:该摘要指出离线强化学习面临实现不一致和评估不公等挑战,并提出了一个统一框架,包含简洁实现和严格评估协议,从而开发出超越现有方法的新算法。
English Summary: The abstract highlights challenges in offline RL, such as inconsistent implementations and unfair evaluations, and introduces a unified framework with clean implementations and a rigorous protocol that leads to novel algorithms outperforming existing methods.
Authors:Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan
Abstract:
TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community, and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.
中文: TextArena是一个开源的文本竞技游戏集合,包含57+种独特环境,专门用于训练和评估大语言模型的动态社交能力(如谈判与欺骗),通过可扩展框架和实时评分系统弥补传统基准测试的不足。
English: TextArena is an open-source platform featuring over 57 competitive text-based games designed to train and evaluate social skills like negotiation and deception in LLMs, addressing gaps in traditional benchmarks through its extensible framework and real-time scoring system.
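A small sketch of how an online-play rating update could look with the `trueskill` Python package; whether TextArena uses this exact library and configuration is an assumption, so treat the snippet as illustrative only.

```python
# Sketch of a TrueSkill rating update after one two-player game, using the
# `trueskill` package; TextArena's actual rating backend may differ.
import trueskill

env = trueskill.TrueSkill(draw_probability=0.05)
model_a = env.create_rating()   # new model starts at mu=25, sigma~8.33
model_b = env.create_rating()

# Suppose model_a wins a negotiation game against model_b.
model_a, model_b = env.rate_1vs1(model_a, model_b)
print(f"A: mu={model_a.mu:.2f} sigma={model_a.sigma:.2f}")
print(f"B: mu={model_b.mu:.2f} sigma={model_b.sigma:.2f}")
```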
Authors:Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, Nantheera Anantrasirichai
Abstract:
White blood cell (WBC) classification assists in assessing immune health and diagnosing various diseases, yet manual classification is labor-intensive and prone to inconsistencies. Recent advancements in deep learning have shown promise over traditional methods; however, challenges such as data imbalance and the computational demands of modern architectures, such as Transformer-based models that do not scale well with input size, limit their practical application. This paper introduces a novel framework that leverages Mamba models integrated with ensemble learning to improve WBC classification. Mamba models, known for their linear complexity, provide a scalable alternative to Transformer-based approaches, making them suitable for deployment in resource-constrained environments. Additionally, we introduce a new WBC dataset, Chula-WBC-8, for benchmarking. Our approach not only validates the effectiveness of Mamba models in this domain but also demonstrates their potential to significantly enhance classification efficiency without compromising accuracy. The source code can be found at https://github.com/LewisClifton/Mamba-WBC-Classification.
中文:本文提出了一种结合Mamba模型与集成学习的新框架,通过线性复杂度模型提升白细胞分类效率,并发布了Chula-WBC-8基准数据集,为资源受限环境提供了可行的解决方案。
English: This paper introduces a novel framework using Mamba models with ensemble learning to improve white blood cell classification, offering a scalable and efficient alternative to Transformer-based methods while introducing a new benchmark dataset.
Authors:Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Abstract:
Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
Chinese: 本文提出了一种双空间知识蒸馏(DSKD)框架,通过投影隐藏状态和对齐标记来统一师生模型的输出空间,实现了不同词汇表大语言模型间的有效知识迁移,并在多个基准测试中显著优于现有方法。
English: This paper introduces a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of teacher and student models by projecting hidden states and aligning tokens, enabling effective distillation between large language models with different vocabularies and outperforming existing methods.
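For intuition, a minimal PyTorch sketch of the unified-output-space idea: project student hidden states into the teacher's representation space so both distributions come from the same (teacher) head, then match them with KL. The paper's projector initialization and exact token alignment (ETA) are omitted, and all dimensions below are illustrative.

```python
# Minimal sketch of the dual-space KD idea: project student hidden states into
# the teacher's representation space so both distributions are produced by the
# same (teacher) head, then match them with KL. The paper's projector
# initialization and exact token alignment (ETA) are omitted here.
import torch
import torch.nn.functional as F

batch, seq, d_student, d_teacher, vocab = 2, 8, 256, 512, 1000

teacher_head = torch.nn.Linear(d_teacher, vocab, bias=False)
projector = torch.nn.Linear(d_student, d_teacher)  # student -> teacher space

student_hidden = torch.randn(batch, seq, d_student, requires_grad=True)
teacher_hidden = torch.randn(batch, seq, d_teacher)

# Both logits now live in the teacher's output space.
student_logits = teacher_head(projector(student_hidden))
teacher_logits = teacher_head(teacher_hidden).detach()

kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
kd_loss.backward()
print(f"unified-space KD loss: {kd_loss.item():.4f}")
```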
Authors:Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
Abstract:
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
Chinese: 该方法通过利用搜索生成的成功与失败路径对大型语言模型进行微调,在显著提升推理成功率的同时,大幅降低了推理时间,优于传统搜索方法。
English: The proposed method enhances reasoning in large language models by fine-tuning them with search-generated successful and failed paths, significantly improving success rates and reducing inference time compared to traditional search-based approaches.
Authors:Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Abstract:
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at https://github.com/TomSheng21/R-TPT.
中文: 针对视觉语言模型易受对抗攻击的问题,本文提出R-TPT方法,通过优化熵目标和加权集成策略,在无需标注数据的情况下实现推理阶段的灵活防御。
English: Vision-language models face heightened adversarial risks, so this paper introduces R-TPT, a test-time prompt tuning method that strengthens defenses without labeled data by optimizing entropy and employing weighted ensembling.
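A hedged PyTorch sketch of the two ingredients named in the abstract, pointwise entropy minimization over augmented views and reliability-weighted ensembling (lower entropy treated as more reliable); this is not the released R-TPT code, and the random logits below stand in for actual CLIP outputs.

```python
# Sketch of the two ingredients described in the abstract: pointwise entropy
# minimization over augmented views and reliability-weighted ensembling of
# their predictions (lower entropy = more reliable). Not the released R-TPT code.
import torch
import torch.nn.functional as F

n_views, n_classes = 8, 10
logits = torch.randn(n_views, n_classes, requires_grad=True)  # stand-in for CLIP logits

probs = F.softmax(logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-view entropy

# (1) Test-time objective: minimize pointwise entropy of each augmented view.
loss = entropy.mean()
loss.backward()  # in R-TPT the gradient would update the learnable prompt

# (2) Reliability-based weighted ensembling: low-entropy views get more weight.
weights = F.softmax(-entropy.detach(), dim=0)
ensembled = (weights[:, None] * probs.detach()).sum(dim=0)
print("ensembled prediction:", ensembled.argmax().item())
```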
Authors:Taewook Kang, Bum-Jae You, Juyoun Park, Yisoo Lee
Abstract:
The growing demand for robots to operate effectively in diverse environments necessitates robust real-time anomaly detection techniques during robotic operations. However, deep learning-based models in robotics face significant challenges due to limited training data and highly noisy signal features. In this paper, we present a Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model to address these problems. This approach integrates a Masked Autoregressive Flow model into an Adversarial AutoEncoder to construct a flexible latent space and utilizes a sparse autoencoder to focus efficiently on important features, even in scenarios with limited feature space. Our experiments demonstrate that the proposed model achieves a 4.96% to 9.75% higher area under the receiver operating characteristic curve for pick-and-place robotic operations with randomly placed cans, compared to existing state-of-the-art methods. Notably, it showed up to 19.67% better performance in scenarios involving collisions with lightweight objects. Additionally, unlike the existing state-of-the-art model, our model performs inference within 1 millisecond, ensuring real-time anomaly detection. These capabilities make our model highly applicable to machine learning-based robotic safety systems in dynamic environments. The code is available at https://github.com/twkang43/sparse-maf-aae.
中文: 本文提出了一种基于稀疏掩码自回归流的对抗自编码器模型,通过优化特征聚焦和潜空间灵活性,在机器人抓放任务和碰撞场景中显著提升了实时异常检测性能,同时实现了毫秒级推理速度。
English: This paper introduces a Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model that enhances real-time anomaly detection in robotics by improving feature focus and latent space flexibility, achieving superior performance in pick-and-place tasks and collision scenarios while ensuring millisecond-level inference speeds.
Authors:P. Tomkiewicz, J. Jaworski, P. Zielonka, A. Wilinski
Abstract:
This paper presents a novel computational approach for evaluating urban metrics through density gradient analysis using multi-modal satellite imagery, with applications including public transport and other urban systems. By combining optical and Synthetic Aperture Radar (SAR) data, we develop a method to segment urban areas, identify urban centers, and quantify density gradients. Our approach calculates two key metrics: the density gradient coefficient ($\alpha$) and the minimum effective distance (LD) at which density reaches a target threshold. We further employ machine learning techniques, specifically K-means clustering, to objectively identify uniform and high-variability regions within density gradient plots. We demonstrate that these metrics provide an effective screening tool for public transport analyses by revealing the underlying urban structure. Through comparative analysis of two representative cities with contrasting urban morphologies (monocentric vs. polycentric), we establish relationships between density gradient characteristics and public transport network topologies. Cities with clear density peaks in their gradient plots indicate distinct urban centers requiring different transport strategies than those with more uniform density distributions. This methodology offers urban planners a cost-effective, globally applicable approach to preliminary public transport assessment using freely available satellite data. The complete implementation, with additional examples and documentation, is available in an open-source repository under the MIT license at https://github.com/nexri/Satellite-Imagery-Urban-Analysis.
中文: 本文提出了一种利用多模态卫星影像和机器学习分析城市密度梯度的新计算方法,为基于城市结构的公共交通系统评估提供了一种经济有效的工具。
English: This paper introduces a novel computational method using multi-modal satellite imagery and machine learning to analyze urban density gradients, providing a cost-effective tool for evaluating public transport systems based on urban structure.
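A hedged sketch of one way such a density gradient coefficient could be estimated, fitting an exponential decay rho(d) ≈ rho0·exp(-alpha·d) to density versus distance from the detected center; the paper's exact definitions of $\alpha$ and LD may differ.

```python
# Hedged sketch: estimate a density gradient coefficient alpha by fitting an
# exponential decay rho(d) ~ rho0 * exp(-alpha * d) to density vs. distance
# from the urban center. The paper's exact definitions of alpha and LD may differ.
import numpy as np

distance_km = np.linspace(0.5, 20, 40)
true_alpha = 0.15
density = 5000 * np.exp(-true_alpha * distance_km) * np.random.uniform(0.9, 1.1, 40)

# Linear fit in log space: log(rho) = log(rho0) - alpha * d
slope, intercept = np.polyfit(distance_km, np.log(density), 1)
alpha = -slope
rho0 = np.exp(intercept)

# Minimum effective distance LD at which density reaches a target threshold.
threshold = 500.0
LD = np.log(rho0 / threshold) / alpha
print(f"alpha ~ {alpha:.3f}, LD ~ {LD:.1f} km")
```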
Authors:Alexandru Vasilache, Jona Scholz, Vincent Schilling, Sven Nitzsche, Florian Kaelber, Johannes Korsch, Juergen Becker
Abstract:
Spiking Neural Networks (SNNs) offer promising energy efficiency advantages, particularly when processing sparse spike trains. However, their incompatibility with traditional datasets, which consist of batches of input vectors rather than spike trains, necessitates the development of efficient encoding methods. This paper introduces a novel, open-source PyTorch-compatible Python framework for spike encoding, designed for neuromorphic applications in machine learning and reinforcement learning. The framework supports a range of encoding algorithms, including Leaky Integrate-and-Fire (LIF), Step Forward (SF), Pulse Width Modulation (PWM), and Ben's Spiker Algorithm (BSA), as well as specialized encoding strategies covering population coding and reinforcement learning scenarios. Furthermore, we investigate the performance trade-offs of each method on embedded hardware using C/C++ implementations, considering energy consumption, computation time, spike sparsity, and reconstruction accuracy. Our findings indicate that SF typically achieves the lowest reconstruction error and offers the highest energy efficiency and fastest encoding speed, achieving the second-best spike sparsity. At the same time, other methods demonstrate particular strengths depending on the signal characteristics. This framework and the accompanying empirical analysis provide valuable resources for selecting optimal encoding strategies for energy-efficient SNN applications.
中文: 本文提出了一种新型开源脉冲编码框架,支持多种编码算法,实证研究表明步进前向编码在能量效率和重构精度方面表现最优,而其他方法在不同信号特性下各具优势。
English: This paper presents a novel open-source PyTorch-compatible spike encoding framework supporting multiple algorithms, with empirical analysis showing Step Forward encoding achieves optimal energy efficiency and reconstruction accuracy while other methods excel in specific signal scenarios.
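For concreteness, a minimal sketch of Step Forward (SF) encoding as it is commonly described: a positive spike raises the running baseline by the threshold, a negative spike lowers it; the framework's own implementation and parameterization may differ.

```python
# Minimal sketch of Step Forward (SF) encoding: emit +1 and raise the baseline
# when the signal exceeds baseline + threshold, emit -1 and lower it when the
# signal falls below baseline - threshold. The framework's implementation may differ.
import numpy as np


def step_forward_encode(signal, threshold):
    spikes = np.zeros_like(signal, dtype=np.int8)
    base = signal[0]
    for t, x in enumerate(signal):
        if x > base + threshold:
            spikes[t] = 1
            base += threshold
        elif x < base - threshold:
            spikes[t] = -1
            base -= threshold
    return spikes


def step_forward_decode(spikes, start, threshold):
    return start + threshold * np.cumsum(spikes)


t = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * 3 * t)
spikes = step_forward_encode(signal, threshold=0.1)
recon = step_forward_decode(spikes, signal[0], threshold=0.1)
print(f"spike sparsity: {np.mean(spikes != 0):.2f}, "
      f"reconstruction MSE: {np.mean((signal - recon) ** 2):.4f}")
```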
Authors:Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon
Abstract:
Detecting empathy from video interactions is an emerging area of research, particularly in healthcare and social robotics. However, privacy and ethical concerns often prevent the release of raw video data, with many datasets instead shared as pre-extracted tabular features. Previous work on such datasets has established classical tree-based models as the state of the art. Motivated by recent successes of large-scale foundation models for text, we investigate the potential of tabular foundation models (TFMs) for empathy detection from video-derived tabular data. Our proposed system, TFMPathy, is demonstrated with two recent TFMs (TabPFN v2 and TabICL) under both in-context learning and fine-tuning paradigms. On a public human-robot interaction benchmark, TFMPathy significantly improves empathy detection accuracy reported in the literature. While the established evaluation protocol in the literature does not ensure cross-subject generalisation, our evaluation scheme also captures such generalisation. We show that TFMPathy under a fine-tuning setup has better cross-subject generalisation capacity than baseline methods (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). Given the ongoing privacy and ethical constraints around raw video sharing, the proposed TFMPathy system provides a practical and scalable path toward building AI systems dependent on human-centred video datasets. Our code is publicly available at https://github.com/hasan-rakibul/TFMPathy (to be made available upon acceptance of this paper).
中文: 提出的TFMPathy系统利用表格基础模型,在解决隐私限制的同时,显著提升了基于视频衍生数据的共情检测准确率和跨被试泛化能力。
English: The proposed TFMPathy system leverages tabular foundation models to significantly improve empathy detection accuracy and cross-subject generalization from video-derived data while addressing privacy constraints.
Authors:Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri
Abstract:
Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain experts to first manually specify data-quality constraints specific to a given table before data cleaning algorithms can be applied.
In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any table, without requiring domain experts to manually specify them on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints.
Our extensively labeled benchmark dataset with 2400 real data columns, as well as our code are available at https://github.com/qixuchen/AutoTest to facilitate future research.
Chinese: 本研究提出了一种名为语义域约束的新型数据质量约束,无需领域专家手动指定即可自动推断并应用于任何表格,既能直接检测错误,又能增强现有数据清洗方法,并具有可证明的质量保证。
English: This study introduces Semantic-Domain Constraints, a novel class of data-quality constraints that can be automatically inferred and applied to any table without manual domain-expert input, enabling both direct error detection and enhancement of existing data-cleaning methods with provable quality guarantees.
Authors:Xiulong Liu, Anurag Kumar, Paul Calamia, Sebastia V. Amengual, Calvin Murdock, Ishwarya Ananthabhotla, Philip Robinson, Eli Shlizerman, Vamsi Krishna Ithapu, Ruohan Gao
Abstract:
In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce ACOUSTICROOMS, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.
Authors:Ankit Kumar Shaw, Kun Jiang, Tuopu Wen, Chandan Kumar Sah, Yining Shi, Mengmeng Yang, Diange Yang, Xiaoli Lian
Abstract:
The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
中文摘要:CleanMAP是一个基于多模态大语言模型的蒸馏框架,通过评估车道可见性并融合高质量局部地图来优化众包高清地图更新,在自动驾驶导航中实现了卓越的精度和与人工评估的高度一致性。
English Summary: CleanMAP is a multimodal large language model-based framework that refines crowdsourced data for high-definition map updates by scoring lane visibility and fusing top-quality local maps, achieving superior accuracy and human alignment in autonomous navigation.
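A small sketch of the confidence-driven selection rule as described in the abstract: keep local maps scoring within 10% of the best confidence score and fuse the top-k of them (k = 3 gave the best reported result). This is an interpretation of the abstract, not the released code.

```python
# Sketch of the confidence-driven selection described in the abstract: keep
# local maps scoring within 10% of the best confidence score, then fuse the
# top-k of those (k = 3 gave the best result in the paper). Hedged sketch only.
def select_local_maps(scored_maps, k=3):
    """scored_maps: list of (map_id, confidence_score in [0, 10])."""
    best = max(score for _, score in scored_maps)
    candidates = [(m, s) for m, s in scored_maps if s >= best - 0.1 * best]
    candidates.sort(key=lambda x: x[1], reverse=True)
    return candidates[:k]


scored = [("map_a", 9.1), ("map_b", 8.7), ("map_c", 8.4), ("map_d", 6.0)]
print(select_local_maps(scored))  # map_d falls outside the best-minus-10% range
```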
Authors:Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr
Abstract:
Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax
Chinese: 本文提出"越狱代价"概念,指出尽管越狱攻击能突破AI模型防护,但会导致模型性能显著下降——在多项基准测试中准确率降幅高达92%,并建立了评估越狱攻击效用的新基准。
English: This paper introduces the concept of "jailbreak tax," showing that while jailbreak attacks can bypass AI model safeguards, they significantly reduce the model's utility, as demonstrated by up to a 92% drop in accuracy across various benchmarks.
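A minimal sketch of a jailbreak-tax-style metric as the relative utility drop between the base model and its jailbroken, refusal-aligned counterpart; the benchmark's precise definition may differ.

```python
# Sketch of a jailbreak-tax style metric: the relative drop in task utility
# (e.g., math accuracy) between the original model and its jailbroken,
# refusal-aligned counterpart. The benchmark's exact definition may differ.
def jailbreak_tax(base_accuracy, jailbroken_accuracy):
    """Fraction of utility lost when answers are obtained via a jailbreak."""
    return 1.0 - jailbroken_accuracy / base_accuracy


# A model at 80% math accuracy that drops to 6.4% when jailbroken pays a tax
# of 92%, matching the worst case reported in the abstract.
print(f"jailbreak tax: {jailbreak_tax(0.80, 0.064):.0%}")
```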
Authors:Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Abstract:
Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr
中文: LEMUR 是一个开源数据集与框架,通过统一模板标准化 PyTorch 神经网络模型,集成自动化优化与分析工具,旨在加速 AutoML 研究并保障实验可复现性。
English: LEMUR is an open-source dataset and framework that standardizes PyTorch neural networks across various tasks, integrating automated optimization and analysis tools to accelerate AutoML research and ensure reproducibility.
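A minimal example of the kind of Optuna-driven hyperparameter search LEMUR automates, using Optuna's public API on a stand-in objective; LEMUR's own wrappers, model templates, and results database are not shown.

```python
# Minimal example of the Optuna-style hyperparameter search LEMUR automates,
# on a stand-in objective; LEMUR's own wrappers and results database are not shown.
import optuna


def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    # Stand-in for training a model and returning its validation accuracy.
    return 1.0 - abs(lr - 1e-3) - 0.001 * (batch_size == 32)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best params:", study.best_params)
```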
Authors:Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
Abstract:
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.
中文摘要:ColorBench是一个评估视觉语言模型颜色理解能力的新基准,揭示了现有模型在颜色感知、推理和鲁棒性方面存在显著不足,尽管模型规模扩大和思维链推理能带来一定提升。
English Summary: ColorBench is a novel benchmark that evaluates vision-language models' color understanding, revealing significant limitations in their ability to perceive, reason with, and maintain robustness regarding colors, despite scaling laws and CoT reasoning offering some improvements.
Authors:Ning Li, Jingran Zhang, Justin Cui
Abstract:
Large language models (LLMs) demonstrate strong capabilities in reasoning and question answering, yet their tendency to generate factually incorrect content remains a critical challenge. This study evaluates proprietary and open-source LLMs on generating relevant research papers with accurate arXiv links. Our evaluation reveals critical academic risks: LLMs frequently generate incorrect arXiv links or references to non-existent papers, fundamentally undermining their ability to properly attribute research contributions to the actual authors. We introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories on arXiv. Our findings show concerning accuracy variations across subjects, with Claude-3.5-Sonnet exhibiting a substantial advantage in generating both relevant and accurate responses. Notably, most LLMs perform significantly better in Artificial Intelligence than in other subfields. This benchmark provides a standardized tool for evaluating LLM reliability in scientific contexts, promoting more dependable academic use in research environments. Our code and dataset are available at https://github.com/liningresearch/arXivBench and https://huggingface.co/datasets/arXivBenchLLM/arXivBench.
中文: 本研究推出arXivBench基准测试,揭示大语言模型常生成错误arXiv链接和虚假参考文献,其中Claude-3.5-Sonnet在人工智能领域表现最佳,凸显了其在学术应用中的严重可靠性问题。
English: This study introduces arXivBench, a benchmark revealing that large language models often produce inaccurate arXiv links and references, with Claude-3.5-Sonnet showing superior accuracy particularly in Artificial Intelligence, highlighting critical reliability concerns in academic applications.
Authors:Vikranth Udandarao, Noel Abraham Tiju, Muthuraj Vairamuthu, Harsh Mistry, Dhruv Kumar
Abstract:
In this paper, we present Roamify, an Artificial Intelligence-powered travel assistant that aims to ease the process of travel planning. We have tested and used multiple Large Language Models like Llama and T5 to generate personalised itineraries per user preferences. Results from user surveys highlight the preference for AI-powered tools over existing methods for travel planning across all user age groups. These results firmly validate the need for such a travel assistant. We highlight the two primary design considerations for travel assistance: D1) incorporating a web-scraping method to gather up-to-date news articles about destinations from various blog sources, which significantly improves our itinerary suggestions, and D2) utilising user preferences to create customised travel experiences along with a recommendation system that adjusts the itinerary according to user needs. Our findings suggest that Roamify has the potential to improve and simplify how users across multiple age groups plan their travel experiences.
中文: Roamify是一款人工智能驱动的旅行助手,它利用大型语言模型和网络爬取数据,根据用户偏好生成个性化行程,研究显示各年龄段用户均倾向于使用此类AI工具来简化旅行规划。
English: Roamify is an AI-powered travel assistant that uses large language models to create personalized itineraries based on user preferences and real-time web-scraped data, demonstrating strong user preference across all age groups for simplifying travel planning.
Authors:Yasser Benigmim, Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette
Abstract:
In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., "a photo of a <class>", "a sketch of a <class>"). We investigate the impact of templates for OVSS, and find that for each class, there exist single-template classifiers--which we refer to as class-experts--that significantly outperform the conventional averaged classifier. First, to identify these class-experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class-wise prediction entropy of single-template classifiers, we select those yielding the lowest entropy as the most reliable class-experts. Second, we combine the outputs of class-experts in a new fusion process. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state-of-the-art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low-data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS.
中文摘要:本文提出FLOSS方法,通过熵分析筛选单模板类别专家分类器并进行融合,无需额外训练或标注即可显著提升开放词汇语义分割性能。
English Summary: This paper introduces FLOSS, a plug-and-play method that identifies single-template class-expert classifiers through entropy analysis and combines them to significantly enhance Open-Vocabulary Semantic Segmentation without requiring additional training or labels.
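A hedged sketch of the entropy-based class-expert selection described above: for each class, keep the single-template classifier whose predictions on unlabeled images have the lowest entropy. The per-class entropy estimate below (averaging over images predicted as that class) is an illustrative choice, not the released FLOSS code.

```python
# Sketch of the entropy-based class-expert selection described in the abstract:
# for each class, pick the single-template classifier whose predictions on
# unlabeled images have the lowest entropy. Not the released FLOSS code.
import torch

n_templates, n_images, n_classes = 5, 100, 3
# Stand-in for per-template CLIP logits on unlabeled images.
logits = torch.randn(n_templates, n_images, n_classes)
probs = logits.softmax(dim=-1)

experts = {}
for c in range(n_classes):
    entropies = []
    for t in range(n_templates):
        mask = probs[t].argmax(dim=-1) == c       # images this template assigns to class c
        p = probs[t][mask]
        ent = -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean() if mask.any() else float("inf")
        entropies.append(float(ent))
    experts[c] = min(range(n_templates), key=lambda t: entropies[t])
print("class -> expert template:", experts)
```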
Authors:Davide Piras, Francesco Sorrenti, Ruth Durrer, Martin Kunz
Abstract:
We develop a novel approach to constrain the Hubble parameter $H_0$ and the primordial power spectrum amplitude $A_\mathrm{s}$ using type Ia supernovae (SNIa) data. By considering SNIa as tracers of the peculiar velocity field, we can model their distance and their covariance as a function of cosmological parameters without the need of calibrators like Cepheids; this yields a new independent probe of the large-scale structure based on SNIa data without distance anchors. Crucially, we implement a differentiable pipeline in JAX, including efficient emulators and affine sampling, reducing inference time from years to hours on a single GPU. We first validate our method on mock datasets, demonstrating that we can constrain $H_0$ and $\log 10^{10}A_\mathrm{s}$ within $10\%$ and $15\%$, respectively, using $\mathcal{O}(10^3)$ SNIa. We then test our pipeline with SNIa from an $N$-body simulation, obtaining $6\%$-level unbiased constraints on $H_0$ with a moderate noise level. We finally apply our method to Pantheon+ data, constraining $H_0$ at the $15\%$ level without Cepheids when fixing $A_\mathrm{s}$ to its $\it{Planck}$ value. On the other hand, we obtain $20\%$-level constraints on $\log 10^{10}A_\mathrm{s}$ in agreement with $\it{Planck}$ when including Cepheids in the analysis. In light of upcoming observations of low redshift SNIa from the Zwicky Transient Facility and the Vera Rubin Legacy Survey of Space and Time, surveys for which our method will develop its full potential, we make our code publicly available.
Chinese: 我们开发了一种利用Ia型超新星作为本动速度示踪物的新方法,无需距离校准器即可独立约束哈勃参数和原初功率谱振幅,通过JAX可微分计算框架将推断时间从数年缩短至数小时。
English: We introduce a novel method using type Ia supernovae as peculiar velocity tracers to independently constrain the Hubble parameter and primordial power spectrum amplitude, achieving efficient inference through a differentiable JAX pipeline that reduces computation time from years to hours.
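A hedged JAX sketch of the differentiable-likelihood idea: once the forward model is written in JAX, gradients of a Gaussian log-likelihood with respect to cosmological parameters come for free. The linear stand-in model below is illustrative, not the paper's emulators or covariance.

```python
# Hedged sketch of the differentiable-likelihood idea: with the pipeline written
# in JAX, gradients of a Gaussian log-likelihood with respect to cosmological
# parameters come for free. Stand-in model only, not the paper's emulators.
import jax
import jax.numpy as jnp


def neg_log_like(h0, distances, d_obs, cov_inv):
    # Stand-in: predicted distances scale inversely with H0 (c*z/H0 at low z).
    pred = distances * (70.0 / h0)
    r = d_obs - pred
    return 0.5 * r @ cov_inv @ r


key = jax.random.PRNGKey(0)
distances = jnp.linspace(50.0, 500.0, 100)            # fiducial distances, Mpc
d_obs = distances + 2.0 * jax.random.normal(key, (100,))
cov_inv = jnp.eye(100) / 4.0

grad_fn = jax.grad(neg_log_like)
print("dNLL/dH0 at H0=72:", grad_fn(72.0, distances, d_obs, cov_inv))
```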
Authors:Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
Abstract:
Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art Deepseek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same size transformer. With throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain of thought reasoning.
Chinese: M1模型基于Mamba架构的混合线性RNN,通过提高推理效率,在性能上超越了以往的线性RNN模型,与顶尖蒸馏模型相当,同时相比Transformer实现了超过3倍的生成加速。
English: The M1 model, a hybrid linear RNN based on the Mamba architecture, enhances reasoning efficiency by outperforming prior linear RNNs and matching state-of-the-art distilled models while achieving over 3x speedup in generation compared to transformers.
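A minimal sketch of self-consistency voting as used to trade throughput for accuracy under a fixed generation budget: sample several reasoning traces and return the majority final answer. The sampling function is a hypothetical stand-in for the reasoning model.

```python
# Sketch of self-consistency voting: sample several reasoning traces within a
# fixed generation-time budget and return the most common final answer.
import random
from collections import Counter


def self_consistency(sample_fn, n_samples=8):
    """sample_fn() returns one final answer string from a sampled reasoning trace."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for sampling from the reasoning model.
print(self_consistency(lambda: random.choice(["42", "42", "41"])))
```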
Authors:Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
Abstract:
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
中文: RealWebAssist 是一个新颖的基准测试,旨在评估AI代理处理现实世界中连续、模糊用户指令的能力,当前模型在意图推理和图形界面定位方面仍面临显著挑战。
English: RealWebAssist is a new benchmark designed to evaluate AI agents' ability to handle sequential, ambiguous real-world user instructions for long-horizon web tasks, where current models struggle with intent reasoning and GUI grounding.
Authors:Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
Abstract:
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
中文: 本文提出时序动态上下文(TDC)方法,通过将视频分割为语义场景、使用时序上下文压缩器减少标记数量,并采用思维链策略处理超长视频,在视频与音频理解基准测试中表现优异。
English: This paper introduces Temporal Dynamic Context (TDC), a dynamic long video encoding method that segments videos into scenes, compresses tokens using a temporal context compressor, and employs a chain-of-thought strategy for enhanced video and audio understanding, achieving strong performance on benchmarks.
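A minimal PyTorch sketch of a query-based compressor of the kind described: a small set of learnable queries cross-attends to a segment's video/audio tokens and returns a fixed number of temporal context tokens; the actual TDC module is more elaborate, and all dimensions below are illustrative.

```python
# Minimal sketch of a query-based compressor of the kind described: a small set
# of learnable queries cross-attends to a segment's video/audio tokens and
# yields a fixed number of temporal context tokens. The actual TDC module is
# more elaborate.
import torch
import torch.nn as nn


class QueryCompressor(nn.Module):
    def __init__(self, dim=256, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, segment_tokens):           # (batch, n_tokens, dim)
        q = self.queries.unsqueeze(0).expand(segment_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, segment_tokens, segment_tokens)
        return compressed                         # (batch, n_queries, dim)


tokens = torch.randn(2, 1024, 256)  # one scene segment's visual+audio tokens
print(QueryCompressor()(tokens).shape)  # torch.Size([2, 16, 256])
```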
Authors:Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy
Abstract:
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
中文摘要:LLM-SRBench是一个新颖的基准测试,通过涵盖四个科学领域的239个挑战性问题来严格评估大语言模型在科学方程发现中的能力,其设计能有效防止记忆效应,结果显示当前最优方法仅达31.5%的准确率,凸显了该领域的研究挑战。
English Summary: LLM-SRBench is a novel benchmark designed to rigorously evaluate LLMs' scientific equation discovery capabilities by preventing memorization through 239 challenging problems across four domains, revealing that current methods achieve only 31.5% accuracy and underscoring the field's difficulties.
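A sketch of one way symbolic accuracy could be checked, testing with SymPy whether the predicted and ground-truth equations differ by zero after simplification; the benchmark's actual matching protocol may be stricter.

```python
# Sketch of one way to check symbolic equivalence between a predicted and a
# ground-truth equation with SymPy; the benchmark's matching protocol may be stricter.
import sympy as sp

x = sp.symbols("x")
ground_truth = sp.sympify("x**2 + 2*x + 1")
predicted = sp.sympify("(x + 1)**2")

is_equivalent = sp.simplify(ground_truth - predicted) == 0
print("symbolically equivalent:", is_equivalent)
```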
Authors:Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin
Abstract:
Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region -- primarily in the early layers -- where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework in which, during the first stage, the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW's efficacy depends on its precise integration into specific neural network layers -- termed the representation processing region -- where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40$\times$ acceleration in training speed compared to REPA, the current state-of-the-art method. Code is available at https://github.com/LINs-lab/ERW.
Chinese: 扩散模型的训练收敛因早期层高质量表征学习不足而变慢,为此我们提出嵌入式表征预热(ERW)框架,用高质量预训练表征初始化扩散模型的早期层,相比当前最优方法REPA实现了40倍的训练加速并提升了表征质量。
English: Diffusion models converge slowly because high-quality representations are underused during training, so we propose Embedded Representation Warmup (ERW), a plug-and-play framework that initializes the early layers of a diffusion model with pretrained representations, achieving a 40x training speedup over the state-of-the-art REPA while improving representation quality.
Authors:Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
Abstract:
Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at https://github.com/OPTML-Group/MU-Coreset.
中文: 大语言模型遗忘可以通过仅使用遗忘集中极小部分(如5%)作为核心集有效实现,这归因于高影响力关键词而非整个数据集的作用,且该效应在不同方法和数据选择策略中均表现稳健。
English: Large language model unlearning can be effectively achieved using a surprisingly small subset of the forget set, known as a coreset, as minimal as 5%, due to the influence of high-impact keywords rather than the entire dataset, with this effect being robust across various methods and data selection approaches.
Authors:Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, Zuozhu Liu
Abstract:
Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.
中文摘要:MT-R1-Zero框架通过规则与指标混合奖励机制,成功将强化学习应用于机器翻译领域,在实现与先进模型相媲美性能的同时,展现出在多语言和低资源场景下的强大泛化能力。
English Summary: The MT-R1-Zero framework successfully adapts reinforcement learning to machine translation by using a rule-metric mixed reward mechanism, achieving competitive performance against advanced models while demonstrating strong generalization in multilingual and low-resource settings.
Authors:Zengyuan Lai, Jiarui Yang, Songpengcheng Xia, Lizhou Lin, Lan Sun, Renwen Wang, Jianran Liu, Qi Wu, Ling Pei
Abstract:
Millimeter-wave radar provides a privacy-preserving solution for human motion analysis, yet its sparse point clouds pose significant challenges for semantic understanding. We present Radar-LLM, the first framework that leverages large language models (LLMs) for human motion understanding using millimeter-wave radar as the sensing modality. Our approach introduces two key innovations: (1) a motion-guided radar tokenizer based on our Aggregate VQ-VAE architecture that incorporates deformable body templates and masked trajectory modeling to encode spatiotemporal point clouds into compact semantic tokens, and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To address data scarcity, we introduce a physics-aware synthesis pipeline that generates realistic radar-text pairs from motion-text datasets. Extensive experiments demonstrate that Radar-LLM achieves state-of-the-art performance across both synthetic and real-world benchmarks, enabling accurate translation of millimeter-wave signals to natural language descriptions. This breakthrough facilitates comprehensive motion understanding in privacy-sensitive applications like healthcare and smart homes. We will release the full implementation to support further research on https://inowlzy.github.io/RadarLLM/.
Authors:Zhisheng Zhang, Derui Wang, Qianyi Yang, Pengyang Huang, Junhan Pu, Yuxin Cao, Kai Ye, Jie Hao, Yixian Yang
Abstract:
Speech synthesis technology has brought great convenience, while the widespread usage of realistic deepfake audio has triggered hazards. Malicious adversaries may unauthorizedly collect victims' speeches and clone a similar voice for illegal exploitation (\textit{e.g.}, telecom fraud). However, the existing defense methods cannot effectively prevent deepfake exploitation and are vulnerable to robust training techniques. Therefore, a more effective and robust data protection method is urgently needed. In response, we propose a defensive framework, \textit{\textbf{SafeSpeech}}, which protects the users' audio before uploading by embedding imperceptible perturbations on original speeches to prevent high-quality synthetic speech. In SafeSpeech, we devise a robust and universal proactive protection technique, \textbf{S}peech \textbf{PE}rturbative \textbf{C}oncealment (\textbf{SPEC}), that leverages a surrogate model to generate universally applicable perturbation for generative synthetic models. Moreover, we optimize the human perception of embedded perturbation in terms of time and frequency domains. To evaluate our method comprehensively, we conduct extensive experiments across advanced models and datasets, both subjectively and objectively. Our experimental results demonstrate that SafeSpeech achieves state-of-the-art (SOTA) voice protection effectiveness and transferability and is highly robust against advanced adaptive adversaries. Moreover, SafeSpeech has real-time capability in real-world tests. The source code is available at \href{https://github.com/wxzyd123/SafeSpeech}{https://github.com/wxzyd123/SafeSpeech}.
中文:SafeSpeech是一种防御框架,通过在原始语音中嵌入难以察觉的扰动来防止高质量合成伪造音频,提供针对未经授权语音克隆和滥用的强大实时保护。
English: SafeSpeech is a defensive framework that embeds imperceptible perturbations into original speech to prevent high-quality synthetic deepfake audio, offering robust, real-time protection against unauthorized voice cloning and exploitation.
Authors:Korel Gundem, Zhengling Qi
Abstract:
In this paper, we study the offline sequential feature-based pricing and inventory control problem where the current demand depends on the past demand levels and any demand exceeding the available inventory is lost. Our goal is to leverage the offline dataset, consisting of past prices, ordering quantities, inventory levels, covariates, and censored sales levels, to estimate the optimal pricing and inventory control policy that maximizes long-term profit. While the underlying dynamic without censoring can be modeled by a Markov decision process (MDP), the primary obstacle arises from the observed process where demand censoring is present, resulting in missing profit information, the failure of the Markov property, and a non-stationary optimal policy. To overcome these challenges, we first approximate the optimal policy by solving a high-order MDP characterized by the number of consecutive censoring instances, which ultimately boils down to solving a specialized Bellman equation tailored for this problem. Inspired by offline reinforcement learning and survival analysis, we propose two novel data-driven algorithms to solve these Bellman equations and, thus, estimate the optimal policy. Furthermore, we establish finite-sample regret bounds to validate the effectiveness of these algorithms. Finally, we conduct numerical experiments to demonstrate the efficacy of our algorithms in estimating the optimal policy. To the best of our knowledge, this is the first data-driven approach to learning optimal pricing and inventory control policies in a sequential decision-making environment characterized by censored and dependent demand. The implementations of the proposed algorithms are available at https://github.com/gundemkorel/Inventory_Pricing_Control
中文摘要:本文提出了首个数据驱动方法,用于在具有截断和依赖需求的序列决策环境中优化定价与库存策略,通过新算法结合理论保证和实证验证实现了策略学习。
English Summary: This paper introduces the first data-driven approach for optimizing pricing and inventory policies in sequential decision-making with censored and dependent demand, proposing novel algorithms with theoretical guarantees and empirical validation.
Authors:Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, James Zou
Abstract:
Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.
中文: 该评审反馈代理系统利用大语言模型为同行评审提供自动反馈,通过在ICLR 2025的大规模实验证明,能显著提升评审质量、增加评审长度并促进审稿人参与度。
English: The Review Feedback Agent uses large language models to provide automated feedback on peer reviews, significantly improving review quality, length, and reviewer engagement as demonstrated in a large-scale ICLR 2025 study.
Authors:Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao
Abstract:
Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions, varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
中文: 本文提出了一种课程学习框架,通过策略优势和置信上界原则自适应地调度不同数据分布的强化学习后训练,从而显著提升大语言模型的收敛速度与最终性能。
English: This paper introduces a curriculum learning framework that adaptively schedules training across diverse data distributions in reinforcement learning-based post-training of large language models, using policy advantages and the Upper Confidence Bound principle to enhance convergence speed and performance.
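The UCB-driven scheduling described above can be sketched in a few lines: each data distribution is treated as an arm whose exploitation value is the running mean of the observed policy advantages and whose exploration bonus favors rarely sampled distributions. The constant `c`, the use of absolute advantages, and the distribution names are assumptions for illustration, not the paper's exact rule.

```python
import math
import random

class DistributionUCB:
    """UCB-style sampler over data distributions (a sketch of the idea, not the released code)."""

    def __init__(self, names, c: float = 1.0):
        self.names = list(names)
        self.c = c
        self.counts = {n: 0 for n in self.names}
        self.mean_adv = {n: 0.0 for n in self.names}
        self.total = 0

    def select(self) -> str:
        self.total += 1
        def score(n):
            if self.counts[n] == 0:
                return float("inf")                       # force initial exploration
            bonus = self.c * math.sqrt(math.log(self.total) / self.counts[n])
            return self.mean_adv[n] + bonus               # exploitation + exploration
        return max(self.names, key=score)

    def update(self, name: str, abs_advantage: float) -> None:
        self.counts[name] += 1
        k = self.counts[name]
        self.mean_adv[name] += (abs_advantage - self.mean_adv[name]) / k

if __name__ == "__main__":
    sampler = DistributionUCB(["easy_math", "hard_math", "logic"])
    for _ in range(20):
        dist = sampler.select()
        # stand-in for the average |advantage| an RL step (e.g. GRPO) would report on a batch
        sampler.update(dist, abs_advantage=random.random())
    print(sampler.counts)
```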
Authors:Jiahao Qiu, Yinghui He, Xinzhe Juan, Yimin Wang, Yuhan Liu, Zixin Yao, Yue Wu, Xun Jiang, Ling Yang, Mengdi Wang
Abstract:
The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent
中文摘要:EmoAgent框架通过EmoEval组件评估AI交互导致的心理健康风险,并利用EmoGuard实时监测干预,实验证明能显著降低脆弱用户群体34.4%以上的心理状态恶化率。
English Summary: The EmoAgent framework addresses mental health risks in human-AI interactions by using EmoEval to assess psychological deterioration through clinical tools and EmoGuard to monitor and mitigate harm, significantly reducing deterioration rates in vulnerable users.
Authors:Chenbin Zhang, Zhiqiang Hu, Chuchu Jiang, Wen Chen, Jie Xu, Shaoting Zhang
Abstract:
Drug-target binding affinity prediction is a fundamental task for drug discovery. It has been extensively explored in literature and promising results are reported. However, in this paper, we demonstrate that the results may be misleading and cannot be well generalized to real practice. The core observation is that the canonical randomized split of a test set in conventional evaluation leaves the test set dominated by samples with high similarity to the training set. The performance of models is severely degraded on samples with lower similarity to the training set but the drawback is highly overlooked in current evaluation. As a result, the performance can hardly be trusted when the model meets low-similarity samples in real practice. To address this problem, we propose a framework of similarity aware evaluation in which a novel split methodology is proposed to adapt to any desired distribution. This is achieved by a formulation of optimization problems which are approximately and efficiently solved by gradient descent. We perform extensive experiments across five representative methods in four datasets for two typical target evaluations and compare them with various counterpart methods. Results demonstrate that the proposed split methodology can significantly better fit desired distributions and guide the development of models. Code is released at https://github.com/Amshoreline/SAE/tree/main.
中文: 本文指出传统药物靶点结合力评估因测试集与训练集高度相似而产生误导性结果,并提出一种基于优化分割的相似性感知评估框架,能更准确地反映模型在真实场景中的性能。
English: This paper reveals that conventional drug-target binding affinity evaluations are misleading due to test sets dominated by high-similarity samples, and proposes a similarity-aware framework with optimized data splitting to better reflect real-world performance.
Authors:Lin Zhu, Xinbing Wang, Chenghu Zhou, Nanyang Ye
Abstract:
Recent advances in large pre-trained models showed promising results in few-shot learning. However, their generalization ability on two-dimensional Out-of-Distribution (OoD) data, i.e., correlation shift and diversity shift, has not been thoroughly investigated. Researches have shown that even with a significant amount of training data, few methods can achieve better performance than the standard empirical risk minimization method (ERM) in OoD generalization. This few-shot OoD generalization dilemma emerges as a challenging direction in deep neural network generalization research, where the performance suffers from overfitting on few-shot examples and OoD generalization errors. In this paper, leveraging a broader supervision source, we explore a novel Bayesian cross-modal image-text alignment learning method (Bayes-CAL) to address this issue. Specifically, the model is designed as only text representations are fine-tuned via a Bayesian modelling approach with gradient orthogonalization loss and invariant risk minimization (IRM) loss. The Bayesian approach is essentially introduced to avoid overfitting the base classes observed during training and improve generalization to broader unseen classes. The dedicated loss is introduced to achieve better image-text alignment by disentangling the causal and non-casual parts of image features. Numerical experiments demonstrate that Bayes-CAL achieved state-of-the-art OoD generalization performances on two-dimensional distribution shifts. Moreover, compared with CLIP-like models, Bayes-CAL yields more stable generalization performances on unseen classes. Our code is available at https://github.com/LinLLLL/BayesCAL.
中文: 本文提出的贝叶斯跨模态对齐方法Bayes-CAL通过专门设计的损失函数微调文本表征,有效解决了小样本分布外泛化问题,在未见类别上展现出更稳定的性能表现。
English: This paper introduces Bayes-CAL, a Bayesian cross-modal alignment method that addresses few-shot OoD generalization by fine-tuning text representations with specialized losses to prevent overfitting and improve stability on unseen classes.
Authors:Simon Adamov, Joel Oskarsson, Leif Denby, Tomas Landelius, Kasper Hintz, Simon Christiansen, Irene Schicker, Carlos Osuna, Fredrik Lindsten, Oliver Fuhrer, Sebastian Schemm
Abstract:
Machine learning is revolutionizing global weather forecasting, with models that efficiently produce highly accurate forecasts. Apart from global forecasting there is also a large value in high-resolution regional weather forecasts, focusing on accurate simulations of the atmosphere for a limited area. Initial attempts have been made to use machine learning for such limited area scenarios, but these experiments do not consider realistic forecasting settings and do not investigate the many design choices involved. We present a framework for building kilometer-scale machine learning limited area models with boundary conditions imposed through a flexible boundary forcing method. This enables boundary conditions defined either from reanalysis or operational forecast data. Our approach employs specialized graph constructions with rectangular and triangular meshes, along with multi-step rollout training strategies to improve temporal consistency. We perform systematic evaluation of different design choices, including the boundary width, graph construction and boundary forcing integration. Models are evaluated across both a Danish and a Swiss domain, two regions that exhibit different orographical characteristics. Verification is performed against both gridded analysis data and in-situ observations, including a case study for the storm Ciara in February 2020. Both models achieve skillful predictions across a wide range of variables, with our Swiss model outperforming the numerical weather prediction baseline for key surface variables. With their substantially lower computational cost, our findings demonstrate great potential for machine learning limited area models in the future of regional weather forecasting.
中文: 机器学习通过采用灵活边界条件和专业图结构的新框架,推动了高分辨率区域天气预报的发展,实现了在降低计算成本的同时获得精准预测的能力。
English: Machine learning is advancing regional weather forecasting through a novel framework that enables high-resolution, kilometer-scale models with flexible boundary conditions and specialized graph constructions, achieving skillful predictions with lower computational costs.
Authors:Shubham Aggarwal, Dipankar Maity, Tamer Başar
Abstract:
In this letter, we explore the communication-control co-design of discrete-time stochastic linear systems through reinforcement learning. Specifically, we examine a closed-loop system involving two sequential decision-makers: a scheduler and a controller. The scheduler continuously monitors the system's state but transmits it to the controller intermittently to balance the communication cost and control performance. The controller, in turn, determines the control input based on the intermittently received information. Given the partially nested information structure, we show that the optimal control policy follows a certainty-equivalence form. Subsequently, we analyze the qualitative behavior of the scheduling policy. To develop the optimal scheduling policy, we propose InterQ, a deep reinforcement learning algorithm which uses a deep neural network to approximate the Q-function. Through extensive numerical evaluations, we analyze the scheduling landscape and further compare our approach against two baseline strategies: (a) a multi-period periodic scheduling policy, and (b) an event-triggered policy. The results demonstrate that our proposed method outperforms both baselines. The open source implementation can be found at https://github.com/AC-sh/InterQ.
中文摘要:本文通过强化学习探索了随机线性系统的通信与控制协同设计,提出的InterQ算法利用深度神经网络逼近Q函数,在平衡通信成本与控制性能方面优于周期性调度和事件触发两种基准策略。
English Summary: This letter presents a reinforcement learning-based co-design of communication and control for stochastic linear systems, introducing the InterQ algorithm that outperforms baseline scheduling policies by optimizing transmission intervals to balance communication costs and control performance.
Authors:Jiawei Li
Abstract:
Instruction fine-tuning attacks pose a significant threat to large language models (LLMs) by subtly embedding poisoned data in fine-tuning datasets, which can trigger harmful or unintended responses across a range of tasks. This undermines model alignment and poses security risks in real-world deployment. In this work, we present a simple and effective approach to detect and mitigate such attacks using influence functions, a classical statistical tool adapted for machine learning interpretation. Traditionally, the high computational costs of influence functions have limited their application to large models and datasets. The recent Eigenvalue-Corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation method enables efficient influence score computation, making it feasible for large-scale analysis.
We are the first to apply influence functions for detecting language model instruction fine-tuning attacks on large-scale datasets, as both the instruction fine-tuning attack on language models and the influence calculation approximation technique are relatively new. Our large-scale empirical evaluation of influence functions on 50,000 fine-tuning examples and 32 tasks reveals a strong association between influence scores and sentiment. Building on this, we introduce a novel sentiment transformation combined with influence functions to detect and remove critical poisons -- poisoned data points that skew model predictions. Removing these poisons (only 1% of total data) recovers model performance to near-clean levels, demonstrating the effectiveness and efficiency of our approach. Artifact is available at https://github.com/lijiawei20161002/Poison-Detection.
WARNING: This paper contains offensive data examples.
中文: 本文提出了一种针对大型语言模型指令微调攻击的新型检测方法,该方法利用语义变换下的影响函数来识别关键毒化样本,无需攻击先验知识,仅需移除约1%的毒化数据即可将模型性能恢复至接近正常水平。
English: This paper introduces a novel detection method for instruction finetuning attacks on LLMs that uses influence functions under semantic transformation to identify critical poison examples without prior knowledge of the attack, effectively restoring model performance by removing just 1% of poisoned data.
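As a rough illustration of influence-based poison detection, the sketch below computes a first-order influence proxy: the dot product between the gradient of a probe loss and the gradient of a training example's loss. The paper relies on the EK-FAC approximation of the inverse-Hessian term, which this sketch drops entirely, so the scores here are only a conceptual proxy; all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

def flat_grad(loss: torch.Tensor, model: nn.Module) -> torch.Tensor:
    """Flatten the gradients of a scalar loss with respect to all model parameters."""
    grads = torch.autograd.grad(loss, [p for p in model.parameters()])
    return torch.cat([g.reshape(-1) for g in grads])

def first_order_influence(model, loss_fn, train_batch, probe_batch) -> float:
    """First-order influence proxy: grad(probe loss) . grad(train loss).

    A simplification of influence functions that omits the inverse-Hessian
    (EK-FAC) factor used in the paper; scores are only a rough indicator.
    """
    x_tr, y_tr = train_batch
    x_pr, y_pr = probe_batch
    g_train = flat_grad(loss_fn(model(x_tr), y_tr), model)
    g_probe = flat_grad(loss_fn(model(x_pr), y_pr), model)
    return torch.dot(g_probe, g_train).item()

if __name__ == "__main__":
    model = nn.Linear(8, 2)
    loss_fn = nn.CrossEntropyLoss()
    train_batch = (torch.randn(4, 8), torch.randint(0, 2, (4,)))
    probe_batch = (torch.randn(4, 8), torch.randint(0, 2, (4,)))
    score = first_order_influence(model, loss_fn, train_batch, probe_batch)
    print(f"influence proxy: {score:.4f}")  # unusually large scores flag candidate poisons
```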
Authors:Han Liao, Shuaishuai Zu
Abstract:
Knowledge Tracing (KT) is a fundamental task in Intelligent Tutoring Systems (ITS), which aims to model the dynamic knowledge states of students based on their interaction histories. However, existing KT models often rely on a global forgetting decay mechanism for capturing learning patterns, assuming that students' performance is predominantly influenced by their most recent interactions. Such approaches fail to account for the diverse and complex learning patterns arising from individual differences and varying learning stages. To address this limitation, we propose RouterKT, a novel Mixture-of-Experts (MoE) architecture designed to capture heterogeneous learning patterns by enabling experts to specialize in different patterns without any handcrafted learning pattern bias such as forgetting decay. Specifically, RouterKT introduces a \textbf{person-wise routing mechanism} to effectively model individual-specific learning behaviors and employs \textbf{multi-heads as experts} to enhance the modeling of complex and diverse patterns. Comprehensive experiments on ten benchmark datasets demonstrate that RouterKT exhibits significant flexibility and improves the performance of various KT backbone models, with a maximum average AUC improvement of 3.29\% across different backbones and datasets, outperforming other state-of-the-art models. Moreover, RouterKT demonstrates consistently superior inference efficiency compared to existing approaches based on handcrafted learning pattern bias, highlighting its usability for real-world educational applications. The source code is available at https://github.com/ringotc/RouterKT.git.
中文: RouterKT通过引入专家混合架构和个性化路由机制,无需依赖全局遗忘衰减假设即可捕捉多样化学习模式,在多个基准测试中显著提升了知识追踪的性能与推理效率。
English: RouterKT introduces a Mixture-of-Experts architecture with person-wise routing to capture diverse learning patterns without relying on global forgetting mechanisms, significantly improving knowledge tracing performance and efficiency across multiple benchmarks.
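A minimal sketch of the person-wise routing idea follows: a learnable embedding per student produces routing weights that mix the outputs of several "expert" heads, letting heads specialize in different learning patterns without any handcrafted forgetting bias. Dimensions, the linear experts, and the softmax router are illustrative assumptions rather than the repository's architecture.

```python
import torch
import torch.nn as nn

class PersonWiseMoE(nn.Module):
    """Sketch of person-wise routing over multi-head experts (illustrative, not the released code)."""

    def __init__(self, num_students: int, d_model: int = 64, num_experts: int = 4):
        super().__init__()
        self.student_emb = nn.Embedding(num_students, d_model)
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

    def forward(self, interactions: torch.Tensor, student_ids: torch.Tensor) -> torch.Tensor:
        # interactions: (batch, seq, d_model) encoded exercise/response features
        weights = torch.softmax(self.router(self.student_emb(student_ids)), dim=-1)   # (batch, E)
        expert_out = torch.stack([e(interactions) for e in self.experts], dim=-1)     # (batch, seq, d, E)
        return (expert_out * weights[:, None, None, :]).sum(dim=-1)                   # (batch, seq, d)

if __name__ == "__main__":
    moe = PersonWiseMoE(num_students=100)
    x = torch.randn(8, 20, 64)
    sid = torch.randint(0, 100, (8,))
    print(moe(x, sid).shape)  # torch.Size([8, 20, 64])
```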
Authors:Xijin Ge
Abstract:
Motivation: The visualization and analysis of high-dimensional data are essential in biomedical research. There is a need for secure, scalable, and reproducible tools to facilitate data exploration and interpretation. Results: We introduce DataMap, a browser-based application for visualization of high-dimensional data using heatmaps, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). DataMap runs in the web browser, ensuring data privacy while eliminating the need for installation or a server. The application has an intuitive user interface for data transformation, annotation, and generation of reproducible R code. Availability and Implementation: Freely available as a GitHub page https://gexijin.github.io/datamap/. The source code can be found at https://github.com/gexijin/datamap, and can also be installed as an R package. Contact: Xijin.Ge@sdstate.ed
中文:DataMap是一款基于浏览器的安全工具,可通过热图、PCA和t-SNE可视化高维生物医学数据,无需安装即可保障数据隐私并生成可复现的R代码。
English: DataMap is a secure, browser-based tool for visualizing high-dimensional biomedical data through heatmaps, PCA, and t-SNE, offering data privacy and reproducible R code without installation.
Authors:Yuchu Jiang, Jiale Fu, Chenduo Hao, Xinting Hu, Yingzhe Peng, Xin Geng, Xu Yang
Abstract:
Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://github.com/Kamichanw/MimIC.
Chinese: 摘要介绍了模仿上下文学习(MimIC)方法,该方法通过四项关键改进从上下文示例中学习稳定的偏移效应,以增强大型多模态模型,在多模态任务中表现优于现有方法。
English: The abstract introduces Mimic In-Context Learning (MimIC), a method that enhances Large Multimodal Models by learning stable shift effects from in-context demonstrations through four key improvements, outperforming existing approaches in multimodal tasks.
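To illustrate the shift-vector view of in-context demonstrations, the sketch below adds one learnable shift vector per attention head, scaled by a query-dependent gate, after an attention layer. It is a simplified stand-in for the enhancements listed in the abstract (the layer-wise alignment loss is omitted), and all shapes and module names are assumptions rather than the released MimIC implementation.

```python
import torch
import torch.nn as nn

class HeadwiseShift(nn.Module):
    """Sketch of MimIC-style shifts: per-head shift vectors with query-dependent magnitudes,
    added after the attention layer (simplified illustration, not the released code)."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(num_heads, head_dim))   # one shift vector per head
        self.gate = nn.Linear(num_heads * head_dim, num_heads)        # query-dependent magnitude

    def forward(self, attn_out: torch.Tensor, query_hidden: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, heads, head_dim); query_hidden: (batch, seq, heads*head_dim)
        magnitude = torch.sigmoid(self.gate(query_hidden))            # (batch, seq, heads)
        return attn_out + magnitude.unsqueeze(-1) * self.shift        # broadcast over head_dim

if __name__ == "__main__":
    layer = HeadwiseShift(num_heads=8, head_dim=32)
    attn_out = torch.randn(2, 16, 8, 32)
    hidden = torch.randn(2, 16, 8 * 32)
    print(layer(attn_out, hidden).shape)  # torch.Size([2, 16, 8, 32])
```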
Authors:Vasiliki Tassopoulou, Haochang Shou, Christos Davatzikos
Abstract:
Longitudinal biomedical studies monitor individuals over time to capture dynamics in brain development, disease progression, and treatment effects. However, estimating trajectories of brain biomarkers is challenging due to biological variability, inconsistencies in measurement protocols (e.g., differences in MRI scanners), scarcity, and irregularity in longitudinal measurements. Herein, we introduce a novel personalized deep kernel regression framework for forecasting brain biomarkers, with application to regional volumetric measurements. Our approach integrates two key components: a population model that captures brain trajectories from a large and diverse cohort, and a subject-specific model that captures individual trajectories. To optimally combine these, we propose Adaptive Shrinkage Estimation, which effectively balances population and subject-specific models. We assess our model's performance through predictive accuracy metrics, uncertainty quantification, and validation against external clinical studies. Benchmarking against state-of-the-art statistical and machine learning models -- including linear mixed effects models, generalized additive models, and deep learning methods -- demonstrates the superior predictive performance of our approach. Additionally, we apply our method to predict trajectories of composite neuroimaging biomarkers, which highlights the versatility of our approach in modeling the progression of longitudinal neuroimaging biomarkers. Furthermore, validation on three external neuroimaging studies confirms the robustness of our method across different clinical contexts. We make the code available at https://github.com/vatass/AdaptiveShrinkageDKGP.
中文: 本研究提出了一种个性化深度核回归框架,通过自适应收缩估计整合群体与个体模型,能够精确预测脑生物标志物,在预测精度和跨研究验证方面均优于现有方法。
English: This study introduces a personalized deep kernel regression framework that combines population and individual models through Adaptive Shrinkage Estimation to accurately forecast brain biomarkers, demonstrating superior performance over existing methods in predictive accuracy and cross-study validation.
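The population/subject combination can be illustrated with a simple precision-weighted blend: the less certain a model is, the less weight its forecast receives. This inverse-variance rule is only a stand-in for the paper's Adaptive Shrinkage Estimation, which is learned rather than fixed; the numbers below are toy values.

```python
import numpy as np

def shrinkage_forecast(pop_mean, pop_var, subj_mean, subj_var):
    """Blend population-level and subject-specific forecasts by inverse-variance weighting.

    A simple precision-weighted stand-in for Adaptive Shrinkage Estimation:
    higher subject-model uncertainty shifts weight toward the population model.
    """
    pop_mean, pop_var = np.asarray(pop_mean, float), np.asarray(pop_var, float)
    subj_mean, subj_var = np.asarray(subj_mean, float), np.asarray(subj_var, float)
    w_subj = pop_var / (pop_var + subj_var)          # weight on the subject-specific model
    blended_mean = w_subj * subj_mean + (1.0 - w_subj) * pop_mean
    blended_var = (pop_var * subj_var) / (pop_var + subj_var)
    return blended_mean, blended_var

if __name__ == "__main__":
    # With few longitudinal observations the subject model is uncertain, so the
    # population trajectory dominates; the balance flips as subject data accumulate.
    mean, var = shrinkage_forecast(pop_mean=3.1, pop_var=0.2, subj_mean=2.7, subj_var=0.8)
    print(round(float(mean), 3), round(float(var), 3))
```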
Authors:Zirui Chen, Zhaoyang Zhang, Ziqing Xing, Ridong Li, Zhaohui Yang, Richeng Jin, Chongwen Huang, Yuzhi Yang, Mérouane Debbah
Abstract:
Existing learning models often exhibit poor generalization when deployed across diverse scenarios. This is primarily because the underlying reference frame of the data varies with the deployment environment and settings. However, although the data of each scenario has a distinct reference frame, its generation generally follows common underlying physical rules. Based on this understanding, this article proposes a deep learning framework named analogical learning (AL), which implicitly retrieves the reference frame information associated with a scenario and then makes accurate predictions by relative analogy with other scenarios. Specifically, we design a bipartite neural network called Mateformer. Its first part captures the relativity within multiple latent feature spaces between the input data and a small amount of embedded data from the studied scenario, while its second part uses this relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments validate AL's superiority across three key dimensions. First, it achieves state-of-the-art accuracy in single-scenario benchmarks. Second, it demonstrates stable transferability between different scenarios, avoiding catastrophic forgetting. Finally, and most importantly, it robustly adapts to new, unseen scenarios--including dynamic weather and traffic conditions--without any tuning. All data and code are available at https://github.com/ziruichen-research/ALLoc.
中文:该文提出的类比学习框架通过Mateformer神经网络实现,能隐式适应特定场景的参考框架并利用共享物理规律,从而在无需重新训练的情况下,于动态环境中实现卓越的准确性、可迁移性和鲁棒性。
English: The proposed analogical learning framework, implemented via the Mateformer neural network, enhances generalization by implicitly adapting to scenario-specific reference frames while leveraging shared physical principles, achieving superior accuracy, transferability, and robustness in dynamic environments without retraining.
Authors:Zheyuan Lai, Yingming Pu
Abstract:
The complexity of chemical space and the limited, biased scope of human knowledge pose immense challenges for scientists, as well as for automated materials discovery. Existing intelligent methods rely heavily on numerical computation, leading to inefficient exploration and results that are hard to interpret. To bridge this gap, we introduce PriM, a principles-guided material discovery system powered by a language-inferential multi-agent system (MAS). Our framework integrates automated hypothesis generation with experimental validation in a roundtable system of MAS, enabling systematic exploration while maintaining scientific rigor. Based on our framework, a case study on nano helix demonstrates a higher materials exploration rate and property value while providing transparent reasoning pathways. This approach establishes an automated and transparent paradigm for material discovery, with broad implications for the rational design of functional materials. Code is publicly available at our \href{https://github.com/amair-lab/PriM}{GitHub}.
中文摘要:PriM系统采用基于语言的多智能体方法,通过原则引导与实验验证相结合,实现了材料发现的高效探索与透明推理。
English Summary: The PriM system uses a language-based multi-agent approach to automate material discovery, enhancing exploration efficiency and interpretability through principled guidance and experimental validation.
Authors:Lucas Beerens, Desmond J. Higham
Abstract:
We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally-supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at https://github.com/LucasBeerens/CRAFTed-Diffusion .
中文摘要:本研究提出一种通过微调将对抗性功能嵌入扩散模型的隐蔽攻击方法,使模型在生成看似正常图像的同时能有效误导下游分类器,且用户难以察觉其恶意性质。
English Summary: This study introduces a stealthy attack method that embeds adversarial capabilities into diffusion models through fine-tuning, enabling them to generate seemingly normal images that reliably mislead downstream classifiers while remaining undetectable to users.
Authors:Yang Yang, Tong Zhang, Jian Wu, Lijie Su
Abstract:
With the rapid advancement of large language models, academic topic identification and topic evolution analysis are crucial for enhancing AI's understanding capabilities. Dynamic topic analysis provides a powerful approach to capturing and understanding the temporal evolution of topics in large-scale datasets. This paper presents a two-stage dynamic topic analysis framework that incorporates convex optimization to improve topic consistency, sparsity, and interpretability. In Stage 1, a two-layer non-negative matrix factorization (NMF) model is employed to extract annual topics and identify key terms. In Stage 2, a convex optimization algorithm refines the dynamic topic structure using the convex NMF (cNMF) model, further enhancing topic integration and stability. Applying the proposed method to IEEE journal abstracts from 2004 to 2022 effectively identifies and quantifies emerging research topics, such as COVID-19 and digital twins. By optimizing sparsity differences in the clustering feature space between traditional and emerging research topics, the framework provides deeper insights into topic evolution and ranking analysis. Moreover, the NMF-cNMF model demonstrates superior stability in topic consistency. At sparsity levels of 0.4, 0.6, and 0.9, the proposed approach improves topic ranking stability by 24.51%, 56.60%, and 36.93%, respectively. The source code (to be released after publication) is available at https://github.com/meetyangyang/CDNMF.
中文摘要:本文提出了一种采用凸优化的两阶段动态主题分析框架,通过非负矩阵分解和凸非负矩阵分解模型提升主题一致性与稳定性,有效识别了2004至2022年IEEE摘要中COVID-19和数字孪生等新兴研究主题。
English Summary: This paper introduces a two-stage dynamic topic analysis framework using convex optimization to enhance topic consistency and stability, effectively identifying emerging research trends like COVID-19 and digital twins in IEEE abstracts from 2004 to 2022.
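The Stage-1 step of the framework, extracting per-period topics and key terms via NMF, can be sketched with scikit-learn as below. The convex-NMF refinement of Stage 2 is omitted, and the corpus, component count, and parameters are toy values rather than the paper's setup.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Stage-1 sketch: extract topics for one period with plain NMF on TF-IDF features.
# The paper's Stage 2 refines these factors with a convex-NMF (cNMF) model, not shown here.
abstracts_2020 = [
    "covid-19 forecasting with recurrent neural networks",
    "digital twin models for smart manufacturing systems",
    "graph neural networks for power grid monitoring",
    "covid-19 contact tracing and epidemic modeling",
    "digital twin driven predictive maintenance",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts_2020)

nmf = NMF(n_components=3, init="nndsvd", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(X)             # document-topic weights (W)
terms = vectorizer.get_feature_names_out()

for k, topic in enumerate(nmf.components_):  # topic-term weights (H)
    top_terms = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top_terms}")
```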
Authors:Anton Thielmann, Arik Reuter, Benjamin Saefken
Abstract:
In recent years, deep neural networks have showcased their predictive power across a variety of tasks. Beyond natural language processing, the transformer architecture has proven efficient in addressing tabular data problems and challenges the previously dominant gradient-based decision trees in these areas. However, this predictive power comes at the cost of intelligibility: Marginal feature effects are almost completely lost in the black-box nature of deep tabular transformer networks. Alternative architectures that use the additivity constraints of classical statistical regression models can maintain intelligible marginal feature effects, but often fall short in predictive power compared to their more complex counterparts. To bridge the gap between intelligibility and performance, we propose an adaptation of tabular transformer networks designed to identify marginal feature effects. We provide theoretical justifications that marginal feature effects can be accurately identified, and our ablation study demonstrates that the proposed model efficiently detects these effects, even amidst complex feature interactions. To demonstrate the model's predictive capabilities, we compare it to several interpretable as well as black-box models and find that it can match black-box performances while maintaining intelligibility. The source code is available at https://github.com/OpenTabular/NAMpy.
Chinese: 所提出的表格Transformer网络改进方案通过准确识别边际特征效应,在保持与黑盒模型相当预测性能的同时,弥合了可解释性与预测能力之间的鸿沟。
English: The proposed adaptation of tabular transformer networks bridges the gap between predictive performance and intelligibility by accurately identifying marginal feature effects while matching black-box model capabilities.
Authors:Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Abstract:
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face problems of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external assistance, Genius seeks the optimal response sequence in a stepwise manner and optimizes the LLM accordingly. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
中文总结:Genius是一种无监督自训练框架,通过逐步前瞻重采样和优势校准优化技术,无需外部监督即可提升大语言模型的推理能力,有效解决了扩展性和标注成本问题。
English Summary: Genius is an unsupervised self-training framework that enhances LLM reasoning through stepwise foresight re-sampling and advantage-calibrated optimization, eliminating the need for external supervision while improving scalability.
Authors:Renu Sharma, Debasmita Pal, Arun Ross
Abstract:
One of the major challenges in machine learning is maintaining the accuracy of the deployed model (e.g., a classifier) in a non-stationary environment. The non-stationary environment results in distribution shifts and, consequently, a degradation in accuracy. Continuous learning of the deployed model with new data could be one remedy. However, the question arises as to how we should update the model with new training data so that it retains its accuracy on the old data while adapting to the new data. In this work, we propose a task-conditioned ensemble of models to maintain the performance of the existing model. The method involves an ensemble of expert models based on task membership information. The in-domain models, based on the local outlier concept (distinct from the expert models), provide task membership information dynamically at run time for each probe sample. To evaluate the proposed method, we experiment with three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distribution between tasks (Split MNIST). The experiments highlight the benefits of the proposed method. The source code is available at https://github.com/iPRoBe-lab/Continuous_Learning_FE_DM.
中文: 本研究针对非稳态环境中模型精度下降的问题,提出了一种基于任务条件的集成方法,通过局部离群值概念动态组合专家模型,使模型在适应新数据的同时保持对原有数据的性能。
English: This work addresses the challenge of maintaining model accuracy in non-stationary environments by proposing a task-conditioned ensemble method that dynamically combines expert models using local outlier concepts to adapt to new data while preserving performance on old data.
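A conceptual sketch of task-conditioned routing follows: a per-task novelty detector based on the local outlier factor scores each probe, and the probe is handed to the expert whose training data it resembles most. The synthetic tasks, the logistic-regression experts, and the LOF settings are assumptions for illustration, not the repository's pipeline.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two synthetic "tasks" with shifted feature distributions; one expert classifier per task.
tasks = {
    "task_a": (rng.normal(0.0, 1.0, size=(200, 4)), rng.integers(0, 2, 200)),
    "task_b": (rng.normal(4.0, 1.0, size=(200, 4)), rng.integers(0, 2, 200)),
}

experts, detectors = {}, {}
for name, (X, y) in tasks.items():
    experts[name] = LogisticRegression(max_iter=200).fit(X, y)
    # novelty=True lets the in-domain detector score unseen probes at run time
    detectors[name] = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)

def route_and_predict(x_probe: np.ndarray):
    """Pick the expert whose training data the probe looks least like an outlier of."""
    scores = {name: det.score_samples(x_probe[None, :])[0] for name, det in detectors.items()}
    best = max(scores, key=scores.get)          # higher score = more in-domain
    return best, experts[best].predict(x_probe[None, :])[0]

if __name__ == "__main__":
    print(route_and_predict(rng.normal(4.0, 1.0, size=4)))  # routed to task_b's expert
```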
Authors:Tao Zhang, Zhenhai Liu, Yong Xin, Yongjun Jiao
Abstract:
The Finite Element Method (FEM) is widely used in engineering and scientific computing, but its pre-processing, solver configuration, and post-processing stages are often time-consuming and require specialized knowledge. This paper proposes an automated solution framework, MooseAgent, for the multi-physics simulation framework MOOSE, which combines large-scale pre-trained language models (LLMs) with a multi-agent system. The framework uses LLMs to understand user-described simulation requirements in natural language and employs task decomposition and multi-round iterative verification strategies to automatically generate MOOSE input files. To improve accuracy and reduce model hallucinations, the system builds and utilizes a vector database containing annotated MOOSE input cards and function documentation. We conducted experimental evaluations on several typical cases, including heat transfer, mechanics, phase field, and multi-physics coupling. The results show that MooseAgent can automate the MOOSE simulation process to a certain extent, especially demonstrating a high success rate when dealing with relatively simple single-physics problems. The main contribution of this research is the proposal of a multi-agent automated framework for MOOSE, which validates its potential in simplifying finite element simulation processes and lowering the user barrier, providing new ideas for the development of intelligent finite element simulation software. The code for the MooseAgent framework proposed in this paper has been open-sourced and is available at https://github.com/taozhan18/MooseAgent
中文: 本文提出MooseAgent自动化框架,通过结合大语言模型与多智能体系统,能够理解自然语言描述的仿真需求并自动生成MOOSE输入文件,有效降低了有限元模拟的使用门槛。
English: This paper introduces MooseAgent, an automated framework for the MOOSE multi-physics simulation platform that leverages large language models and multi-agent systems to interpret natural language requirements and generate simulation input files, demonstrating effectiveness in simplifying finite element analysis workflows.
Authors:Gesina Schwalbe, Georgii Mikriukov, Edgar Heinert, Stavros Gerolymatos, Mert Keser, Alois Knoll, Matthias Rottmann, Annika Mütze
Abstract:
The thriving research field of concept-based explainable artificial intelligence (C-XAI) investigates how human-interpretable semantic concepts embed in the latent spaces of deep neural networks (DNNs). Post-hoc approaches therein use a set of examples to specify a concept, and determine its embeddings in DNN latent space using data driven techniques. This proved useful to uncover biases between different target (foreground or concept) classes. However, given that the background is mostly uncontrolled during training, an important question has been left unattended so far: Are/to what extent are state-of-the-art, data-driven post-hoc C-XAI approaches themselves prone to biases with respect to their backgrounds? E.g., wild animals mostly occur against vegetation backgrounds, and they seldom appear on roads. Even simple and robust C-XAI methods might abuse this shortcut for enhanced performance. A dangerous performance degradation of the concept-corner cases of animals on the road could thus remain undiscovered. This work validates and thoroughly confirms that established Net2Vec-based concept segmentation techniques frequently capture background biases, including alarming ones, such as underperformance on road scenes. For the analysis, we compare 3 established techniques from the domain of background randomization on >50 concepts from 2 datasets, and 7 diverse DNN architectures. Our results indicate that even low-cost setups can provide both valuable insight and improved background robustness.
Chinese: 基于概念的可解释人工智能方法常从训练数据中无意识地学习背景偏差,导致物体出现在非典型背景场景时性能不可靠,这一点已通过跨多个数据集和神经网络架构的系统性测试得到验证。
English: Concept-based explainable AI methods often unintentionally learn background biases from training data, leading to unreliable performance in scenarios where objects appear against atypical backgrounds, as demonstrated through systematic testing across multiple datasets and neural network architectures.
Authors:Eleanor Wallach, Sage Siler, Jing Deng
Abstract:
Federated Learning (FL) has been introduced as a way to keep data local to clients while training a shared machine learning model, as clients train on their local data and send trained models to a central aggregator. FL is expected to have huge implications for Mobile Edge Computing, the Internet of Things, and Cross-Silo FL. In this paper, we focus on the widely used FedAvg algorithm to explore the effect of the number of clients in FL. We find a significant deterioration of learning accuracy for FedAvg as the number of clients increases. To address this issue for a general application, we propose a method called Knowledgeable Client Insertion (KCI) that introduces a very small number of knowledgeable clients to the MEC setting. These knowledgeable clients are expected to have accumulated a large set of data samples to help with training. With the help of KCI, the learning accuracy of FL increases much faster even with the standard FedAvg aggregation technique. We expect this approach to provide strong privacy protection for clients against security attacks such as model inversion attacks. Our code is available at https://github.com/Eleanor-W/KCI_for_FL.
中文: 联邦学习(FL)通过本地数据训练共享模型,但常用的FedAvg算法在客户端数量增加时精度显著下降,为此提出的知识型客户端插入(KCI)方法通过引入少量知识型客户端,有效提升了学习速度并增强了隐私保护。
English: Federated Learning (FL) trains a shared model by keeping data local, but the widely used FedAvg algorithm suffers from accuracy decline as client numbers grow, which is addressed by the proposed Knowledgeable Client Insertion (KCI) method that enhances learning speed and privacy protection.
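The mechanism is easy to sketch: plain FedAvg weight averaging over a client pool into which a few data-rich "knowledgeable" clients have been inserted. The toy model, random data, client counts, and uniform averaging below are assumptions for illustration, not the paper's experimental setup.

```python
import copy
import torch
import torch.nn as nn

def local_train(model: nn.Module, data, epochs: int = 1, lr: float = 0.05) -> dict:
    """One client's local update; returns its trained state_dict."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(local(x), y).backward()
            opt.step()
    return local.state_dict()

def fedavg(states: list) -> dict:
    """Uniform FedAvg aggregation of client state_dicts."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    return avg

if __name__ == "__main__":
    torch.manual_seed(0)
    global_model = nn.Linear(10, 2)
    make_batches = lambda n: [(torch.randn(16, 10), torch.randint(0, 2, (16,))) for _ in range(n)]
    # Ordinary clients hold little data; KCI adds a few data-rich "knowledgeable" clients.
    ordinary = [make_batches(1) for _ in range(20)]
    knowledgeable = [make_batches(10) for _ in range(2)]      # the KCI insertion
    for rnd in range(3):
        states = [local_train(global_model, d) for d in ordinary + knowledgeable]
        global_model.load_state_dict(fedavg(states))
    print("finished", rnd + 1, "rounds of FedAvg with knowledgeable clients")
```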
Authors:Lucian Chauvin, Somil Gupta, Angelina Ibarra, Joshua Peeples
Abstract:
Anomaly detection is a key research challenge in computer vision and machine learning with applications in many fields from quality control to radar imaging. In radar imaging, specifically synthetic aperture radar (SAR), anomaly detection can be used for the classification, detection, and segmentation of objects of interest. However, there has been no standard framework for developing and benchmarking these methods on SAR imagery. To address this issue, we introduce SAR imagery anomaly detection (SARIAD). In conjunction with Anomalib, a deep-learning library for anomaly detection, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection approaches on SAR imagery. SARIAD specifically integrates multiple SAR datasets along with tools to effectively apply various anomaly detection algorithms to SAR imagery. Several anomaly detection metrics and visualizations are available. Overall, SARIAD acts as a central package for benchmarking SAR models and datasets to allow for reproducible research in the field of anomaly detection in SAR imagery. This package is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/SARIAD.
中文: SARIAD是一个综合性工具包,集成了合成孔径雷达(SAR)图像异常检测的数据集与算法,为该领域的方法开发、性能评估及可重复研究提供了统一平台。
English: SARIAD is a comprehensive toolkit that integrates datasets and algorithms for developing and benchmarking anomaly detection methods in synthetic aperture radar (SAR) imagery, enabling reproducible research in the field.
Authors:Ingryd V. S. T. Pereira, George D. C. Cavalcanti, Rafael M. O. Cruz
Abstract:
Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project's repository: https://github.com/ingrydpereira/multiview-fake-news.
Chinese: 本文提出了一种多视图自编码器方法,通过整合多种特征提取技术来提升虚假新闻检测效果,实验表明选择性融合互补文本特征可显著提高分类性能并优化计算效率。
English: This paper proposes a multi-view autoencoder approach that integrates multiple feature extraction techniques to enhance fake news detection, achieving improved classification performance and efficiency by selectively combining complementary textual features.
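A minimal multi-view autoencoder sketch follows: each textual view (for example TF-IDF, contextual embeddings, or linguistic features) gets its own encoder and decoder, and a shared latent fuses the views into the joint representation used downstream. The view dimensions, layer sizes, and fusion layer are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiViewAE(nn.Module):
    """Sketch of a multi-view autoencoder with a shared latent space (illustrative dimensions)."""

    def __init__(self, view_dims, latent_dim: int = 32):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, latent_dim) for d in view_dims)
        self.fuse = nn.Linear(latent_dim * len(view_dims), latent_dim)
        self.decoders = nn.ModuleList(nn.Linear(latent_dim, d) for d in view_dims)

    def forward(self, views):
        encoded = [torch.relu(enc(v)) for enc, v in zip(self.encoders, views)]
        z = self.fuse(torch.cat(encoded, dim=-1))                 # joint representation
        recons = [dec(z) for dec in self.decoders]
        return z, recons

if __name__ == "__main__":
    model = MultiViewAE(view_dims=[300, 768, 50])
    views = [torch.randn(8, d) for d in (300, 768, 50)]
    z, recons = model(views)
    loss = sum(nn.functional.mse_loss(r, v) for r, v in zip(recons, views))
    print(z.shape, float(loss))   # the joint latent z would feed a downstream fake-news classifier
```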
Authors:Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, David Ha
Abstract:
AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.
中文: AI科学家-v2系统首次实现了完全由人工智能生成且通过同行评审的学术论文,标志着人工智能已具备自主开展完整科学研究的能力。
English: The AI Scientist-v2 is an autonomous system that successfully produced the first fully AI-generated peer-review-accepted scientific paper, demonstrating AI's growing capability to conduct end-to-end research without human intervention.
Authors:Tony Shen, Seonghwan Seo, Ross Irwin, Kieran Didi, Simon Olsson, Woo Youn Kim, Martin Ester
Abstract:
Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity on all 15 targets from the LIT-PCBA benchmark, and a 5.8$\times$ improvement in sampling efficiency compared to a 2D synthesis-based baseline. To the best of our knowledge, our method is also the first to achieve state-of-the-art performance in both Vina Dock (-9.38) and AiZynth success rate (62.2\%) on the CrossDocked benchmark.
中文: CGFlow是一种新颖框架,通过扩展流匹配技术实现连续特征组合对象的逐步生成与奖励引导采样,在可合成药物设计中达到顶尖性能并显著提升采样效率。
English: CGFlow is a novel framework that extends flow matching to generate compositional objects with continuous features through step-by-step construction and reward-guided sampling, achieving state-of-the-art performance in synthesizable drug design with significantly improved efficiency.
Authors:Xuan-Hao Liu, Bao-Liang Lu, Wei-Long Zheng
Abstract:
Cross-subject electroencephalography (EEG) classification poses great challenges due to the diversity of cognitive processes and physiological structures across subjects. Modern EEG models are based on neural networks, demanding a large amount of data to achieve high performance and generalizability. However, privacy concerns associated with EEG pose significant limitations to data sharing between different hospitals and institutions, resulting in a lack of large datasets for most EEG tasks. Federated learning (FL) enables multiple decentralized clients to collaboratively train a global model without direct communication of raw data, thus preserving privacy. For the first time, we investigate cross-subject EEG classification in the FL setting. In this paper, we propose a simple yet effective framework termed mixEEG. Specifically, we tailor the vanilla mixup considering the unique properties of the EEG modality. mixEEG shares the unlabeled averaged data of the unseen subject rather than simply sharing raw data under the domain adaptation setting, thus better preserving privacy and offering an averaged label as pseudo-label. Extensive experiments are conducted on an epilepsy detection and an emotion recognition dataset. The experimental results demonstrate that our mixEEG enhances the transferability of the global model for cross-subject EEG classification consistently across different datasets and model architectures. Code is published at: https://github.com/XuanhaoLiu/mixEEG.
中文:提出的mixEEG框架通过共享平均未标记数据而非原始数据,在联邦学习中解决了跨被试脑电分类的难题,在不同数据集和架构下增强模型可迁移性的同时有效保护了隐私。
English: The proposed mixEEG framework addresses cross-subject EEG classification challenges in federated learning by sharing averaged unlabeled data instead of raw data, enhancing model transferability while preserving privacy across different datasets and architectures.
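The core sharing step can be sketched as follows: small groups of EEG trials from the unseen subject are averaged (mixed), and their soft labels are averaged into pseudo-labels, so only the mixed data ever leaves the client. The group size, seed, and tensor shapes are assumptions for illustration, not the paper's exact mixup recipe.

```python
import numpy as np

def mix_eeg_trials(trials: np.ndarray, soft_labels: np.ndarray, group_size: int = 4, seed: int = 0):
    """Average small groups of EEG trials (and their soft labels) before sharing.

    Only the mixed averages leave the client, which is the privacy-preserving
    sharing step sketched from the abstract; group_size is an assumption.
    trials: (n_trials, channels, timesteps), soft_labels: (n_trials, n_classes)
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(trials))
    mixed_x, mixed_y = [], []
    for start in range(0, len(order) - group_size + 1, group_size):
        idx = order[start:start + group_size]
        mixed_x.append(trials[idx].mean(axis=0))
        mixed_y.append(soft_labels[idx].mean(axis=0))      # averaged pseudo-label
    return np.stack(mixed_x), np.stack(mixed_y)

if __name__ == "__main__":
    trials = np.random.randn(32, 22, 256)                   # 32 trials, 22 channels, 256 samples
    labels = np.eye(2)[np.random.randint(0, 2, 32)]
    shared_x, shared_y = mix_eeg_trials(trials, labels)
    print(shared_x.shape, shared_y.shape)                   # (8, 22, 256) (8, 2)
```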
Authors:Hamidreza Eivazi, Jendrik-Alexander Tröger, Stefan Wittek, Stefan Hartmann, Andreas Rausch
Abstract:
Multiscale problems are ubiquitous in physics. Numerical simulations of such problems by solving partial differential equations (PDEs) at high resolution are computationally too expensive for many-query scenarios, e.g., uncertainty quantification, remeshing applications, topology optimization, and so forth. This limitation has motivated the application of data-driven surrogate models, where the microscale computations are $\textit{substituted}$ with a surrogate, usually acting as a black-box mapping between macroscale quantities. These models offer significant speedups but struggle with incorporating microscale physical constraints, such as the balance of linear momentum and constitutive models. In this contribution, we propose Equilibrium Neural Operator (EquiNO) as a $\textit{complementary}$ physics-informed PDE surrogate for predicting microscale physics and compare it with variational physics-informed neural and operator networks. Our framework, applicable to the so-called multiscale FE$^{\,2}\,$ computations, introduces the FE-OL approach by integrating the finite element (FE) method with operator learning (OL). We apply the proposed FE-OL approach to quasi-static problems of solid mechanics. The results demonstrate that FE-OL can yield accurate solutions even when confronted with a restricted dataset during model development. Our results show that EquiNO achieves speedup factors exceeding 8000-fold compared to traditional methods and offers an optimal balance between data-driven and physics-based strategies.
Chinese: 本文提出平衡神经算子(EquiNO),这是一种融合有限元方法与算子学习的物理信息代理模型,用于高效预测多尺度问题中的微观物理现象,在有限数据下仍保持精度,同时实现超过8000倍的加速效果。
English: This paper introduces the Equilibrium Neural Operator (EquiNO), a physics-informed surrogate model that integrates the finite element method with operator learning to efficiently predict microscale physics in multiscale problems, achieving over 8000-fold speedup while maintaining accuracy with limited data.
Authors:Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou
Abstract:
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
中文: 动态备忘单(DC)是一种轻量级框架,为黑盒语言模型赋予持久记忆能力,使其能在推理时存储和复用解题策略,无需真实标签或人工反馈即可显著提升各类任务的表现。
English: Dynamic Cheatsheet (DC) is a lightweight framework that equips black-box language models with persistent memory, enabling them to store and reuse problem-solving insights at inference time, which substantially enhances performance across various tasks without requiring ground-truth labels or human feedback.
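A minimal sketch of the test-time loop the abstract describes: answer each query with the current cheatsheet in context, then let the model curate what is worth keeping. The `call_llm` interface, prompt wording, and memory cap are placeholders, not the authors' implementation.

```python
from typing import Callable, List

def solve_with_cheatsheet(queries, call_llm: Callable[[str], str], max_entries: int = 20):
    """Test-time memory loop: answer with the cheatsheet in context, then ask the
    model to extract a concise, reusable snippet. Prompts and the call_llm
    interface are illustrative placeholders."""
    cheatsheet: List[str] = []
    answers = []
    for q in queries:
        memory = "\n".join(cheatsheet) or "(empty)"
        answer = call_llm(f"Cheatsheet:\n{memory}\n\nQuestion: {q}\nAnswer:")
        answers.append(answer)
        # Self-curation: keep concise, transferable snippets rather than transcripts.
        snippet = call_llm(
            "Extract one short, reusable insight or code snippet from this "
            f"solution, or reply NONE.\nQuestion: {q}\nSolution: {answer}"
        )
        if snippet.strip().upper() != "NONE":
            cheatsheet.append(snippet.strip())
            cheatsheet = cheatsheet[-max_entries:]  # bound memory size
    return answers, cheatsheet
```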
Authors:Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali
Abstract:
Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of reasoning benchmarks, SpecReason achieves $1.4-3.0\times$ speedup over vanilla LRM inference while improving accuracy by $0.4-9.0\%$. Compared to speculative decoding without SpecReason, their combination yields an additional $8.8-58.0\%$ latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.
中文摘要:SpecReason系统通过使用轻量模型执行中间推理步骤、基础模型进行验证,有效加速大型推理模型,在提升速度的同时提高了准确性。
English Summary: SpecReason is a system that accelerates Large Reasoning Models by using a lightweight model for intermediate reasoning steps and the base model for verification, achieving significant speedup and improved accuracy.
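A minimal sketch of step-level speculation as the abstract describes it: a lightweight model drafts each intermediate reasoning step, and the base model only judges the draft (and rewrites it when rejected). The three callables and the stop signal are illustrative placeholders for model endpoints, not the released SpecReason system.

```python
def spec_reason(question, small_step, base_judge, base_step, max_steps=16):
    """Step-level speculation: the lightweight model drafts each reasoning step,
    the base model only assesses (and, if needed, replaces) it. The callables
    small_step, base_judge, and base_step are placeholder model interfaces."""
    steps = []
    for _ in range(max_steps):
        draft = small_step(question, steps)           # cheap speculative step
        verdict = base_judge(question, steps, draft)  # e.g. "accept" / "reject"
        step = draft if verdict == "accept" else base_step(question, steps)
        steps.append(step)
        if "FINAL ANSWER" in step:                    # illustrative stop signal
            break
    return steps
```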
Authors:Erin Carson, Xinye Chen
Abstract:
Motivated by the growing demand for low-precision arithmetic in computational science, we exploit lower-precision emulation in Python -- widely regarded as the dominant programming language for numerical analysis and machine learning. Low-precision training has revolutionized deep learning by enabling more efficient computation and reduced memory and energy consumption while maintaining model fidelity. To better enable numerical experimentation with and exploration of low precision computation, we developed the Pychop library, which supports customizable floating-point formats and a comprehensive set of rounding modes in Python, allowing users to benefit from fast, low-precision emulation in numerous applications. Pychop also introduces interfaces for both PyTorch and JAX, enabling efficient low-precision emulation on GPUs for neural network training and inference with unparalleled flexibility.
In this paper, we offer a comprehensive exposition of the design, implementation, validation, and practical application of Pychop, establishing it as a foundational tool for advancing efficient mixed-precision algorithms. Furthermore, we present empirical results on low-precision emulation for image classification and object detection using published datasets, illustrating the sensitivity of the use of low precision and offering valuable insights into its impact. Pychop enables in-depth investigations into the effects of numerical precision, facilitates the development of novel hardware accelerators, and integrates seamlessly into existing deep learning workflows. Software and experimental code are publicly available at https://github.com/inEXASCALE/pychop.
中文: Pychop库在Python中实现了可定制的低精度算术仿真,支持PyTorch和JAX的GPU加速,为神经网络训练和硬件加速器开发提供了高效灵活的解决方案,同时保持模型精度。
English: The Pychop library enables flexible low-precision arithmetic emulation in Python with GPU support for PyTorch and JAX, facilitating efficient neural network training and hardware accelerator development while maintaining model accuracy.
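To make the idea of low-precision emulation concrete, here is a generic sketch of significand rounding in double precision: values are rounded to a target number of significand bits while exponent range is left unlimited. This is an illustration of the emulation principle only, not Pychop's API or rounding-mode options.

```python
import numpy as np

def chop_to_precision(x: np.ndarray, t: int = 10) -> np.ndarray:
    """Emulate round-to-nearest into a floating-point format with t significand
    bits (t=10 roughly mimics fp16's precision) while computing in fp64.
    Only precision is emulated here; exponent range is not limited."""
    x = np.asarray(x, dtype=np.float64)
    mantissa, exponent = np.frexp(x)        # x = mantissa * 2**exponent, 0.5 <= |m| < 1
    scale = 2.0 ** t
    rounded = np.round(mantissa * scale) / scale
    return np.ldexp(rounded, exponent)

x = np.array([1 / 3, 0.1, 123.456])
print(chop_to_precision(x, t=10))  # values rounded to roughly fp16-level precision
```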
Authors:Yifan Ding, Arturas Aleksandraus, Amirhossein Ahmadian, Jonas Unger, Fredrik Lindsten, Gabriel Eilertsen
Abstract:
Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning systems, particularly in safety-critical applications. Likelihood-based deep generative models have historically faced criticism for their unsatisfactory performance in OOD detection, often assigning higher likelihood to OOD data than in-distribution samples when applied to image data. In this work, we demonstrate that likelihood is not inherently flawed. Rather, several properties of the image space prevent likelihood from serving as a valid detection score. Given a sufficiently good likelihood estimator, specifically using the probability flow formulation of a diffusion model, we show that likelihood-based methods can still perform on par with state-of-the-art methods when applied in the representation space of pre-trained encoders. The code of our work can be found at $\href{https://github.com/limchaos/Likelihood-OOD.git}{\texttt{https://github.com/limchaos/Likelihood-OOD.git}}$.
Chinese: 本研究证明,通过扩散模型在预训练编码器的表征空间中使用基于似然的方法,能够实现最先进的分布外检测性能,从而挑战了先前对似然方法的批评。
English: This study demonstrates that likelihood-based methods can achieve state-of-the-art out-of-distribution detection performance when applied in the representation space of pre-trained encoders using diffusion models, challenging previous criticisms of likelihood approaches.
Authors:Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André F. T. Martins, Graham Neubig
Abstract:
Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more ``pragmatic'' approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa
中文: 现有机器翻译自动评估指标难以衡量跨句子的意义保留,因此我们提出TREQA框架,通过测试翻译文本对原文关键信息的阅读理解问题回答准确性来评估质量,在复杂领域表现优异且提供可解释性。
English: Current automatic metrics for machine translation evaluation often fail to assess meaning preservation beyond individual sentences, prompting the introduction of TREQA, a pragmatic framework that evaluates translations by testing how well they answer comprehension questions about key information in the source text, showing competitive performance and enhanced interpretability in complex domains.
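A minimal sketch of the extrinsic, QA-based scoring idea: derive comprehension questions that target key information, answer them using only the candidate translation, and report the fraction answered correctly. The three callables are placeholders for LLM calls; this is not the released TREQA pipeline.

```python
def qa_based_translation_score(source, reference, candidate, gen_questions, answer, grade):
    """Extrinsic evaluation in the spirit of TREQA: question generation, answering
    from the candidate translation, and grading are delegated to placeholder
    callables (e.g. LLM calls). Returns the fraction of questions answered
    correctly, used to rank alternative translations."""
    qa_pairs = gen_questions(reference or source)        # [(question, gold_answer), ...]
    if not qa_pairs:
        return 0.0
    correct = 0
    for question, gold in qa_pairs:
        predicted = answer(question, candidate)          # answer using only the candidate
        correct += int(grade(question, gold, predicted)) # 1 if meaning-equivalent
    return correct / len(qa_pairs)
```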
Authors:Moritz Rempe, Fabian Hörst, Helmut Becker, Marco Schlimbach, Lukas Rotkopf, Kevin Kröninger, Jens Kleesiek
Abstract:
Magnetic resonance imaging (MRI) raw data, or k-Space data, is complex-valued, containing both magnitude and phase information. However, clinical and existing Artificial Intelligence (AI)-based methods focus only on magnitude images, discarding the phase data despite its potential for downstream tasks, such as tumor segmentation and classification. In this work, we introduce $\textit{PhaseGen}$, a novel complex-valued diffusion model for generating synthetic MRI raw data conditioned on magnitude images, commonly used in clinical practice. This enables the creation of artificial complex-valued raw data, allowing pretraining for models that require k-Space information. We evaluate PhaseGen on two tasks: skull-stripping directly in k-Space and MRI reconstruction using the publicly available FastMRI dataset. Our results show that training with synthetic phase data significantly improves generalization for skull-stripping on real-world data, increasing segmentation accuracy from $41.1\%$ to $80.1\%$, and enhances MRI reconstruction when combined with limited real-world data. This work presents a step forward in utilizing generative AI to bridge the gap between magnitude-based datasets and the complex-valued nature of MRI raw data. This approach allows researchers to leverage the vast amount of available image domain data in combination with the information-rich k-Space data for more accurate and efficient diagnostic tasks. We make our code publicly $\href{https://github.com/TIO-IKIM/PhaseGen}{\text{available here}}$.
中文: 本研究提出PhaseGen这一复数值扩散模型,能从临床常用的磁共振幅度图像生成合成k空间数据,通过弥合图像数据与复数值原始数据之间的鸿沟,显著提升了颅骨剥离和图像重建等任务的性能。
English: This study introduces PhaseGen, a complex-valued diffusion model that generates synthetic MRI k-space data from magnitude images, enhancing tasks like skull-stripping and MRI reconstruction by bridging the gap between clinical image data and the full potential of complex-valued raw data.
Authors:Erdenebileg Batbaatar, Jeonggeol Kim, Yongcheol Kim, Young Yoon
Abstract:
In this paper, we introduce Traversal Learning (TL), a novel approach designed to address the problem of decreased quality encountered in popular distributed learning (DL) paradigms such as Federated Learning (FL), Split Learning (SL), and SplitFed Learning (SFL). Traditional FL suffers an accuracy drop during aggregation due to its averaging function, while SL and SFL face increased loss due to the independent gradient updates on each split network. TL adopts a unique strategy where the model traverses the nodes during forward propagation (FP) and performs backward propagation (BP) on the orchestrator, effectively implementing centralized learning (CL) principles within a distributed environment. The orchestrator is tasked with generating virtual batches and planning the sequential node visits of the model during FP, aligning them with the ordered index of the data within these batches. We conducted experiments on six datasets representing diverse characteristics across various domains. Our evaluation demonstrates that TL is on par with classic CL approaches in terms of accurate inference, thereby offering a viable and robust solution for DL tasks. TL outperformed other DL methods and improved accuracy by 7.85% for independent and identically distributed (IID) datasets, macro F1-score by 1.06% for non-IID datasets, accuracy by 2.60% for text classification, and AUC by 3.88% and 4.54% for medical and financial datasets, respectively. By effectively preserving data privacy while maintaining performance, TL represents a significant advancement in DL methodologies. The implementation of TL is available at https://github.com/neouly-inc/Traversal-Learning
中文: 遍历学习(TL)是一种新型分布式学习方法,通过实施集中式学习原则克服了现有范式的性能下降问题,在多种数据集上实现了卓越的准确性和隐私保护。
English: Traversal Learning (TL) is a novel distributed learning method that overcomes performance declines in existing paradigms by implementing centralized learning principles, achieving superior accuracy and privacy across diverse datasets.
Authors:Juzheng Zhang, Jiacheng You, Ashwinee Panda, Tom Goldstein
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), yet it still incurs notable overhead and suffers from parameter interference in multi-task scenarios. We propose LoRA with Reduced Interference (LoRI), a simple yet effective approach that freezes the projection matrices $A$ as random projections and sparsifies the matrices $B$ using task-specific masks. This design substantially reduces the number of trainable parameters while maintaining strong task performance. Moreover, LoRI minimizes cross-task interference in adapter merging by leveraging the orthogonality between adapter subspaces, and supports continual learning by using sparsity to mitigate catastrophic forgetting. Extensive experiments across natural language understanding, mathematical reasoning, code generation, and safety alignment tasks demonstrate that LoRI outperforms full fine-tuning and existing PEFT methods, while using up to 95% fewer trainable parameters than LoRA. In multi-task experiments, LoRI enables effective adapter merging and continual learning with reduced cross-task interference. Code is available at: https://github.com/juzhengz/LoRI
中文: LoRI是一种改进的微调方法,通过冻结随机投影矩阵并应用任务特定稀疏化,在保持优异性能的同时大幅减少可训练参数,相比LoRA减少高达95%参数且有效降低跨任务干扰。
English: LoRI is an enhanced fine-tuning method that freezes random projection matrices and applies task-specific sparsity to significantly reduce trainable parameters while outperforming full fine-tuning and existing PEFT methods, with up to 95% fewer parameters than LoRA and reduced cross-task interference.
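A minimal PyTorch sketch of the adapter design the abstract describes: the down-projection $A$ is frozen as a random projection and only a sparsely masked up-projection $B$ is trained. The magnitude-based mask heuristic and scaling are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class LoRILinear(nn.Module):
    """Sketch of a LoRI-style adapter on a frozen linear layer: frozen random A,
    trainable B with a task-specific sparsity mask. Mask construction here is a
    simple magnitude heuristic chosen for illustration."""

    def __init__(self, base: nn.Linear, r: int = 8, keep_ratio: float = 0.1):
        super().__init__()
        self.base = base.requires_grad_(False)
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5, requires_grad=False)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.register_buffer("mask", torch.ones(d_out, r))
        self.keep_ratio = keep_ratio

    def set_task_mask(self):
        """Keep only the largest-magnitude entries of B for the current task."""
        k = max(1, int(self.keep_ratio * self.B.numel()))
        threshold = self.B.abs().flatten().kthvalue(self.B.numel() - k + 1).values
        self.mask.copy_((self.B.abs() >= threshold).float())

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ (self.B * self.mask).t()

layer = LoRILinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```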
Authors:Dongqi Fu, Yada Zhu, Zhining Liu, Lecheng Zheng, Xiao Lin, Zihao Li, Liri Fang, Katherine Tieu, Onkar Bhardwaj, Kommy Weldemariam, Hanghang Tong, Hendrik Hamann, Jingrui He
Abstract:
Climate science studies the structure and dynamics of Earth's climate system and seeks to understand how climate changes over time, where the data is usually stored in the format of time series, recording the climate features, geolocation, time attributes, etc. Recently, much research attention has been paid to the climate benchmarks. In addition to the most common task of weather forecasting, several pioneering benchmark works are proposed for extending the modality, such as domain-specific applications like tropical cyclone intensity prediction and flash flood damage estimation, or climate statement and confidence level in the format of natural language. To further motivate the artificial general intelligence development for climate science, in this paper, we first contribute a multi-modal climate benchmark, i.e., ClimateBench-M, which aligns (1) the time series climate data from ERA5, (2) extreme weather events data from NOAA, and (3) satellite image data from NASA HLS based on a unified spatial-temporal granularity. Second, under each data modality, we also propose a simple but strong generative method that could produce competitive performance in weather forecasting, thunderstorm alerts, and crop segmentation tasks in the proposed ClimateBench-M. The data and code of ClimateBench-M are publicly available at https://github.com/iDEA-iSAIL-Lab-UIUC/ClimateBench-M.
中文摘要:气候科学研究地球气候系统的结构与动态及其随时间的变化,本文提出了ClimateBench-M多模态基准,整合时间序列、极端天气和卫星数据,以推动气候人工智能应用的发展。
English Summary: Climate science examines Earth's climate system and its changes over time, with this paper introducing ClimateBench-M, a multi-modal benchmark integrating time series, extreme weather, and satellite data to advance climate AI applications.
Authors:Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2 and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
Chinese: TaCQ是一种新颖的混合精度训练后量化方法,通过保留任务关键权重为16位精度同时量化其余权重,在低比特设置下以最小内存开销实现卓越的性能恢复。
English: TaCQ is a novel mixed-precision post-training quantization method that preserves task-critical weights at 16-bit precision while quantizing others, achieving superior performance recovery in low-bit settings with minimal memory overhead.
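A simplified reading of the saliency criterion described above, written as a sketch: the expected impact of quantization on the task loss is approximated by the gradient times the weight change induced by quantization, and the most salient weights stay at full precision. The keep ratio and the toy 3-bit quantizer are illustrative choices, not the paper's configuration.

```python
import torch

def task_saliency(weight, grad, quantized_weight):
    """First-order estimate of loss change from quantization: |grad * delta_w|."""
    return (grad * (quantized_weight - weight)).abs()

def mixed_precision_quantize(weight, grad, quantize_fn, keep_ratio=0.02):
    """Quantize everything except the top keep_ratio fraction of weights by
    saliency, which remain at full precision. quantize_fn is a placeholder
    for any uniform low-bit quantizer."""
    q = quantize_fn(weight)
    saliency = task_saliency(weight, grad, q)
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = saliency.flatten().kthvalue(weight.numel() - k + 1).values
    keep_mask = saliency >= threshold
    return torch.where(keep_mask, weight, q), keep_mask

# Toy usage with a 3-bit symmetric uniform quantizer as the placeholder.
w = torch.randn(256, 256)
g = torch.randn_like(w)
scale = w.abs().max() / 4
q3bit = lambda x: torch.clamp(torch.round(x / scale), -4, 3) * scale
w_mixed, kept = mixed_precision_quantize(w, g, q3bit)
print(kept.float().mean().item())  # roughly 0.02 of weights kept at full precision
```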
Authors:Donghao Ren, Fred Hohman, Dominik Moritz
Abstract:
Interactive visualization of embedding projections is a useful technique for understanding data and evaluating machine learning models. Labeling data within these visualizations is critical for interpretation, as labels provide an overview of the projection and guide user navigation. However, most methods for producing labels require clustering the points, which can be computationally expensive as the number of points grows. In this paper, we describe an efficient clustering approach using kernel density estimation in the projected 2D space instead of points. This algorithm can produce high-quality cluster regions from a 2D density map in a few hundred milliseconds, orders of magnitude faster than current approaches. We contribute the design of the algorithm, benchmarks, and applications that demonstrate the utility of the algorithm, including labeling and summarization.
Chinese: 本文提出了一种在二维投影空间中使用核密度估计的高效聚类方法,能够快速生成高质量聚类区域用于交互式可视化标注,其速度显著优于现有方法。
English: This paper introduces an efficient clustering method using kernel density estimation in 2D projection space, enabling rapid generation of high-quality cluster regions for interactive visualization labeling and significantly outperforming existing approaches in speed.
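A minimal sketch of the density-map idea: rasterize the projected points onto a grid, smooth with a Gaussian kernel (a cheap grid-based KDE), threshold the density, and take connected components as cluster regions. Grid size, bandwidth, and threshold are illustrative choices, not the paper's parameters.

```python
import numpy as np
from scipy import ndimage

def density_cluster_regions(points: np.ndarray, grid: int = 128,
                            bandwidth: float = 3.0, level: float = 0.2):
    """Cluster regions from a 2D projection via a gridded density estimate.
    Returns a label image, the number of regions, and the grid edges."""
    x, y = points[:, 0], points[:, 1]
    hist, xe, ye = np.histogram2d(x, y, bins=grid)
    density = ndimage.gaussian_filter(hist, sigma=bandwidth)   # smooth the counts
    density /= density.max()
    labels, n_regions = ndimage.label(density > level)         # connected components
    return labels, n_regions, (xe, ye)

pts = np.concatenate([np.random.randn(500, 2), np.random.randn(500, 2) + 6])
labels, n, _ = density_cluster_regions(pts)
print(n)  # typically 2 regions for this toy data
```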
Authors:Mingxuan Li, Hanchen Li, Chenhao Tan
Abstract:
Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine LLM's assigned scores on each decomposed dimension to acquire overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
中文摘要:HypoEval是一种创新框架,通过少量人工评估生成详细评分标准并采用清单式方法整合维度得分,仅用30个样本即可实现与人类评估的最佳对齐,同时提供可解释的自动化评测。
English Summary: HypoEval is a novel framework that enhances LLM-based evaluation by generating detailed rubrics from minimal human input and using a checklist approach to combine dimension scores, achieving superior alignment with human judgments using only 30 samples while providing interpretable reasoning.
Authors:Will LeVine, Bijan Varjavand
Abstract:
Modern Large Language Model (LLM) systems typically rely on Retrieval Augmented Generation (RAG) which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we show evaluations of existing RAG methods which account for both context relevance and answer quality. These evaluations introduce a novel finding that existing RAG systems scale poorly with inference-time compute usage when considering our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
中文: 现代RAG系统若仅优化上下文相关性会制约回答质量,而REBEL方法通过多标准推理优化,实现了检索相关性和答案质量的双重提升。
English: Modern RAG systems often focus solely on maximizing context relevance, which can limit response quality, but the new REBEL method improves both relevance and answer quality by optimizing multiple criteria during inference.
Authors:Alexander Rubinstein, Ameya Prabhu, Matthias Bethge, Seong Joon Oh
Abstract:
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. This approach underpins various aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. This achieves remarkable zero-shot performance on OOD object discovery benchmarks, is scalable to foundation models, and can handle a variable number of slots out-of-the-box. Hence, the goal of OCL methods to obtain object-centric representations has been largely achieved. Despite this progress, a key question remains: How does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide the toolbox for the OCL community to use scalable object-centric representations, and focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available here: https://github.com/AlexanderRubinstein/OCCAM.
Chinese: 物体中心学习(OCL)旨在从场景中分离物体以提升泛化能力和效率,提出的OCCAM方法通过基于分割的编码显著优于传统基于槽位的方法,同时解决了实际应用中的挑战。
English: Object-centric learning (OCL) aims to isolate objects from scenes for improved generalization and efficiency, with the proposed OCCAM method using segmentation-based encoding to outperform traditional slot-based approaches while addressing challenges in real-world applications.
Authors:Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
Abstract:
Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users' behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Our code and models are available at https://github.com/cassidylaidlaw/minecraft-building-assistance-game.
中文: 辅助博弈为训练AI助手提供了可扩展且有效的替代方案,通过AssistanceZero在《我的世界》等复杂环境中的卓越表现,减少了用户操作步骤并解决了欺骗性激励等问题。
English: Assistance games offer a scalable and effective alternative to RLHF for training AI assistants, as demonstrated by AssistanceZero's superior performance in complex environments like Minecraft, reducing user actions and addressing issues like deceptive incentives.
Authors:Tomohiro Hayase, Benoît Collins, Nakamasa Inoue
Abstract:
Hierarchical inductive biases are hypothesized to promote generalizable policies in reinforcement learning, as demonstrated by explicit hyperbolic latent representations and architectures. A more flexible approach, however, is to have these biases emerge naturally from the algorithm. We introduce Free Random Projection, an input mapping grounded in free probability theory that constructs random orthogonal matrices where hierarchical structure arises inherently. The free random projection integrates seamlessly into existing in-context reinforcement learning frameworks by encoding hierarchical organization within the input space without requiring explicit architectural modifications. Empirical results on multi-environment benchmarks show that free random projection consistently outperforms the standard random projection, leading to improvements in generalization. Furthermore, analyses within linearly solvable Markov decision processes and investigations of the spectrum of kernel random matrices reveal the theoretical underpinnings of free random projection's enhanced performance, highlighting its capacity for effective adaptation in hierarchically structured state spaces.
中文摘要:自由随机投影是一种通过随机正交矩阵自然生成层次结构的新方法,无需修改架构即可提升强化学习的泛化能力,其优势已通过实证和理论分析得到验证。
English Summary: Free Random Projection is a novel method that inherently develops hierarchical structures through random orthogonal matrices, enhancing reinforcement learning generalization without architectural changes, as validated by empirical and theoretical analyses.
Authors:Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville
Abstract:
The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs provably safe pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2-3$\times$ speedup) and a roughly 10% to 40% increase in end-to-end training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
中文摘要:针对遗忘变换器提出的自适应计算剪枝(ACP)方法通过基于遗忘门衰减动态剪枝可忽略计算,在保持性能不变的同时实现注意力计算2-3倍加速和10-40%训练吞吐量提升。
English Summary: The Adaptive Computation Pruning (ACP) method for the Forgetting Transformer dynamically prunes negligible computations based on forget gate decay, achieving 2-3× attention speedup and 10-40% training throughput gains without performance loss.
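A minimal sketch of the pruning criterion described above: with a forget gate per timestep, the decay applied when query $i$ attends to key $j$ is the sum of log forget-gate values over $(j, i]$, and pairs whose decay is negligible can be skipped. A fixed threshold is used here for illustration; the paper sets it dynamically to guarantee the pruned weights are negligible.

```python
import torch

def acp_keep_mask(log_forget: torch.Tensor, log_eps: float = -13.8) -> torch.Tensor:
    """Which (query, key) pairs still need to be computed for one forget-gate
    attention head. log_forget holds per-timestep log forget-gate values (<= 0)."""
    cum = torch.cumsum(log_forget, dim=-1)          # prefix sums of log gates
    decay = cum.unsqueeze(-1) - cum.unsqueeze(-2)   # decay[i, j] = cum[i] - cum[j]
    t = torch.arange(log_forget.size(-1))
    causal = t.unsqueeze(-1) >= t.unsqueeze(-2)     # only j <= i is valid
    return causal & (decay > log_eps)               # skip strongly decayed pairs

log_f = torch.log(torch.rand(64).clamp(0.2, 0.99))  # toy forget gates for 64 steps
mask = acp_keep_mask(log_f)
print(mask.float().mean().item())                   # fraction of pairs still computed
```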
Authors:Emmanuelle Bourigault, Amir Jamaludin, Abdullah Hamdi
Abstract:
In medical imaging, the primary challenge is collecting large-scale labeled data due to privacy concerns, logistics, and high labeling costs. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs, all based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (referred to as UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by demonstrating zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from similar domains (e.g., abdominal MRI). To further mitigate the effect of noisy labels, we propose a novel method called Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on the Swin-UNetr architecture, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumor challenge (with a 0.4% improvement) and the BTCV abdominal CT scan benchmark (with a 1.3% improvement). The pre-trained models and the code are available at https://emmanuelleb985.github.io/ukbob , and the filtered labels will be made available with the UK Biobank.
Authors:Jiawei Mao, Yuhan Wang, Yucheng Tang, Daguang Xu, Kang Wang, Yang Yang, Zongwei Zhou, Yuyin Zhou
Abstract:
This paper presents MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. It aims to serve as an unlimited data repository, supplying image-mask pairs to enhance existing segmentation tools. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.
Authors:Osama Ahmad, Zubair Khalid
Abstract:
This paper focuses on improving the robustness of spatiotemporal long-term prediction using a variational mode graph convolutional network (VMGCN) by introducing 3D channel attention. The deep learning network for this task relies on historical data inputs, yet real-time data can be corrupted by sensor noise, altering its distribution. We model this noise as independent and identically distributed (i.i.d.) Gaussian noise and incorporate it into the LargeST traffic volume dataset, resulting in data with both inherent and additive noise components. Our approach involves decomposing the corrupted signal into modes using variational mode decomposition, followed by feeding the data into a learning pipeline for prediction. We integrate a 3D attention mechanism encompassing spatial, temporal, and channel attention. The spatial and temporal attention modules learn their respective correlations, while the channel attention mechanism is used to suppress noise and highlight the significant modes in the spatiotemporal signals. Additionally, a learnable soft thresholding method is implemented to exclude unimportant modes from the feature vector, and a feature reduction method based on the signal-to-noise ratio (SNR) is applied. We compare the performance of our approach against baseline models, demonstrating that our method achieves superior long-term prediction accuracy, robustness to noise, and improved performance with mode truncation compared to the baseline models. The code of the paper is available at https://github.com/OsamaAhmad369/VMGCN.
中文: 本文通过引入三维通道注意力机制到变分模态图卷积网络中,将含噪信号分解为模态并利用注意力抑制噪声,结合可学习阈值优化特征,从而提升了时空长期预测的鲁棒性和准确性。
English: This paper introduces a 3D channel attention mechanism into a variational mode graph convolutional network (VMGCN) to enhance the robustness and accuracy of spatiotemporal long-term predictions by decomposing noisy signals into modes, applying attention mechanisms to suppress noise, and implementing learnable thresholding for feature optimization.
Authors:Ludvig Dillén, Per-Erik Forssén, Johan Edstedt
Abstract:
We present FACT, a method for predicting alignment quality (i.e., registration error) of registered lidar point cloud pairs. This is useful e.g. for quality assurance of large, automatically registered 3D models. FACT extracts local features from a registered pair and processes them with a point transformer-based network to predict a misalignment class. We generalize prior work that study binary alignment classification of registration errors, by recasting it as multinomial misalignment classification. To achieve this, we introduce a custom regression-by-classification loss function that combines the cross-entropy and Wasserstein losses, and demonstrate that it outperforms both direct regression and prior binary classification. FACT successfully classifies point-cloud pairs registered with both the classical ICP and GeoTransformer, while other choices, such as standard point-cloud-quality metrics and registration residuals are shown to be poor choices for predicting misalignment. On a synthetically perturbed point-cloud task introduced by the CorAl method, we show that FACT achieves substantially better performance than CorAl. Finally, we demonstrate how FACT can assist experts in correcting misaligned point-cloud maps. Our code is available at https://github.com/LudvigDillen/FACT_for_PCMC.
中文: FACT是一种通过点变换网络和定制损失函数预测激光雷达点云对配准质量的方法,在分类配准误差方面优于现有方法。
English: FACT is a method that predicts the alignment quality of registered lidar point cloud pairs using a point transformer-based network and a custom loss function, outperforming existing approaches in classifying misalignment.
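A minimal sketch of the regression-by-classification loss the abstract mentions, assuming ordinal misalignment classes: cross-entropy plus a 1D Wasserstein-1 term computed from the CDF difference between the predicted distribution and the one-hot target. The equal weighting is an illustrative choice, not the paper's.

```python
import torch
import torch.nn.functional as F

def ce_wasserstein_loss(logits: torch.Tensor, target: torch.Tensor, alpha: float = 0.5):
    """Regression-by-classification loss for ordinal classes: cross-entropy plus
    a Wasserstein-1 term over unit-spaced class bins (CDF difference)."""
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(target, num_classes=logits.size(-1)).float()
    wasserstein = (probs.cumsum(-1) - onehot.cumsum(-1)).abs().sum(-1).mean()
    ce = F.cross_entropy(logits, target)
    return alpha * ce + (1 - alpha) * wasserstein

logits = torch.randn(8, 5)              # 8 point-cloud pairs, 5 misalignment classes
target = torch.randint(0, 5, (8,))
print(ce_wasserstein_loss(logits, target).item())
```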
Authors:Sujay Khandagale, Bhawna Juneja, Prabhat Agarwal, Aditya Subramanian, Jaewon Yang, Yuting Wang
Abstract:
Modern search systems use a multi-stage architecture to deliver personalized results efficiently. Key stages include retrieval, pre-ranking, full ranking, and blending, which refine billions of items to top selections. The pre-ranking stage, vital for scoring and filtering hundreds of thousands of items down to a few thousand, typically relies on two tower models due to their computational efficiency, despite often falling short in capturing complex interactions. While query-item cross interaction features are paramount for full ranking, integrating them into pre-ranking models presents efficiency-related challenges. In this paper, we introduce InteractRank, a novel two tower pre-ranking model with robust cross interaction features used at Pinterest. By incorporating historical user engagement-based query-item interactions in the scoring function along with the two tower dot product, InteractRank significantly boosts pre-ranking performance with minimal latency and computation costs. In real-world A/B experiments at Pinterest, InteractRank improves the online engagement metric by 6.5% over a BM25 baseline and by 3.7% over a vanilla two tower baseline. We also highlight other components of InteractRank, like real-time user-sequence modeling, and analyze their contributions through offline ablation studies. The code for InteractRank is available at https://github.com/pinterest/atg-research/tree/main/InteractRank.
Chinese: InteractRank提出了一种新颖的双塔预排序模型,通过整合强大的交叉交互特征,在保持低延迟和计算成本的同时显著提升了性能,在Pinterest的实验中在线互动指标比基线模型提高了6.5%。
English: InteractRank introduces a novel two-tower pre-ranking model that integrates robust cross interaction features, significantly enhancing performance with minimal latency and computation costs, as demonstrated by a 6.5% improvement in online engagement over baseline models at Pinterest.
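A minimal sketch of the scoring function described above: the usual two-tower dot product plus a learned term over precomputed query-item engagement features. The feature dimensionality and the small MLP head are illustrative assumptions, not Pinterest's production model.

```python
import torch
import torch.nn as nn

class TwoTowerWithInteraction(nn.Module):
    """Two-tower pre-ranking score plus a cross-interaction term computed from
    historical query-item engagement features (placeholder architecture)."""

    def __init__(self, dim: int = 64, n_engagement_feats: int = 4):
        super().__init__()
        self.query_tower = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.interaction_head = nn.Sequential(
            nn.Linear(n_engagement_feats, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, query_feats, item_feats, engagement_feats):
        q = self.query_tower(query_feats)
        v = self.item_tower(item_feats)
        two_tower = (q * v).sum(-1, keepdim=True)        # cheap dot-product score
        cross = self.interaction_head(engagement_feats)  # historical query-item signal
        return two_tower + cross                         # combined pre-ranking score

model = TwoTowerWithInteraction()
score = model(torch.randn(32, 64), torch.randn(32, 64), torch.rand(32, 4))
print(score.shape)  # torch.Size([32, 1])
```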
Authors:Junrui Zhang, Chenjie Wang, Jie Peng, Haoyu Li, Jianmin Ji, Yu Zhang, Yanyong Zhang
Abstract:
Imitation learning based planning tasks on the nuPlan dataset have gained great interest due to their potential to generate human-like driving behaviors. However, open-loop training on the nuPlan dataset tends to cause causal confusion during closed-loop testing, and the dataset also presents a long-tail distribution of scenarios. These issues introduce challenges for imitation learning. To tackle these problems, we introduce CAFE-AD, a Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving method, designed to enhance feature representation across various scenario types. We develop an adaptive feature pruning module that ranks feature importance to capture the most relevant information while reducing the interference of noisy information during training. Moreover, we propose a cross-scenario feature interpolation module that enhances scenario information to introduce diversity, enabling the network to alleviate over-fitting in dominant scenarios. We evaluate our method CAFE-AD on the challenging public nuPlan Test14-Hard closed-loop simulation benchmark. The results demonstrate that CAFE-AD outperforms state-of-the-art methods including rule-based and hybrid planners, and exhibits the potential in mitigating the impact of long-tail distribution within the dataset. Additionally, we further validate its effectiveness in real-world environments. The code and models will be made available at https://github.com/AlniyatRui/CAFE-AD.
Chinese: CAFE-AD方法通过自适应特征剪枝和跨场景插值模块,有效解决了nuPlan模仿学习中的因果混淆和长尾分布问题,在基准测试和实际验证中均展现出优于现有方法的性能。
English: The proposed CAFE-AD method addresses causal confusion and long-tail distribution challenges in nuPlan-based imitation learning by introducing adaptive feature pruning and cross-scenario interpolation modules, demonstrating superior performance in both benchmark tests and real-world validation.
Authors:Minshuo Chen, Renyuan Xu, Yumin Xu, Ruixun Zhang
Abstract:
Financial scenario simulation is essential for risk management and portfolio optimization, yet it remains challenging, especially in high-dimensional and small-data settings common in finance. We propose a diffusion factor model that integrates latent factor structure into generative diffusion processes, bridging econometrics with modern generative AI to address the challenges of the curse of dimensionality and data scarcity in financial simulation. By exploiting the low-dimensional factor structure inherent in asset returns, we decompose the score function--a key component in diffusion models--using time-varying orthogonal projections, and this decomposition is incorporated into the design of neural network architectures. We derive rigorous statistical guarantees, establishing nonasymptotic error bounds of $O(d^{5/2} n^{-2/(k+5)})$ for score estimation and $O(d^{5/4} n^{-1/2(k+5)})$ for the generated distribution, primarily driven by the intrinsic factor dimension $k$ rather than the number of assets $d$, surpassing the dimension-dependent limits in the classical nonparametric statistics literature and making the framework viable for markets with thousands of assets. Numerical studies confirm superior performance in latent subspace recovery under small data regimes. Empirical analysis demonstrates the economic significance of our framework in constructing mean-variance optimal portfolios and factor portfolios. This work presents the first theoretical integration of factor structure with diffusion models, offering a principled approach for high-dimensional financial simulation with limited data. Our code is available at https://github.com/xymmmm00/diffusion_factor_model.
中文摘要:本文提出扩散因子模型,将潜在因子结构与生成扩散过程相结合,有效解决金融模拟中的高维数据稀缺难题,并通过理论证明和实证分析展示了其在投资组合优化中的显著优势。
English Summary: The paper introduces a diffusion factor model that combines latent factor structures with generative diffusion processes to overcome dimensionality and data scarcity challenges in financial simulations, providing theoretical guarantees and demonstrating effectiveness in portfolio optimization.
Authors:Huzaifa Arif, Keerthiram Murugesan, Payel Das, Alex Gittens, Pin-Yu Chen
Abstract:
This paper explores inference-time data leakage risks of deep neural networks (NNs), where a curious and honest model service provider is interested in retrieving users' private data inputs solely based on the model inference results. Particularly, we revisit residual NNs due to their popularity in computer vision and our hypothesis that residual blocks are a primary cause of data leakage owing to the use of skip connections. By formulating inference-time data leakage as a constrained optimization problem, we propose a novel backward feature inversion method, \textbf{PEEL}, which can effectively recover block-wise input features from the intermediate output of residual NNs. The surprising results in high-quality input data recovery can be explained by the intuition that the output from these residual blocks can be considered as a noisy version of the input and thus the output retains sufficient information for input recovery. We demonstrate the effectiveness of our layer-by-layer feature inversion method on facial image datasets and pre-trained classifiers. Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE). The code is available at \href{https://github.com/Huzaifa-Arif/PEEL}{https://github.com/Huzaifa-Arif/PEEL}
中文: 本文提出PEEL这一新型逆向特征还原方法,通过将残差神经网络中间输出视为输入的噪声版本,有效实现输入数据恢复,在人脸图像恢复任务中展现出比现有技术更优越的性能。
English: This paper introduces PEEL, a novel backward feature inversion method that effectively recovers input data from residual neural networks by treating intermediate outputs as noisy versions of inputs, demonstrating superior performance over existing techniques in facial image recovery.
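A minimal sketch of block-wise feature inversion in the spirit of the abstract: recover the input of a residual block from its observed output by minimizing reconstruction error, initializing at the output itself since $y = x + f(x)$ is a "noisy" version of $x$. The toy block, optimizer, and step count are illustrative choices, not the PEEL implementation.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Toy residual block y = x + f(x) used to illustrate the inversion."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.f(x)

def invert_residual_block(block: nn.Module, y: torch.Tensor, steps: int = 500, lr: float = 0.1):
    """Recover the block input from its output by gradient descent on the
    reconstruction error, starting from the output itself."""
    x_hat = y.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_hat], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (block(x_hat) - y).pow(2).mean()
        loss.backward()
        opt.step()
    return x_hat.detach()

res = Residual()
x_true = torch.randn(4, 32)
y = res(x_true).detach()
x_rec = invert_residual_block(res, y)
print((y - x_true).pow(2).mean().item(), (x_rec - x_true).pow(2).mean().item())
# error before vs. after inversion; the second is typically much smaller
```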
Authors:Zixuan Yi, Yao Tian, Zachary G. Ives, Ryan Marcus
Abstract:
Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce \textsc{LimeQO}, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing ``hot'' path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS. The code is available at \href{https://github.com/zixy17/LimeQO}{https://github.com/zixy17/LimeQO}.
中文: LimeQO提出了一种低秩学习框架,通过线性方法离线优化查询,高效预测查询计划延迟并减少工作量探索时间,提供低开销且无性能回退的保障。
English: LimeQO introduces a low-rank learning framework for offline query optimization that uses linear methods to efficiently predict query plan latencies and reduce workload exploration time, offering a low-overhead solution with a no-regressions guarantee.
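A minimal sketch of the low-rank idea: treat the workload as a partially observed (query x plan) latency matrix, complete it with alternating least squares, and use the predictions to pick which plans to time next. The rank, regularization, and greedy exploration rule are illustrative choices, not LimeQO's exact heuristics.

```python
import numpy as np

def als_complete(observed: np.ndarray, mask: np.ndarray, rank: int = 3,
                 iters: int = 50, reg: float = 0.1, seed: int = 0):
    """Low-rank completion of a partially observed latency matrix via ALS.
    observed holds measured latencies where mask is True; other entries are ignored."""
    rng = np.random.default_rng(seed)
    n, m = observed.shape
    U = rng.standard_normal((n, rank))
    V = rng.standard_normal((m, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        for i in range(n):                  # update query factors
            cols = mask[i]
            if cols.any():
                Vc = V[cols]
                U[i] = np.linalg.solve(Vc.T @ Vc + eye, Vc.T @ observed[i, cols])
        for j in range(m):                  # update plan factors
            rows = mask[:, j]
            if rows.any():
                Ur = U[rows]
                V[j] = np.linalg.solve(Ur.T @ Ur + eye, Ur.T @ observed[rows, j])
    return U @ V.T

# Greedy offline exploration: for each query, time the unobserved plan with the
# lowest predicted latency next.
latencies = np.abs(np.random.randn(20, 6)) + 0.1
mask = np.random.rand(20, 6) < 0.4          # ~40% of (query, plan) pairs timed so far
pred = als_complete(np.where(mask, latencies, 0.0), mask)
next_plan = np.where(mask, np.inf, pred).argmin(axis=1)
print(next_plan[:5])
```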
Authors:Bailey J. Eccles, Leon Wong, Blesson Varghese
Abstract:
Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning - the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use is noted for Mosaic models. Mosaic is available for public use from https://github.com/blessonvar/Mosaic
中文摘要:本文提出Mosaic系统,采用复合投影剪枝技术高效压缩大语言模型,相比现有方法,在模型生成速度、准确性和资源消耗方面均实现显著优化。
English Summary: This paper introduces Mosaic, a system utilizing composite projection pruning to efficiently compress large language models, achieving faster model generation, improved accuracy, and reduced resource consumption compared to existing methods.
Authors:Ziwei Yang, Takeyuki Tamura
Abstract:
In genome-scale constraint-based metabolic models, gene deletion strategies are crucial for achieving growth-coupled production, where cell growth and target metabolite production are simultaneously achieved. While computational methods for calculating gene deletions have been widely explored and contribute to developing gene deletion strategy databases, current approaches are limited in leveraging new data-driven paradigms, such as machine learning, for more efficient strain design. Therefore, it is necessary to propose a fundamental framework for this objective. In this study, we first formulate the problem of gene deletion strategy prediction and then propose a framework for predicting gene deletion strategies for growth-coupled production in genome-scale metabolic models. The proposed framework leverages deep learning algorithms to learn and integrate sequential gene and metabolite data representation, enabling the automatic gene deletion strategy prediction. Computational experiment results demonstrate the feasibility of the proposed framework, showing substantial improvements over baseline methods. Specifically, the proposed framework achieves a 14.69%, 22.52%, and 13.03% increase in overall accuracy across three metabolic models of different scales under study, while maintaining balanced precision and recall in predicting gene deletion statuses. The source code and examples for the framework are publicly available at https://github.com/MetNetComp/DeepGDel.
中文: 本研究提出一个深度学习框架,用于预测代谢模型中生长偶联生产的基因敲除策略,在多个模型规模上相比基线方法展现出显著提升的预测准确性。
English: This study introduces a deep learning framework that predicts gene deletion strategies for growth-coupled production in metabolic models, demonstrating improved accuracy over baseline methods across multiple model scales.
Authors:Matvei Popov, Aymen Kallala, Anirudha Ramesh, Narimane Hennouni, Shivesh Khaitan, Rick Gentry, Alain-Sam Cohen
Abstract:
Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
Authors:Dongyang Fan, Vinko Sabolčec, Matin Ansaripour, Ayush Kumar Tarun, Martin Jaggi, Antoine Bosselut, Imanol Schlag
Abstract:
The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions. Our website is available at https://data-compliance.github.io/.
Authors:Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro
Abstract:
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.
中文: 本研究提出一种高效训练方法,可将大语言模型的上下文长度从128K扩展到4M词元,在保持指令遵循和推理能力的同时,在长上下文基准测试中达到最优性能,并在标准任务中保持竞争力。
English: This work introduces an efficient training method to extend LLMs' context length from 128K to 4M tokens while maintaining instruction-following and reasoning capabilities, achieving state-of-the-art performance on long-context benchmarks and competitive results on standard tasks.
Authors:Luigi Tresca, Carolin Schmidt, James Harrison, Filipe Rodrigues, Gioele Zardini, Daniele Gammelli, Marco Pavone
Abstract:
Fleets of robo-taxis offering on-demand transportation services, commonly known as Autonomous Mobility-on-Demand (AMoD) systems, hold significant promise for societal benefits, such as reducing pollution, energy consumption, and urban congestion. However, orchestrating these systems at scale remains a critical challenge, with existing coordination algorithms often failing to exploit the systems' full potential. This work introduces a novel decision-making framework that unites mathematical modeling with data-driven techniques. In particular, we present the AMoD coordination problem through the lens of reinforcement learning and propose a graph network-based framework that exploits the main strengths of graph representation learning, reinforcement learning, and classical operations research tools. Extensive evaluations across diverse simulation fidelities and scenarios demonstrate the flexibility of our approach, achieving superior system performance, computational efficiency, and generalizability compared to prior methods. Finally, motivated by the need to democratize research efforts in this area, we release publicly available benchmarks, datasets, and simulators for network-level coordination alongside an open-source codebase designed to provide accessible simulation platforms and establish a standardized validation process for comparing methodologies. Code available at: https://github.com/StanfordASL/RL4AMOD
中文:本研究提出了一种新颖的强化学习框架,用于自动驾驶按需出行系统,结合图网络与运筹学方法提升协调效率和性能,并公开了基准测试和开源工具以支持研究普及。
English: This work introduces a novel reinforcement learning framework for Autonomous Mobility-on-Demand systems that combines graph networks with operations research to enhance coordination efficiency and performance, supported by publicly released benchmarks and open-source tools.
Authors:Kuntian Zhang, Simin Yu, Yaoshu Wang, Makoto Onizuka, Chuan Xiao
Abstract:
In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metrics framework with characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, and thus can be used to train GANs. CKGAN mitigates the notorious problem of mode collapse by mapping the generated images back to random noise. To save the effort of selecting the kernel function manually, we propose a soft selection method to automatically learn a characteristic kernel function. The experimental evaluation conducted on a set of synthetic and real image benchmarks (MNIST, CelebA, etc.) demonstrates that CKGAN generally outperforms other MMD-based GANs. The results also show that, at the cost of moderately more training time, the automatically selected kernel function delivers performance very close to the best manually fine-tuned kernel on real image benchmarks and is able to improve the performance of other MMD-based GANs.
中文: 本文提出CKGAN,一种基于特征核积分概率度量的生成对抗网络变体,通过将生成图像映射回随机噪声缓解模式崩溃问题,并采用软选择方法自动学习最优核函数,在多个基准数据集上优于其他基于MMD的GAN模型。
English: This paper introduces CKGAN, a GAN variant using characteristic kernel-based integral probability metrics to mitigate mode collapse and automatically learn optimal kernels, demonstrating superior performance over other MMD-based GANs on benchmark datasets.
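Since CKIPM is built around a lower bound of the maximum mean discrepancy, a small reference sketch of the standard unbiased MMD estimate with a Gaussian (characteristic) RBF kernel may help make the objective concrete. This is a generic illustration, not the authors' CKIPM loss or soft kernel selection; the sample arrays and bandwidth are hypothetical.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    # Gaussian RBF kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2*bandwidth^2))
    sq_dists = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    # Unbiased estimate of squared MMD between samples x ~ P and y ~ Q.
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    # Exclude diagonal terms for the unbiased within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

# Toy usage with hypothetical "real" and "generated" feature batches.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(256, 8))
fake = rng.normal(0.5, 1.0, size=(256, 8))
print(mmd2_unbiased(real, fake))
```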
Authors:Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li
Abstract:
The Mixture of Experts (MoE) architecture has demonstrated significant advantages as it increases model capacity without a proportional increase in computation. However, the large MoE model size still introduces substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead but faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, the hybrid CPU-GPU schedule for MoE is inherently complex due to the diverse expert sizes, structures, uneven workload distribution, etc. To address these challenges, in this paper, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33$\times$ in the prefill stage and 1.70$\times$ in the decode stage compared to the state-of-the-art hybrid MoE inference framework. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
中文:混合专家(MoE)架构在不显著增加计算成本的情况下提升了模型能力,但面临内存限制;提出的HybriMoE框架通过动态CPU-GPU调度和缓存管理,有效提升了推理效率。
English: The Mixture of Experts (MoE) architecture increases model capacity without proportional computational cost but faces memory constraints, which the proposed HybriMoE framework addresses through dynamic CPU-GPU scheduling and cache management to enhance inference efficiency.
Authors:Toby van Gastelen, Wouter Edeling, Benjamin Sanderse
Abstract:
Machine learning-based closure models for LES have shown promise in capturing complex turbulence dynamics but often suffer from instabilities and physical inconsistencies. In this work, we develop a novel skew-symmetric neural architecture as a closure model that enforces stability while preserving key physical conservation laws. Our approach leverages a discretization that ensures mass, momentum, and energy conservation, along with a face-averaging filter to maintain mass conservation in coarse-grained velocity fields. We compare our model against several conventional data-driven closures (including unconstrained convolutional neural networks), and the physics-based Smagorinsky model. Performance is evaluated on decaying turbulence and Kolmogorov flow for multiple coarse-graining factors. In these test cases we observe that unconstrained machine learning models suffer from numerical instabilities. In contrast, our skew-symmetric model remains stable across all tests, though at the cost of increased dissipation. Despite this trade-off, we demonstrate that our model still outperforms the Smagorinsky model in unseen scenarios. These findings highlight the potential of structure-preserving machine learning closures for reliable long-time LES.
中文: 本研究提出的斜对称神经网络闭合模型在确保数值稳定性和物理守恒律的同时,尽管耗散略有增加,但在未知场景中仍优于传统模型。
English: The proposed skew-symmetric neural network closure model ensures numerical stability and preserves physical conservation laws, outperforming conventional models in unseen scenarios despite slightly increased dissipation.
Authors:Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
Abstract:
Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervisions--such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By continuously minimizing the predictive entropy of LLMs on unlabeled questions in a latent semantic space, EMPO achieves competitive performance compared to supervised counterparts on both mathematical and free-form natural reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7\% to 48.1\% on mathematical benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1\% to 50.1\% on MMLU-Pro. Primary experiments and analysis are also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.
中文: 本文提出熵最小化策略优化(EMPO),这是一种完全无监督的方法,通过在潜在语义空间持续最小化大语言模型对未标注问题的预测熵,无需外部监督即可在数学和自由形式自然推理任务上取得与监督方法相媲美的性能。
English: This paper introduces Entropy Minimized Policy Optimization (EMPO), a fully unsupervised method that enhances large language models' reasoning by minimizing their predictive entropy on unlabeled questions, achieving competitive results on mathematical and natural reasoning tasks without external supervision.
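As a rough sketch of the entropy-minimization idea described above, the snippet below computes the entropy of a model's answer distribution over semantic clusters of sampled completions and backpropagates through it. The clustering labels, sampling setup, and normalization are illustrative assumptions, not the released EMPO implementation.

```python
import torch

def semantic_entropy_loss(cluster_ids, logprobs):
    """Entropy of the model's answer distribution over semantic clusters.

    cluster_ids: list of cluster labels, one per sampled answer (assumed to come
                 from some semantic grouping step, e.g. embedding similarity).
    logprobs:    tensor of per-answer sequence log-probabilities under the policy.
    """
    probs = torch.softmax(logprobs, dim=0)          # normalize over sampled answers
    clusters = sorted(set(cluster_ids))
    cluster_mass = torch.stack([
        probs[[i for i, c in enumerate(cluster_ids) if c == k]].sum()
        for k in clusters
    ])
    entropy = -(cluster_mass * torch.log(cluster_mass + 1e-12)).sum()
    return entropy  # minimizing this sharpens the model onto one semantic answer

# Toy usage: 6 sampled answers, 3 of which express the same meaning (cluster 0).
logprobs = torch.tensor([-4.0, -4.2, -3.9, -6.0, -7.5, -5.5], requires_grad=True)
loss = semantic_entropy_loss([0, 0, 0, 1, 2, 1], logprobs)
loss.backward()
```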
Authors:Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu
Abstract:
The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present $\textbf{StealthRank}$, a novel adversarial attack method that manipulates LLM-driven ranking systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs): adversarial text sequences embedded within item or document descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target items while avoiding explicit manipulation traces. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven ranking systems. Our code is publicly available at $\href{https://github.com/Tangyiming205069/controllable-seo}{here}$.
中文: 本文提出StealthRank攻击方法,通过基于能量的优化框架和朗之万动力学生成隐蔽提示,能在保持文本流畅性的同时有效操纵大语言模型驱动的排序系统,暗中提升目标条目排名且避免被检测到。
English: This paper introduces StealthRank, an adversarial attack method that manipulates LLM-driven ranking systems through energy-based optimization and Langevin dynamics to generate stealthy prompts, effectively boosting target items' rankings while maintaining textual fluency and evading detection.
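The abstract describes energy-based optimization with Langevin dynamics over adversarial text; the sketch below shows the generic shape of such an update on a continuous relaxation of token logits. The energy function (a ranking-promotion term plus a crude fluency proxy) is a hypothetical placeholder, not StealthRank's actual objective.

```python
import torch

def langevin_step(logits, energy_fn, step_size=0.1, noise_scale=0.01):
    # One Langevin update on relaxed token logits: gradient descent on the
    # energy plus Gaussian noise, which keeps the search from collapsing.
    logits = logits.detach().requires_grad_(True)
    energy = energy_fn(logits)
    grad, = torch.autograd.grad(energy, logits)
    noise = torch.randn_like(logits) * noise_scale
    return (logits - step_size * grad + noise).detach()

# Hypothetical energy: encourage a target token while staying close to a
# reference distribution (a crude stand-in for a fluency penalty).
vocab, seq_len, target_id = 100, 8, 7
reference = torch.randn(seq_len, vocab)

def energy_fn(logits):
    probs = torch.softmax(logits, dim=-1)
    rank_term = -probs[:, target_id].mean()              # "promote" the target
    fluency_term = ((logits - reference) ** 2).mean()    # stay near the reference
    return rank_term + 0.1 * fluency_term

logits = reference.clone()
for _ in range(50):
    logits = langevin_step(logits, energy_fn)
```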
Authors:Luigi Rovito, Marco Virgolin
Abstract:
Survival Regression (SuR) is a key technique for modeling time to event in important applications such as clinical trials and semiconductor manufacturing. Currently, SuR algorithms belong to one of three classes: non-linear black-box -- allowing adaptability to many datasets but offering limited interpretability (e.g., tree ensembles); linear glass-box -- being easier to interpret but limited to modeling only linear interactions (e.g., Cox proportional hazards); and non-linear glass-box -- allowing adaptability and interpretability, but empirically found to have several limitations (e.g., explainable boosting machines, survival trees). In this work, we investigate whether Symbolic Regression (SR), i.e., the automated search of mathematical expressions from data, can lead to non-linear glass-box survival models that are interpretable and accurate. We propose an evolutionary, multi-objective, and multi-expression implementation of SR adapted to SuR. Our empirical results on five real-world datasets show that SR consistently outperforms traditional glass-box methods for SuR in terms of accuracy per number of dimensions in the model, while exhibiting comparable accuracy with black-box methods. Furthermore, we offer qualitative examples to assess the interpretability potential of SR models for SuR. Code at: https://github.com/lurovi/SurvivalMultiTree-pyNSGP.
中文:符号回归被提出作为生存回归的新型非线性透明盒方法,在保持与黑盒模型相媲美的准确性的同时,其精度优于传统可解释方法,并展现出更强的可解释性潜力。
English: Symbolic Regression is proposed as a novel non-linear glass-box method for Survival Regression, demonstrating superior accuracy over traditional interpretable models while maintaining competitive performance with black-box approaches and offering enhanced interpretability.
Authors:Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov
Abstract:
Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo: http://knnsvc.com Code: https://github.com/SmoothKen/knn-svc
中文摘要:本文通过加性合成和新距离度量方法,提升了kNN-VC框架在零样本歌声转换中的谐波质量和流畅度。
English Summary: This paper enhances zero-shot singing voice conversion by improving harmonic quality and smoothness in the kNN-VC framework through additive synthesis and a novel distance metric.
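To make the candidate-filtering and weighting step concrete, here is a minimal concatenative kNN matching sketch under cosine distance, where candidates beyond a distance threshold are dropped and the rest are combined with inverse-distance weights. The threshold, weighting rule, and feature shapes are assumptions for illustration, not the paper's proposed metric or additive-synthesis pipeline.

```python
import numpy as np

def knn_convert(query_feats, ref_feats, k=4, max_dist=0.35):
    """Replace each query frame by a weighted sum of its nearest reference frames.

    query_feats: (T, D) source-singer frames; ref_feats: (N, D) target-singer frames.
    Candidates farther than max_dist (cosine distance) are filtered out, loosely
    mirroring the idea of discarding unsuitable kNN candidates for smoothness.
    """
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    dist = 1.0 - q @ r.T                       # cosine distance, shape (T, N)
    out = np.empty_like(query_feats)
    for t in range(len(q)):
        idx = np.argsort(dist[t])[:k]
        keep = idx[dist[t, idx] <= max_dist]
        if keep.size == 0:                     # fall back to the single closest frame
            keep = idx[:1]
        w = 1.0 / (dist[t, keep] + 1e-6)       # closer frames get larger weights
        out[t] = (w[:, None] * ref_feats[keep]).sum(0) / w.sum()
    return out

converted = knn_convert(np.random.randn(100, 1024), np.random.randn(500, 1024))
```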
Authors:Igor Polyakov, Alexey Dukhanov, Egor Spirin
Abstract:
The increasing complexity of large language models (LLMs) necessitates efficient training strategies to mitigate the high computational costs associated with distributed training. A significant bottleneck in this process is gradient synchronization across multiple GPUs, particularly in the zero-redundancy parallelism mode. In this paper, we introduce Transformer-Aware Gradient Compression (TAGC), an optimized gradient compression algorithm designed specifically for transformer-based models. TAGC extends the lossless homomorphic compression method by adapting it for sharded models and incorporating transformer-specific optimizations, such as layer-selective compression and dynamic sparsification. Our experimental results demonstrate that TAGC accelerates training by up to 15% compared to the standard Fully Sharded Data Parallel (FSDP) approach, with minimal impact on model quality. We integrate TAGC into the PyTorch FSDP framework; the implementation is publicly available at https://github.com/ipolyakov/TAGC.
中文: 本文提出TAGC算法,这是一种针对Transformer模型的梯度压缩技术,可将大型语言模型的分布式训练速度提升高达15%,且对模型质量影响极小,现已集成至PyTorch FSDP框架。
English: This paper introduces TAGC, a transformer-aware gradient compression algorithm that accelerates distributed training of large language models by up to 15% with minimal quality loss, now integrated into PyTorch FSDP.
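TAGC itself extends a lossless homomorphic compression scheme; purely as a loose illustration of the dynamic sparsification idea mentioned above, the sketch below applies top-k gradient sparsification with error feedback before synchronization. This is not the TAGC algorithm or its FSDP integration; the compression ratio and buffering scheme are assumptions.

```python
import torch

class TopKCompressor:
    """Keep only the largest-magnitude gradient entries, carrying the rest forward.

    The error-feedback buffer accumulates what was dropped so that, over steps,
    no gradient signal is permanently lost. Ratio and buffering are illustrative.
    """
    def __init__(self, ratio=0.05):
        self.ratio = ratio
        self.residual = {}

    def compress(self, name, grad):
        buf = self.residual.get(name, torch.zeros_like(grad))
        corrected = grad + buf
        k = max(1, int(corrected.numel() * self.ratio))
        flat = corrected.flatten()
        _, topk_idx = torch.topk(flat.abs(), k)
        sparse = torch.zeros_like(flat)
        sparse[topk_idx] = flat[topk_idx]
        self.residual[name] = (flat - sparse).view_as(grad)   # keep the dropped part
        return sparse.view_as(grad)                           # what would be all-reduced

compressor = TopKCompressor(ratio=0.05)
grad = torch.randn(4096, 4096)
compressed = compressor.compress("layer0.weight", grad)
```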
Authors:Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram
Abstract:
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL (Dynamic Exit Layer), a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate for tokens drafted at each layer of the LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times$$\sim$$2.62\times$ over vanilla auto-regressive decoding and improves upon state-of-the-art SD methods, which peak at $2.43\times$, by up to $0.19\times$. The code is available at https://github.com/hoenza/DEL.
中文: 推测解码通过使用草稿模型高效生成多个候选标记,再由目标模型并行验证,从而在不降低生成质量的前提下加速大语言模型的推理过程。
English: Speculative Decoding accelerates large language model inference by using a draft model to propose tokens and the target model to verify them in parallel, maintaining quality while increasing speed.
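A schematic of the bookkeeping a DEL-style method needs: keep a running estimate of per-layer token acceptance and pick the (exit layer, speculation length) pair that maximizes expected accepted tokens per unit of drafting cost. The EMA update, cost model, and i.i.d. acceptance assumption below are illustrative, not the paper's exact heuristic.

```python
import numpy as np

class ExitLayerSelector:
    """Pick (exit_layer, speculation_length) from running acceptance statistics."""

    def __init__(self, num_layers, max_spec_len=8, momentum=0.9):
        self.accept_rate = np.full(num_layers, 0.5)   # EMA of per-layer acceptance
        self.max_spec_len = max_spec_len
        self.momentum = momentum
        self.num_layers = num_layers

    def update(self, layer, accepted, drafted):
        rate = accepted / max(drafted, 1)
        self.accept_rate[layer] = (self.momentum * self.accept_rate[layer]
                                   + (1 - self.momentum) * rate)

    def select(self):
        best, best_score = (self.num_layers // 2, 1), -np.inf
        for layer in range(1, self.num_layers):
            draft_cost = layer / self.num_layers      # drafting runs `layer` of L layers
            for spec_len in range(1, self.max_spec_len + 1):
                # Expected accepted tokens if each drafted token is accepted i.i.d.
                p = self.accept_rate[layer]
                expected = (p * (1 - p**spec_len)) / (1 - p) if p < 1 else spec_len
                score = expected / (1 + draft_cost * spec_len)
                if score > best_score:
                    best, best_score = (layer, spec_len), score
        return best

sel = ExitLayerSelector(num_layers=32)
sel.update(layer=8, accepted=3, drafted=5)
print(sel.select())
```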
Authors:Arnas Uselis, Seong Joon Oh
Abstract:
Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network's last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce \textit{Intermediate Layer Classifiers} (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility. Code is available at https://github.com/oshapio/intermediate-layer-generalization
中文: 中间层分类器通常比倒数第二层提供显著更好的分布外泛化能力,因为它们对多种数据集和架构中的数据分布变化较不敏感。
English: Intermediate layer classifiers often provide significantly better out-of-distribution generalization than penultimate layers, as they are less sensitive to data distribution shifts across various datasets and architectures.
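A minimal sketch of probing an intermediate layer: hook the output of an inner block of a pretrained backbone, pool it, and fit a linear classifier on the pooled features. The backbone, layer choice, and toy batch are placeholders; the paper's ILC protocol may differ in detail.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Grab features from an intermediate block (here layer2) with a forward hook.
model = resnet18(weights=None)   # weights omitted here; use pretrained weights in practice
model.eval()
features = {}

def hook(_module, _inp, out):
    # Global-average-pool the spatial map so a linear probe can consume it.
    features["layer2"] = out.mean(dim=(2, 3)).detach()

model.layer2.register_forward_hook(hook)

probe = nn.Linear(128, 10)       # layer2 of resnet18 outputs 128 channels
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Toy batch standing in for an in-distribution training set.
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 10, (16,))

with torch.no_grad():
    model(images)                # populates features["layer2"]

logits = probe(features["layer2"])
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```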
Authors:Yue Yao, Mohamed-Khalil Bouzidi, Daniel Goehring, Joerg Reichardt
Abstract:
As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints are available at: https://github.com/continental/EP-Diffuser.
Chinese: EP-Diffuser是一种参数高效的扩散模型,能够生成多样且合理的交通场景预测,尽管模型规模较小,却在准确性和鲁棒性上超越了现有最优模型。
English: EP-Diffuser is a parameter-efficient diffusion model that generates diverse and plausible traffic scene predictions, achieving superior accuracy and robustness compared to state-of-the-art models despite its smaller size.
Authors:Junghun Oh, Sungyong Baik, Kyoung Mu Lee
Abstract:
The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git
Chinese: 该研究表明,在随机初始化的网络中保留参数符号配置和稀疏性,通过线性连通性维持低误差障碍,能够使其达到与完全训练的稀疏网络相媲美的性能。
English: The study reveals that preserving parameter sign configurations and sparsity in randomly initialized networks enables them to achieve performance comparable to fully trained sparse networks by maintaining low error barriers through linear connectivity.
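The core observation above is that the sparsity mask and parameter signs carry the transferable information. A small sketch of that transfer, copying the mask and signs of a trained sparse network onto a freshly initialized one while keeping the new magnitudes, is given below; the helper name and the faked "trained" network are hypothetical.

```python
import torch
import torch.nn as nn

def transfer_signs_and_mask(trained_sparse, fresh_init):
    """Give `fresh_init` the sparsity pattern and parameter signs of `trained_sparse`,
    while keeping its own randomly initialized magnitudes."""
    with torch.no_grad():
        for p_src, p_dst in zip(trained_sparse.parameters(), fresh_init.parameters()):
            mask = (p_src != 0).float()            # sparsity pattern of the trained net
            signs = torch.sign(p_src)
            signs[signs == 0] = 1.0                # arbitrary sign for pruned entries
            p_dst.mul_(mask)                       # inherit sparsity
            p_dst.abs_().mul_(signs)               # inherit signs, keep own magnitudes
    return fresh_init

trained = nn.Linear(64, 64)
with torch.no_grad():                              # fake a "trained sparse" network
    trained.weight.mul_((torch.rand_like(trained.weight) > 0.8).float())
fresh = transfer_signs_and_mask(trained, nn.Linear(64, 64))
```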
Authors:Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xu-Yao Zhang
Abstract:
Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.
中文摘要:本文提出了一种新颖的训练后量化框架,通过细粒度分组和误差减少技术,在1位权重和激活下实现卓越性能,推动了全二值化大语言模型的发展。
English Summary: This paper introduces a novel post-training quantization framework that achieves superior performance with 1-bit weights and activations through fine-grained grouping and error-reduction techniques, advancing toward fully binarized LLMs.
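The A(1*4) part of the configuration rests on the identity that an unsigned 4-bit value equals a weighted sum of four 1-bit planes, so an INT4 matmul can be replaced by four binary matmuls with a 4-fold channel expansion. The numpy check below verifies only this equivalence; it is not the authors' kernels, Hessian-aware grouping, or scale smoothing.

```python
import numpy as np

def int4_to_bitplanes(x_int4):
    """Decompose unsigned INT4 values (0..15) into four INT1 planes.

    x = 8*b3 + 4*b2 + 2*b1 + 1*b0, so a matmul against INT4 activations equals a
    weighted sum of four 1-bit matmuls -- the 4-fold channel expansion in the text.
    """
    planes = [(x_int4 >> k) & 1 for k in range(4)]            # b0..b3, each in {0,1}
    weights = np.array([1, 2, 4, 8])
    return planes, weights

rng = np.random.default_rng(0)
acts = rng.integers(0, 16, size=(2, 6))                       # fake INT4 activations
w = rng.standard_normal((6, 3))                               # fake weight matrix

planes, scales = int4_to_bitplanes(acts)
full = acts @ w                                               # reference INT4 matmul
decomposed = sum(s * (p @ w) for s, p in zip(scales, planes)) # 4 binary matmuls
assert np.allclose(full, decomposed)
```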
Authors:Hao Nan Sheng, Zhi-yong Wang, Mingrui Yang, Hing Cheung So
Abstract:
As large language models continue to grow in size, parameter-efficient fine-tuning (PEFT) has become increasingly crucial. While low-rank adaptation (LoRA) offers a solution through low-rank updates, its static rank allocation may yield suboptimal results. Adaptive low-rank adaptation (AdaLoRA) improves this with dynamic allocation but remains sensitive to initial and target rank configurations. We introduce AROMA, a framework that automatically constructs layer-specific updates by iteratively building up rank-one components with very few trainable parameters that gradually diminish to zero. Unlike existing methods that employ rank reduction mechanisms, AROMA introduces a dual-loop architecture for rank growth. The inner loop extracts information from each rank-one subspace, while the outer loop determines the number of rank-one subspaces, i.e., the optimal rank. We reset optimizer states to maintain subspace independence. AROMA significantly reduces parameters compared to LoRA and AdaLoRA while achieving superior performance on natural language understanding and commonsense reasoning tasks, offering new insights into adaptive PEFT. The code is available at \href{https://github.com/ShuDun23/AROMA}{AROMA}.
Chinese: AROMA提出了一种双循环框架,通过迭代构建极少可训练参数的秩一组件来自动生成层级特定更新,相比LoRA和AdaLoRA在语言任务中显著提升了参数效率与性能表现。
English: AROMA introduces a dual-loop framework that dynamically constructs layer-specific updates by iteratively building rank-one components with minimal trainable parameters, significantly outperforming LoRA and AdaLoRA in efficiency and performance on language tasks.
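A schematic of growing a layer update as a sum of rank-one components until the newest component stops contributing, which is the general shape of the inner/outer loop described above. The stopping rule, the toy inner loop, and the absence of optimizer resets are simplifications, not AROMA's actual training procedure.

```python
import torch

def build_rank_one_update(d_out, d_in, train_component, tol=1e-3, max_rank=16):
    """Accumulate W_update = sum_i u_i v_i^T, adding components until they stop helping.

    `train_component` is a stand-in for the inner loop: given the current update,
    it returns a new trained (u, v) pair for this layer.
    """
    update = torch.zeros(d_out, d_in)
    for rank in range(max_rank):
        u, v = train_component(update)              # inner loop: fit one rank-one part
        component = torch.outer(u, v)
        if component.norm() < tol * (update.norm() + 1e-8):
            break                                   # outer loop: stop growing the rank
        update = update + component
    return update, rank

# Toy inner loop: take the best rank-one piece of the remaining residual.
target = torch.randn(32, 16)

def train_component(update):
    residual = target - update
    u, s, vh = torch.linalg.svd(residual, full_matrices=False)
    return s[0].sqrt() * u[:, 0], s[0].sqrt() * vh[0]

approx, used_rank = build_rank_one_update(32, 16, train_component)
```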
Authors:Hansheng Chen, Kai Zhang, Hao Tan, Zexiang Xu, Fujun Luan, Leonidas Guibas, Gordon Wetzstein, Sai Bi
Abstract:
Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models where a single Gaussian is learned with an $L_2$ denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256$\times$256.
中文摘要:提出的GMFlow模型通过预测高斯混合参数来更精确地捕捉速度分布,改进了扩散模型和流匹配方法,其新型概率引导方案实现了更优的少步采样效果并有效缓解了色彩过饱和问题。
English Summary: The proposed GMFlow model improves upon diffusion and flow matching methods by predicting Gaussian mixture parameters for more accurate velocity distributions, enabling superior few-step sampling and reducing color over-saturation through a novel probabilistic guidance scheme.
Authors:Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, John Langford
Abstract:
Orthonormalized updates accelerate training, improve stability, and enable robust hyperparameter transfer, but existing methods like Muon rely on dense matrix operations that clash with sharded weights in large-scale LLM training, causing high compute and communication cost. We introduce Dion (Distributed Orthonormalization), a scalable and efficient update rule that replaces Newton-Schulz iteration with amortized power iteration on a momentum buffer, avoiding full-matrix reconstruction and integrating cleanly with weight sharding. The rank-fraction parameter with error feedback enables low-rank updates that balance quality with significant cost savings. On language models from 160M to 3B parameters, Dion retains the benefits of orthonormalized updates, while markedly reducing wall-clock time at scale, making it a practical optimizer for next-generation foundation models. Code is available at: https://github.com/microsoft/dion/
中文: Dion提出了一种可扩展的分布式正交化方法,通过基于动量的迭代替代密集矩阵运算,在保持更新质量的同时显著提升了大规模语言模型训练的效率和稳定性。
English: Dion introduces a scalable distributed orthonormalization method that replaces dense matrix operations with efficient momentum-based iterations, enabling faster and more stable training of large language models while maintaining update quality.
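As a rough, single-device approximation of the idea of amortized power iteration on a momentum buffer, the sketch below refreshes a carried low-rank basis with one matmul-plus-QR step and forms a low-rank update from it. The factorization, rank handling, and absence of sharding or error feedback are assumptions; this is not the released Dion optimizer.

```python
import torch

def lowrank_orthonormal_update(momentum, Q_prev, eps=1e-8):
    """One amortized power-iteration step on the momentum buffer.

    momentum: (m, n) momentum matrix for a weight; Q_prev: (n, rank) basis carried
    across steps. Returns a roughly orthonormalized low-rank update direction and
    the refreshed right basis.
    """
    P = momentum @ Q_prev                        # (m, rank) power-iteration step
    P, _ = torch.linalg.qr(P)                    # orthonormal left factor
    Q = momentum.T @ P                           # (n, rank) refreshed right basis
    Q = Q / (torch.linalg.norm(Q, dim=0, keepdim=True) + eps)
    update = P @ Q.T                             # low-rank update direction
    return update, Q

m, n, rank = 512, 256, 32
momentum = torch.randn(m, n)
Q = torch.linalg.qr(torch.randn(n, rank))[0]
update, Q = lowrank_orthonormal_update(momentum, Q)
```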
Authors:Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis
Abstract:
As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets. The code is available at https://github.com/BurakGurbuz97/PEAKS.
中文: 本研究提出了增量数据选择(IDS)问题,并开发了PEAKS方法,通过结合特征空间几何关系与预测误差,在动态数据流场景中持续优于现有选择策略。
English: The study introduces the Incremental Data Selection (IDS) problem and proposes PEAKS, an efficient method that leverages geometric relationships and prediction errors to outperform existing strategies in dynamic data streaming contexts.
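The method's name suggests a score that combines prediction error with kernel similarity to the already-selected buffer; a hypothetical version of such a score is sketched below. The exact combination (error down-weighted by maximum RBF similarity) is an assumption for illustration, not necessarily the paper's formula.

```python
import numpy as np

def peaks_style_score(feat, pred_probs, label, selected_feats, bandwidth=1.0):
    """Higher score = more worth keeping.

    feat:           (D,) feature embedding of the incoming example
    pred_probs:     (C,) current model's predicted class probabilities
    label:          int ground-truth class
    selected_feats: (K, D) features of examples already in the buffer
    """
    pred_error = 1.0 - pred_probs[label]                        # wrong/uncertain -> large
    if len(selected_feats) == 0:
        return pred_error
    dists = np.linalg.norm(selected_feats - feat, axis=1)
    similarity = np.exp(-(dists ** 2) / (2 * bandwidth ** 2)).max()
    return pred_error * (1.0 - similarity)                      # redundant examples score low

# Keep the incoming example only if it scores above the weakest one in a full buffer.
score = peaks_style_score(np.random.randn(64),
                          np.array([0.7, 0.2, 0.1]), label=1,
                          selected_feats=np.random.randn(100, 64))
```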
Authors:Wenzhao Tang, Weihang Li, Xiucheng Liang, Olaf Wysocki, Filip Biljecki, Christoph Holst, Boris Jutzi
Abstract:
Despite recent advancements in surface reconstruction, Level of Detail (LoD) 3 building reconstruction remains an unresolved challenge. The main issue pertains to the object-oriented modelling paradigm, which requires georeferencing, watertight geometry, facade semantics, and low-poly representation -- contrasting with unstructured mesh-oriented models. In Texture2LoD3, we introduce a novel method leveraging the ubiquity of 3D building model priors and panoramic street-level images, enabling the reconstruction of LoD3 building models. We observe that prior low-detail building models can serve as valid planar targets for ortho-rectifying street-level panoramic images. Moreover, deploying segmentation on accurately textured low-level building surfaces supports maintaining essential georeferencing, watertight geometry, and low-poly representation for LoD3 reconstruction. In the absence of LoD3 validation data, we additionally introduce the ReLoD3 dataset, on which we experimentally demonstrate that our method leads to improved facade segmentation accuracy by 11% and can replace costly manual projections. We believe that Texture2LoD3 can scale the adoption of LoD3 models, opening applications in estimating building solar potential or enhancing autonomous driving simulations. The project website, code, and data are available here: https://wenzhaotang.github.io/Texture2LoD3/.
Authors:Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke
Abstract:
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable. To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO's potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
中文: 提出的直接文档相关性优化(DDRO)方法通过成对排序将标记级文档标识符生成与文档级相关性对齐,在显著提升检索效果的同时简化了优化流程,优于基于强化学习的方法。
English: The proposed direct document relevance optimization (DDRO) method aligns token-level document identifier generation with document-level relevance through pairwise ranking, outperforming reinforcement learning-based approaches with significant improvements in retrieval effectiveness while simplifying the optimization process.
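The abstract describes direct pairwise ranking optimization over docid generation; a minimal DPO-style pairwise loss on docid sequence log-likelihoods, shown below, captures that general shape. The reference-model terms and the beta temperature are assumptions, not necessarily DDRO's exact objective.

```python
import torch
import torch.nn.functional as F

def pairwise_docid_loss(logp_pos, logp_neg, logp_pos_ref, logp_neg_ref, beta=0.1):
    """Pairwise ranking loss on docid sequence log-likelihoods.

    logp_*      : summed token log-probs of the relevant (pos) and irrelevant (neg)
                  docid under the model being trained.
    logp_*_ref  : the same quantities under a frozen reference model.
    """
    margin = beta * ((logp_pos - logp_pos_ref) - (logp_neg - logp_neg_ref))
    return -F.logsigmoid(margin).mean()

# Toy batch of 4 (query, positive docid, negative docid) triples.
loss = pairwise_docid_loss(
    logp_pos=torch.tensor([-12.0, -10.5, -11.2, -9.8], requires_grad=True),
    logp_neg=torch.tensor([-11.0, -12.0, -10.0, -13.0]),
    logp_pos_ref=torch.tensor([-12.5, -10.0, -11.0, -10.0]),
    logp_neg_ref=torch.tensor([-11.2, -12.5, -10.5, -12.8]),
)
loss.backward()
```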
Authors:Guangqiang Li, M. Amine Atoui, Xiangshun Li
Abstract:
Fault diagnosis in multimode processes plays a critical role in ensuring the safe operation of industrial systems across multiple modes. A key challenge remains to be addressed: significant distributional differences among monitoring data from multiple modes make it difficult for models to extract shared feature representations related to system health conditions. In response to this problem, this paper introduces a novel method called attention-based multiscale temporal fusion network. The multiscale depthwise convolution and gated recurrent unit are employed to extract multiscale contextual local features and long-short-term features. Instance normalization is applied to suppress mode-specific information. Furthermore, a temporal attention mechanism is designed to focus on critical time points with higher cross-mode shared information, thereby enhancing the accuracy of fault diagnosis. The proposed model is applied to the Tennessee Eastman process dataset and the three-phase flow facility dataset. The experiments demonstrate that the proposed model achieves superior diagnostic performance and maintains a small model size. The source code will be available on GitHub at https://github.com/GuangqiangLi/AMTFNet.
中文摘要:本文提出了一种基于注意力的多尺度时序融合网络,通过多尺度特征提取和时间注意力机制解决多模态过程中的故障诊断难题,在保持模型轻量化的同时实现了优越的诊断性能。
English Summary: This paper introduces an attention-based multiscale temporal fusion network that addresses multimode process fault diagnosis by extracting shared features through multiscale analysis and temporal attention, achieving superior performance with compact model size.
Authors:Suhang Gu, Ye Wang, Yongxin Chou, Jinliang Cong, Mingli Lu, Zhuqing Jiao
Abstract:
Clustering is an efficient and essential technique for exploring latent knowledge of data. However, limited attention has been given to the interpretability of the clusters detected by most clustering algorithms. In addition, due to the homogeneity of data, different groups of data have their own homogeneous styles. In this paper, the above two aspects are considered, and an interpretable style Takagi-Sugeno-Kang (TSK) fuzzy clustering (IS-TSK-FC) algorithm is proposed. The clustering behavior of IS-TSK-FC is fully guided by the TSK fuzzy inference on fuzzy rules. In particular, samples are grouped into clusters represented by the corresponding consequent vectors of all fuzzy rules learned in an unsupervised manner. This can explain how the clusters are generated in detail, thus making the underlying decision-making process of the IS-TSK-FC interpretable. Moreover, a series of style matrices are introduced to facilitate the consequents of fuzzy rules in IS-TSK-FC by capturing the styles of clusters as well as the nuances between different styles. Consequently, all the fuzzy rules in IS-TSK-FC have powerful data representation capability. After determining the antecedents of all the fuzzy rules, the optimization problem of IS-TSK-FC can be iteratively solved in an alternation manner. The effectiveness of IS-TSK-FC as an interpretable clustering tool is validated through extensive experiments on benchmark datasets with unknown implicit/explicit styles. Notably, the superior clustering performance of IS-TSK-FC is demonstrated on case studies where different groups of data present explicit styles. The source code of IS-TSK-FC can be downloaded from https://github.com/gusuhang10/IS-TSK-FC.
中文: 本文提出了一种可解释的风格TSK模糊聚类算法(IS-TSK-FC),通过模糊规则和风格矩阵提升聚类可解释性,并在基准数据集上验证了其有效性。
English: This paper introduces an interpretable style TSK fuzzy clustering algorithm (IS-TSK-FC) that enhances cluster interpretability through fuzzy rules and style matrices, validated by experiments on benchmark datasets.
Authors:Liu Xiao, Li Zhiyuan, Lin Yueyu
Abstract:
Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.
中文: 本文提出状态调优这一新型测试时扩展方法,针对RWKV-7模型通过动态扩展和优化状态矩阵(无需修改预训练权重)实现性能突破,其三大核心创新——观察者框架、核函数状态扩展与去相关反向传播——使小模型在特定任务上超越大模型表现。
English: This paper introduces state tuning, a novel test-time scaling method for the RWKV-7 model that enhances performance by dynamically upscaling and optimizing the state matrix without altering pre-trained weights, achieving state-of-the-art results through three key innovations: an observer framework, kernel-based state upscaling, and Decorrelated Backpropagation.
Authors:Chandra Raskoti, Iftekharul Islam, Xuan Wang, Weizi Li
Abstract:
Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments where both human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors -- such as acceleration, deceleration, and left and right maneuvers -- pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.
Chinese: MIAT架构通过结合机动意图感知与时空交互建模,有效提升车辆轨迹预测精度,在真实数据集上实现短期预测性能最高提升4.7%、长期预测提升1.6%的突破。
English: The MIAT architecture enhances vehicle trajectory prediction by integrating maneuver intention awareness with spatiotemporal modeling, achieving up to 4.7% and 1.6% improvements in short- and long-horizon predictions respectively on real-world datasets.
Authors:Ran Xu, Wenqi Shi, Yuchen Zhuang, Yue Yu, Joyce C. Ho, Haoyu Wang, Carl Yang
Abstract:
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available on https://github.com/ritaranx/Collab-RAG/.
中文:Collab-RAG通过让小型语言模型分解复杂问题、大型语言模型提供反馈,有效提升了多跳问答性能,无需昂贵蒸馏即可实现卓越表现。
English: Collab-RAG enhances multi-hop question-answering by enabling a small language model to decompose complex queries and a large language model to provide feedback, achieving superior performance without expensive distillation.
Authors:Tianyang Wu, Lipeng Wan, Yuhang Wang, Qiang Wan, Xuguang Lan
Abstract:
Significant progress has been made in AI for games, including board games, MOBA, and RTS games. However, complex agents are typically developed in an embedded manner, directly accessing game state information, unlike human players who rely on noisy visual data, leading to unfair competition. Developing complex non-embedded agents remains challenging, especially in card-based RTS games with complex features and large state spaces. We propose a non-embedded offline reinforcement learning training strategy using visual inputs to achieve real-time autonomous gameplay in the RTS game Clash Royale. Due to the lack of an object detection dataset for this game, we designed an efficient generative object detection dataset for training. We extract features using state-of-the-art object detection and optical character recognition models. Our method enables real-time image acquisition, perception feature fusion, decision-making, and control on mobile devices, successfully defeating built-in AI opponents. All code is open-sourced at https://github.com/wty-yy/katacr.
中文摘要:本研究提出了一种基于视觉输入的非嵌入式强化学习方法,通过创建生成式目标检测数据集解决了《皇室战争》中目标检测数据缺失的难题,实现了移动端实时自主游戏并成功击败内置AI对手。
English Summary: This study introduces a non-embedded reinforcement learning approach using visual inputs to enable real-time autonomous gameplay in Clash Royale, overcoming challenges like the absence of object detection datasets by creating a generative dataset and achieving victory against built-in AI opponents.
Authors:Inhwan Bae, Junoh Lee, Hae-Gon Jeon
Abstract:
Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. We demonstrate that our approach effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Code is publicly available at https://github.com/InhwanBae/CrowdES .
中文: 本文提出了一种新颖框架,通过交替使用人群发射器和模拟器来生成具有异质行为的连续逼真人群轨迹,该方法在不同地理环境中展现出优异泛化能力且支持全流程用户控制。
English: This paper introduces a novel framework for generating continuous, realistic crowd trajectories with heterogeneous behaviors using an alternating emitter-simulator approach, which demonstrates superior generalization across environments and offers full user control.
Authors:Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song
Abstract:
The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. Code is available at https://github.com/sunblaze-ucb/llm-api-audit
中文摘要:商业大语言模型API存在信任问题,服务商可能暗中替换廉价模型,软件检测方法效果不佳,而基于可信执行环境的硬件级安全方案能以较小性能开销提供可靠解决方案。
English Summary: Commercial LLM APIs face a trust issue where providers may secretly substitute cheaper models, and software detection methods prove unreliable, but hardware-level security using Trusted Execution Environments offers a robust solution with minimal performance impact.
Authors:Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li
Abstract:
The increasing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leverage the attention weights to evict non-critical cache tokens. However, those methods come with a trade-off: they usually require major modification of the inference infrastructure and incur significant computation overhead. Based on the fact that large language models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the KV states themselves. It is an entirely attention-free method that offers easy integration into mainstream inference platforms and performance comparable to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, in the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
中文摘要:LagKV是一种创新的KV缓存压缩方法,通过直接比较KV条目而无需注意力权重,实现了便捷集成和在长上下文大语言模型推理中的卓越性能。
English Summary: LagKV is a novel KV cache compression method that eliminates the need for attention weights by comparing KV entries directly, offering easy integration and superior performance in long-context LLM inference.
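A rough sketch of an attention-free eviction score in the spirit described above: compare older cached key/value states against the statistics of a trailing "lag" window and keep the most distinctive tokens. The specific normalization and scoring rule are assumptions inspired by the abstract, not the released LagKV algorithm.

```python
import torch

def lag_based_scores(keys, values, lag=128):
    """Score older cached tokens by how far they sit from the recent lag window.

    keys, values: (seq_len, d) cached states for one head. Tokens whose states
    deviate most from the statistics of the most recent `lag` tokens get higher
    keep-scores; unremarkable tokens become eviction candidates.
    """
    ref_k, ref_v = keys[-lag:], values[-lag:]
    k_mu, k_std = ref_k.mean(0), ref_k.std(0) + 1e-6
    v_mu, v_std = ref_v.mean(0), ref_v.std(0) + 1e-6
    # Normalized deviation of each older token from the recent-window statistics.
    k_dev = ((keys[:-lag] - k_mu) / k_std).abs().mean(-1)
    v_dev = ((values[:-lag] - v_mu) / v_std).abs().mean(-1)
    return k_dev + v_dev

keys, values = torch.randn(1024, 64), torch.randn(1024, 64)
scores = lag_based_scores(keys, values)
keep = scores.topk(256).indices     # retain the 256 most distinctive older tokens
```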
Authors:Tomasz Kacprzak, Francois Kamper, Michael W. Heiss, Gianluca Janka, Ann M. Dillner, Satoshi Takahama
Abstract:
Recently, linear regression models incorporating an optimal transport (OT) loss have been explored for applications such as supervised unmixing of spectra, music transcription, and mass spectrometry. However, these task-specific approaches often do not generalize readily to a broader class of linear models. In this work, we propose a novel algorithmic framework for solving a general class of non-negative linear regression models with an entropy-regularized OT datafit term, based on Sinkhorn-like scaling iterations. Our framework accommodates convex penalty functions on the weights (e.g. squared-$\ell_2$ and $\ell_1$ norms), and admits additional convex loss terms between the transported marginal and target distribution (e.g. squared error or total variation). We derive simple multiplicative updates for common penalty and datafit terms. This method is suitable for large-scale problems due to its simplicity of implementation and straightforward parallelization.
中文: 本文提出了一种基于熵正则化最优传输损失的非负线性回归算法框架,通过简单的乘法更新支持多种凸惩罚项和数据拟合项,适用于大规模问题。
English: This paper introduces a scalable algorithmic framework for non-negative linear regression using entropy-regularized optimal transport loss, supporting various convex penalties and datafit terms through simple multiplicative updates.
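For reference, the Sinkhorn-like scaling iterations the framework builds on look as follows for plain entropy-regularized OT between two histograms; the paper generalizes this kind of multiplicative update to non-negative regression with penalties, which is not reproduced here.

```python
import numpy as np

def sinkhorn(a, b, cost, epsilon=0.05, n_iters=500):
    """Entropy-regularized OT between histograms a and b with cost matrix `cost`.

    Returns the transport plan; the alternating u/v scalings are the kind of
    multiplicative updates the regression framework generalizes.
    """
    K = np.exp(-cost / epsilon)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-16)
        u = a / (K @ v + 1e-16)
    return u[:, None] * K * v[None, :]          # plan with marginals close to (a, b)

# Toy usage: move mass between two 1-D histograms on a grid.
grid = np.linspace(0, 1, 50)
a = np.exp(-((grid - 0.3) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((grid - 0.7) ** 2) / 0.01); b /= b.sum()
cost = (grid[:, None] - grid[None, :]) ** 2
plan = sinkhorn(a, b, cost)
```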
Authors:Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu
Abstract:
Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have achieved significant performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) haven't yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, making their exploration in this specific area still a worthwhile pursuit. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code for this paper is released and continually updated at https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.
中文:近期大语言模型在推理任务中采用基于奖励的优化方法如PPO,在人类对齐任务中采用基于偏好的方法如DPO,但两者分别存在奖励破解和性能不足的问题,因此提出TRPA算法,结合规则与偏好优化,在推理任务中实现稳定且具竞争力的性能。
English: Recent advancements in LLMs utilize reward-based methods like PPO for reasoning and preference-based approaches like DPO for alignment, yet both face challenges such as reward hacking and performance gaps, prompting the proposal of the TRPA algorithm that combines rule-based and preference optimization to achieve competitive reasoning results with stability.
Authors:Mete Ahishali, Anis Ur Rahman, Einari Heinaro, Samuli Junttila
Abstract:
Information on standing dead trees is important for understanding forest ecosystem functioning and resilience but has been lacking over large geographic regions. Climate change has caused large-scale tree mortality events that can remain undetected due to limited data. In this study, we propose a novel method for segmenting standing dead trees using aerial multispectral orthoimages. Because access to annotated datasets has been a significant problem in forest remote sensing due to the need for forest expertise, we introduce a method for domain transfer by leveraging domain adaptation to learn a transformation from a source domain X to target domain Y. In this Image-to-Image translation task, we aim to utilize available annotations in the target domain by pre-training a segmentation network. When images from a new study site without annotations are introduced (source domain X), these images are transformed into the target domain. Then, transfer learning is applied by inferring the pre-trained network on domain-adapted images. In addition to investigating the feasibility of current domain adaptation approaches for this objective, we propose a novel approach called the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning. Accordingly, the ADA-Net approach provides new state-of-the-art domain adaptation performance levels outperforming existing approaches. We have evaluated the proposed approach using two datasets from Finland and the US. The USA images are converted to the Finland domain, and we show that the synthetic USA2Finland dataset exhibits similar characteristics to the Finland domain images. The software implementation is shared at https://github.com/meteahishali/ADA-Net. The data is publicly available at https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-standing-dead-tree-segmentation.
中文摘要:本研究提出ADA-Net这一新型域自适应方法,通过图像转换实现跨林区知识迁移,解决了枯立木航拍图像分割中标注数据匮乏的问题,在芬兰和美国数据集上取得了最优性能。
English Summary: This study introduces ADA-Net, a novel domain adaptation method for segmenting standing dead trees from aerial imagery, addressing data scarcity by transferring knowledge between forest regions and achieving state-of-the-art performance on datasets from Finland and the USA.
Authors:Shiguang Sun, Hanbo Zhang, Zeyang Liu, Xinrui Yang, Lipeng Wan, Xingyu Chen, Xuguang Lan
Abstract:
Existing visual model-based reinforcement learning (MBRL) algorithms with observation reconstruction often suffer from information conflicts, making it difficult to learn compact representations and hence resulting in less robust policies, especially in the presence of task-irrelevant visual distractions. In this paper, we first reveal that the information conflicts in current visual MBRL algorithms stem from visual representation learning and latent dynamics modeling, viewed from an information-theoretic perspective. Based on this finding, we present a new algorithm to resolve information conflicts for visual MBRL, named MInCo, which mitigates information conflicts by leveraging negative-free contrastive learning, aiding in learning invariant representation and robust policies despite noisy observations. To prevent the dominance of visual representation learning, we introduce time-varying reweighting to bias the learning towards dynamics modeling as training proceeds. We evaluate our method on several robotic control tasks with dynamic background distractions. Our experiments demonstrate that MInCo learns invariant representations against background noise and consistently outperforms current state-of-the-art visual MBRL methods. Code is available at https://github.com/ShiguangSun/minco.
中文:MInCo算法通过对比学习解决视觉模型强化学习中的信息冲突,学习不变表征和鲁棒策略,在噪声环境下优于现有方法。
English: The MInCo algorithm addresses information conflicts in visual model-based reinforcement learning by using contrastive learning to develop invariant representations and robust policies, outperforming existing methods in noisy environments.
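A minimal sketch of the two ingredients the MInCo abstract names, under explicit assumptions: the negative-free term is written in the BYOL/SimSiam style (negative cosine similarity between paired view embeddings), and the time-varying reweighting is a simple linear schedule that shifts weight from representation learning to dynamics modeling. MInCo's actual loss and schedule may differ.

```python
import numpy as np

def negative_free_alignment(z_a, z_b):
    """Negative cosine similarity between paired embeddings; no negative samples are used."""
    z_a = z_a / np.linalg.norm(z_a, axis=-1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=-1, keepdims=True)
    return -np.mean(np.sum(z_a * z_b, axis=-1))

def reweighted_loss(repr_loss, dynamics_loss, step, total_steps, w_min=0.1):
    """Bias training toward dynamics modeling as training proceeds (illustrative linear schedule)."""
    w_repr = max(w_min, 1.0 - step / total_steps)
    return w_repr * repr_loss + (1.0 - w_repr) * dynamics_loss
```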
Authors:Conghao Xiong, Hao Chen, Joseph J. Y. Sung
Abstract:
Computational pathology, which involves analyzing whole slide images for automated cancer diagnosis, relies on multiple instance learning, where performance depends heavily on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced both the extractor and aggregator, but they lack a systematic analysis framework. In this survey, we present a hierarchical taxonomy organizing PFMs through a top-down philosophy applicable to foundation model analysis in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at https://github.com/BearCleverProud/AwesomeWSI.
中文摘要:本综述提出了一种层次化分类法,通过模型范围、预训练和设计三个维度系统分析病理学基础模型,并划分评估任务类别,指出了模型开发与应用中的关键挑战,为计算病理学发展指明方向。
English Summary: This survey introduces a hierarchical taxonomy for analyzing Pathology Foundation Models (PFMs) by examining model scope, pretraining, and design, while categorizing evaluation tasks and identifying key challenges in development and utilization to advance computational pathology.
Authors:Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
Abstract:
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
Authors:Wenliang Zheng, Sarkar Snigdha Sarathi Das, Yusen Zhang, Rui Zhang
Abstract:
LLMs have gained immense popularity among researchers and the general public for their impressive capabilities on a variety of tasks. Notably, the efficacy of LLMs remains significantly dependent on the quality and structure of the input prompts, making prompt design a critical factor for their performance. Recent advancements in automated prompt optimization have introduced diverse techniques that automatically enhance prompts to better align model outputs with user expectations. However, these methods often suffer from the lack of standardization and compatibility across different techniques, limited flexibility in customization, inconsistent performance across model scales, and they often exclusively rely on expensive proprietary LLM APIs. To fill this gap, we introduce GREATERPROMPT, a novel framework that democratizes prompt optimization by unifying diverse methods under a unified, customizable API while delivering highly effective prompts for different tasks. Our framework flexibly accommodates various model scales by leveraging both text feedback-based optimization for larger LLMs and internal gradient-based optimization for smaller models to achieve powerful and precise prompt improvements. Moreover, we provide a user-friendly Web UI that ensures accessibility for non-expert users, enabling broader adoption and enhanced performance across various user groups and application scenarios. GREATERPROMPT is available at https://github.com/psunlpgroup/GreaterPrompt via GitHub, PyPI, and web user interfaces.
中文摘要:GREATERPROMPT是一个创新框架,通过统一多种自动提示优化方法到可定制的API中,解决了标准化不足和模型兼容性等问题,同时结合基于反馈和梯度的优化方法,为不同任务生成高效提示。
English Summary: GREATERPROMPT is a novel framework that unifies diverse automated prompt optimization techniques under a customizable API, addressing limitations like standardization issues and model compatibility while providing effective prompts for various tasks through both feedback-based and gradient-based methods.
Authors:Rufei Ma, Chao Chen
Abstract:
Remote photoplethysmography (rPPG) technology infers heart rate by capturing subtle color changes in facial skin using a camera, demonstrating great potential in non-contact heart rate measurement. However, measurement accuracy significantly decreases in complex scenarios such as lighting changes and head movements compared to ideal laboratory conditions. Existing deep learning models often neglect the quantification of measurement uncertainty, limiting their credibility in dynamic scenes. To address the issue of insufficient rPPG measurement reliability in complex scenarios, this paper introduces Bayesian neural networks to the rPPG field for the first time, proposing the Robust Fusion Bayesian Physiological Network (RF-BayesPhysNet), which can model both aleatoric and epistemic uncertainty. It leverages variational inference to balance accuracy and computational efficiency. Due to the current lack of uncertainty estimation metrics in the rPPG field, this paper also proposes a new set of methods, using Spearman correlation coefficient, prediction interval coverage, and confidence interval width, to measure the effectiveness of uncertainty estimation methods under different noise conditions. Experiments show that the model, with only double the parameters compared to traditional network models, achieves a MAE of 2.56 on the UBFC-RPPG dataset, surpassing most models. It demonstrates good uncertainty estimation capability in no-noise and low-noise conditions, providing prediction confidence and significantly enhancing robustness in real-world applications. We have open-sourced the code at https://github.com/AIDC-rPPG/RF-Net.
中文: 本文提出RF-BayesPhysNet贝叶斯神经网络,通过建模测量不确定性提升复杂场景下远程心率监测的可靠性,在仅增加一倍参数的情况下于UBFC-RPPG数据集取得优异精度,并开发了新的不确定性评估指标。
English: This paper introduces RF-BayesPhysNet, a Bayesian neural network that models measurement uncertainty to enhance remote heart rate monitoring's reliability in complex scenarios, achieving superior accuracy on the UBFC-RPPG dataset with only doubled parameters.
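The three uncertainty-estimation metrics the RF-BayesPhysNet abstract lists can be computed directly from Monte Carlo prediction samples (e.g., repeated stochastic forward passes of a Bayesian network). The sketch below shows one straightforward implementation; the interval level and array shapes are assumptions, not the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def uncertainty_metrics(mc_preds, y_true, alpha=0.05):
    """mc_preds: (n_mc, n_samples) heart-rate predictions; y_true: (n_samples,) ground truth."""
    mean = mc_preds.mean(axis=0)
    std = mc_preds.std(axis=0)
    lo = np.quantile(mc_preds, alpha / 2, axis=0)
    hi = np.quantile(mc_preds, 1 - alpha / 2, axis=0)

    abs_err = np.abs(mean - y_true)
    rho, _ = spearmanr(std, abs_err)                   # does uncertainty track the error?
    picp = np.mean((y_true >= lo) & (y_true <= hi))    # prediction interval coverage probability
    width = np.mean(hi - lo)                           # average confidence interval width
    return {"spearman": rho, "coverage": picp, "interval_width": width}
```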
Authors:Ved Umrajkar, Aakash Kumar Singh
Abstract:
Tree-Ring Watermarking is a significant technique for authenticating AI-generated images. However, its effectiveness in rectified flow-based models remains unexplored, particularly given the inherent challenges of these models with noise latent inversion. Through extensive experimentation, we evaluated and compared the detection and separability of watermarks between SD 2.1 and FLUX.1-dev models. By analyzing various text guidance configurations and augmentation attacks, we demonstrate how inversion limitations affect both watermark recovery and the statistical separation between watermarked and unwatermarked images. Our findings provide valuable insights into the limitations of Tree-Ring Watermarking in current state-of-the-art models and highlight the critical need for improved inversion methods to achieve reliable watermark detection and separability. The official implementation, dataset release, and all experimental results are available at https://github.com/dsgiitr/flux-watermarking.
Chinese: 树环水印技术在如FLUX.1-dev等整流流模型中因噪声潜在反转的挑战而存在局限性,影响了水印恢复以及水印与非水印图像间的统计可分离性。
English: Tree-Ring Watermarking faces limitations in rectified flow-based models like FLUX.1-dev due to noise latent inversion challenges, affecting both watermark recovery and statistical separability between watermarked and unwatermarked images.
Authors:Ruhui Zhang, Hezhe Qiao, Pengcheng Xu, Mingsheng Shang, Lin Chen
Abstract:
Multi-label Recognition (MLR) involves assigning multiple labels to each data instance in an image, offering advantages over single-label classification in complex scenarios. However, it faces the challenge of annotating all relevant categories, often leading to uncertain annotations, such as unseen or incomplete labels. Recent Vision and Language Pre-training (VLP) based methods have made significant progress in tackling zero-shot MLR tasks by leveraging rich vision-language correlations. However, the correlation between multi-label semantics has not been fully explored, and the learned visual features often lack essential semantic information. To overcome these limitations, we introduce a Semantic-guided Representation Learning approach (SigRL) that enables the model to learn effective visual and textual representations, thereby improving the downstream alignment of visual images and categories. Specifically, we first introduce a graph-based multi-label correlation module (GMC) to facilitate information exchange between labels, enriching the semantic representation across the multi-label texts. Next, we propose a Semantic Visual Feature Reconstruction module (SVFR) to enhance the semantic information in the visual representation by integrating the learned textual representation during reconstruction. Finally, we optimize the image-text matching capability of the VLP model using both local and global features to achieve zero-shot MLR. Comprehensive experiments are conducted on several MLR benchmarks, encompassing both zero-shot MLR (with unseen labels) and single positive multi-label learning (with limited labels), demonstrating the superior performance of our approach compared to state-of-the-art methods. The code is available at https://github.com/MVL-Lab/SigRL.
中文: 提出的语义引导表示学习方法通过建模标签间关联并利用语义重构视觉特征,在零样本和有限标签的多标签识别任务中表现出卓越性能。
English: The proposed Semantic-guided Representation Learning (SigRL) method enhances multi-label recognition by modeling label correlations and reconstructing visual features with semantic guidance, achieving superior performance in zero-shot and limited-label scenarios.
Authors:Wei Huang, Qinying Gu, Nanyang Ye
Abstract:
Offline reinforcement learning (RL) enables policy training solely on pre-collected data, avoiding direct environment interaction - a crucial benefit for energy-constrained embodied AI applications. Although Artificial Neural Networks (ANN)-based methods perform well in offline RL, their high computational and energy demands motivate exploration of more efficient alternatives. Spiking Neural Networks (SNNs) show promise for such tasks, given their low power consumption. In this work, we introduce DSFormer, the first spike-driven transformer model designed to tackle offline RL via sequence modeling. Unlike existing SNN transformers focused on spatial dimensions for vision tasks, we develop Temporal Spiking Self-Attention (TSSA) and Positional Spiking Self-Attention (PSSA) in DSFormer to capture the temporal and positional dependencies essential for sequence modeling in RL. Additionally, we propose Progressive Threshold-dependent Batch Normalization (PTBN), which combines the benefits of LayerNorm and BatchNorm to preserve temporal dependencies while maintaining the spiking nature of SNNs. Comprehensive results in the D4RL benchmark show DSFormer's superiority over both SNN and ANN counterparts, achieving 78.4% energy savings, highlighting DSFormer's advantages not only in energy efficiency but also in competitive performance. Code and models are public at https://wei-nijuan.github.io/DecisionSpikeFormer.
Authors:Qian Chen, Xingjian Dong, Zhike Peng, Guang Meng
Abstract:
Despite significant progress in intelligent fault diagnosis (IFD), the lack of interpretability remains a critical barrier to practical industrial applications, driving the growth of interpretability research in IFD. Post-hoc interpretability has gained popularity due to its ability to preserve network flexibility and scalability without modifying model structures. However, these methods often yield suboptimal time-domain explanations. Recently, combining domain transform with SHAP has improved interpretability by extending explanations to more informative domains. Nonetheless, the computational expense of SHAP, exacerbated by increased dimensions from domain transforms, remains a major challenge. To address this, we propose patch-wise attribution and SHapley Estimated Explanation (SHEP). Patch-wise attribution reduces feature dimensions at the cost of explanation granularity, while SHEP simplifies subset enumeration to approximate SHAP, reducing complexity from exponential to linear. Together, these methods significantly enhance SHAP's computational efficiency, providing feasibility for real-time interpretation in monitoring tasks. Extensive experiments confirm SHEP's efficiency, interpretability, and reliability in approximating SHAP. Additionally, with open-source code, SHEP has the potential to serve as a benchmark for post-hoc interpretability in IFD. The code is available on https://github.com/ChenQian0618/SHEP.
中文摘要:提出的分块归因和沙普利估计解释(SHEP)方法在保持智能故障诊断可解释性的同时显著提升了计算效率,并通过开源代码为实际应用提供了可行性。
English summary: The proposed patch-wise attribution and SHapley Estimated Explanation (SHEP) method significantly enhances computational efficiency while maintaining interpretability for intelligent fault diagnosis, with open-source code available for practical implementation.
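To make the complexity claim in the SHEP abstract concrete, the sketch below shows the two ideas in their simplest form: features are grouped into patches to shrink the attribution dimension, and the exponential subset enumeration of exact SHAP is replaced by a linear number of model calls. The occlusion-style marginal-contribution estimator here is only an illustration of a linear-cost scheme, not SHEP's actual estimator.

```python
import numpy as np

def patchwise_linear_attribution(model, x, baseline, patch_size):
    """model: callable scoring a 1-D signal; returns one attribution score per patch."""
    n_patches = len(x) // patch_size
    scores = np.zeros(n_patches)
    f_base = model(baseline)
    for i in range(n_patches):                           # linear in the number of patches
        sl = slice(i * patch_size, (i + 1) * patch_size)
        x_i = baseline.copy()
        x_i[sl] = x[sl]                                  # reveal only patch i
        scores[i] = model(x_i) - f_base                  # its marginal contribution vs. the baseline
    return scores
```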
Authors:Brandon Radosevich, John Halloran
Abstract:
To reduce development overhead and enable seamless integration between potential components comprising any given generative AI application, the Model Context Protocol (MCP) (Anthropic, 2024) has recently been released and subsequently widely adopted. The MCP is an open protocol that standardizes API calls to large language models (LLMs), data sources, and agentic tools. By connecting multiple MCP servers, each defined with a set of tools, resources, and prompts, users are able to define automated workflows fully driven by LLMs. However, we show that the current MCP design carries a wide range of security risks for end users. In particular, we demonstrate that industry-leading LLMs may be coerced into using MCP tools to compromise an AI developer's system through various attacks, such as malicious code execution, remote access control, and credential theft. To proactively mitigate these and related attacks, we introduce a safety auditing tool, MCPSafetyScanner, the first agentic tool to assess the security of an arbitrary MCP server. MCPSafetyScanner uses several agents to (a) automatically determine adversarial samples given an MCP server's tools and resources; (b) search for related vulnerabilities and remediations based on those samples; and (c) generate a security report detailing all findings. Our work highlights serious security issues with general-purpose agentic workflows while also providing a proactive tool to audit MCP server safety and address detected vulnerabilities before deployment.
The described MCP server auditing tool, MCPSafetyScanner, is freely available at: https://github.com/johnhalloran321/mcpSafetyScanner
模型上下文协议(MCP)虽能标准化生成式AI组件集成,却存在恶意代码执行等安全隐患,为此开发的MCPSafetyScanner可在部署前主动检测服务器漏洞。
The Model Context Protocol (MCP) standardizes generative AI components but introduces security risks like malicious code execution, prompting the development of MCPSafetyScanner to audit server vulnerabilities before deployment.
Authors:Muyun Jiang, Yi Ding, Wei Zhang, Kok Ann Colin Teo, LaiGuan Fong, Shuailei Zhang, Zhiwei Guo, Chenyu Liu, Raghavan Bhuvanakantham, Wei Khang Jeremy Sim, Chuan Huat Vince Foo, Rong Hui Jonathan Chua, Parasuraman Padmanabhan, Victoria Leong, Jia Lu, Balazs Gulyas, Cuntai Guan
Abstract:
Covert speech involves imagining speaking without audible sound or any movements. Decoding covert speech from electroencephalogram (EEG) is challenging due to a limited understanding of neural pronunciation mapping and the low signal-to-noise ratio of the signal. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each performing covert and overt speech tasks by repeating the same word in five utterances within a ten-second duration. Given the spatio-temporal nature of the neural activation process during speech pronunciation, we developed a Functional Areas Spatio-temporal Transformer (FAST), an effective framework for converting EEG signals into tokens and utilizing transformer architecture for sequence encoding. Our results reveal distinct and interpretable speech neural features by the visualization of FAST-generated activation maps across frontal and temporal brain regions with each word being covertly spoken, providing new insights into the discriminative features of the neural representation of covert speech. This is the first report of such a study, which provides interpretable evidence for speech decoding from EEG. The code for this work has been made public at https://github.com/Jiang-Muyun/FAST
中文摘要:本研究开发了功能区域时空转换器(FAST)框架,成功从脑电信号中解码隐性言语,揭示了前额叶和颞叶脑区的独特神经激活模式,为基于脑电的言语解码提供了首个可解释证据。
English Summary: This study introduces a Functional Areas Spatio-temporal Transformer (FAST) framework that successfully decodes covert speech from EEG signals, revealing distinct neural activation patterns in frontal and temporal brain regions and providing the first interpretable evidence for EEG-based speech decoding.
Authors:Shijie Ma, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu
Abstract:
Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, enabling unified modeling between old and new classes. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets. The code is available at https://github.com/mashijie1028/ProtoGCD.
中文:ProtoGCD提出了一种统一的原型学习框架,通过联合建模新旧类别并采用自适应伪标记与正则化机制,解决了广义类别发现中的不平衡问题,在多个数据集上实现了最优性能。
English: ProtoGCD introduces a unified prototype learning framework that addresses imbalances in generalized category discovery by jointly modeling old and new classes with adaptive pseudo-labeling and regularization, achieving state-of-the-art performance across datasets.
Authors:Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang
Abstract:
The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, including diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies that commonly happen in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLM through evaluation results, we hope TDBench to provide insights for motivating future research. Project homepage: https://github.com/Columbia-ICSL/TDBench
Chinese Summary: 本研究提出TDBench基准测试,通过创新的评估框架和案例研究,针对俯视图像评估视觉语言模型,解决其旋转不变性和可靠性等被忽视的问题。
English Summary: The study introduces TDBench, a benchmark for evaluating Vision Language Models on top-down images, addressing their overlooked rotational invariance and reliability issues through a novel evaluation framework and case studies.
Authors:Teodor Chiaburu, Felix Bießmann, Frank Haußer
Abstract:
Understanding uncertainty in Explainable AI (XAI) is crucial for building trust and ensuring reliable decision-making in Machine Learning models. This paper introduces a unified framework for quantifying and interpreting Uncertainty in XAI by defining a general explanation function $e_\theta(x, f)$ that captures the propagation of uncertainty from key sources: perturbations in input data and model parameters. By using both analytical and empirical estimates of explanation variance, we provide a systematic means of assessing the impact of uncertainty on explanations. We illustrate the approach using a first-order uncertainty propagation as the analytical estimator. In a comprehensive evaluation across heterogeneous datasets, we compare analytical and empirical estimates of uncertainty propagation and evaluate their robustness. Extending previous work on inconsistencies in explanations, our experiments identify XAI methods that do not reliably capture and propagate uncertainty. Our findings underscore the importance of uncertainty-aware explanations in high-stakes applications and offer new insights into the limitations of current XAI methods. The code for the experiments can be found in our repository at https://github.com/TeodorChiaburu/UXAI
本文提出了一个统一框架,通过分析输入数据和模型参数变化对解释的影响来量化可解释人工智能中的不确定性,揭示了现有方法的局限性,并强调了在关键应用中采用不确定性感知方法的重要性。
This paper presents a unified framework for quantifying uncertainty in Explainable AI by analyzing how input data and model parameter variations affect explanations, revealing limitations in current methods and emphasizing the need for uncertainty-aware approaches in critical applications.
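For orientation, the first-order (delta-method) propagation that the abstract names as its analytical estimator has the standard form below; the covariance notation is ours, and the input and parameter perturbations are assumed independent here:

$$\operatorname{Cov}\big[e_\theta(x, f)\big] \;\approx\; J_x \Sigma_x J_x^{\top} + J_\theta \Sigma_\theta J_\theta^{\top}, \qquad J_x = \frac{\partial e_\theta(x, f)}{\partial x}, \quad J_\theta = \frac{\partial e_\theta(x, f)}{\partial \theta},$$

where $\Sigma_x$ and $\Sigma_\theta$ are the covariances of the input and parameter perturbations; an empirical estimate of the same quantity can be obtained by sampling those perturbations directly and measuring the variance of the resulting explanations.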
Authors:Lihui Liu, Zihao Wang, Dawei Zhou, Ruijie Wang, Yuchen Yan, Bo Xiong, Sihong He, Kai Shu, Hanghang Tong
Abstract:
Knowledge graphs (KGs) are ubiquitous and widely used in various applications. However, most real-world knowledge graphs are incomplete, which significantly degrades their performance on downstream tasks. Additionally, the relationships in real-world knowledge graphs often follow a long-tail distribution, meaning that most relations are represented by only a few training triplets. To address these challenges, few-shot learning has been introduced. Few-shot KG completion aims to make accurate predictions for triplets involving novel relations when only a limited number of training triplets are available. Although many methods have been proposed, they typically learn each relation individually, overlooking the correlations between different tasks and the relevant information in previously trained tasks. In this paper, we propose a transfer learning-based few-shot KG completion method (TransNet). By learning the relationships between different tasks, TransNet effectively transfers knowledge from similar tasks to improve the current task's performance. Furthermore, by employing meta-learning, TransNet can generalize effectively to new, unseen relations. Extensive experiments on benchmark datasets demonstrate the superiority of TransNet over state-of-the-art methods. Code can be found at https://github.com/lihuiliullh/TransNet/tree/main
中文: 本文提出TransNet方法,通过迁移学习和元学习技术挖掘任务间关联性,有效提升少样本知识图谱补全任务中对新关系的预测性能,实验证明其优于现有先进方法。
English: This paper introduces TransNet, a transfer learning-based method for few-shot knowledge graph completion that leverages task relationships and meta-learning to enhance performance on novel relations, demonstrating superior results over existing approaches.
Authors:Yongyi Yang, Jianyang Gao, Wei Hu
Abstract:
Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA's efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA .
中文:RaanA提出了一种统一的PTQ框架,结合RaBitQ-H实现高效量化和AllocateBits优化比特分配,以极少数据和灵活比特实现优异性能。
English: RaanA introduces a unified PTQ framework with RaBitQ-H for efficient quantization and AllocateBits for optimal bit-width allocation, achieving high performance with minimal data and flexible bit usage.
Authors:Yuzhu Lei, Guanding Yu
Abstract:
Lithium-ion battery health management has become increasingly important as the application of batteries expands. Precise forecasting of capacity degradation is critical for ensuring the healthy usage of batteries. In this paper, we innovatively propose MSPMLP, a multi-scale capacity prediction model utilizing the mixture of experts (MoE) architecture and patch-based multi-layer perceptron (MLP) blocks, to capture both the long-term degradation trend and local capacity regeneration phenomena. Specifically, we utilize patch-based MLP blocks with varying patch sizes to extract multi-scale features from the capacity sequence. Leveraging the MoE architecture, the model adaptively integrates the extracted features, thereby enhancing its capacity and expressiveness. Finally, the future battery capacity is predicted based on the integrated features, achieving high prediction accuracy and generalization. Experimental results on the public NASA dataset indicate that MSPMLP achieves a mean absolute error (MAE) of 0.0078, improving by 41.8% compared to existing methods. These findings highlight that MSPMLP, owing to its multi-scale modeling capability and generalizability, provides a promising solution to the battery capacity prediction challenges caused by capacity regeneration phenomena and complex usage conditions. The code of this work is provided at https://github.com/LeiYuzhu/CapacityPredict.
中文摘要:本文创新提出MSPMLP多尺度容量预测模型,通过混合专家架构和基于补丁的多层感知器模块,有效捕捉电池长期退化趋势和局部容量再生现象,在NASA数据集上实现预测精度41.8%的提升。
English Summary: This paper introduces MSPMLP, an innovative multi-scale battery capacity prediction model that combines mixture of experts architecture with patch-based MLP blocks to accurately forecast both long-term degradation trends and local capacity regeneration, achieving a 41.8% improvement in prediction accuracy on the NASA dataset.
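A hedged PyTorch sketch of the architectural pattern the MSPMLP abstract describes: patch-based MLP experts with different patch sizes over the capacity sequence, fused by a mixture-of-experts gate and projected to a forecast. Layer widths, the gating form, and the forecasting head are illustrative guesses, not MSPMLP's actual configuration.

```python
import torch
import torch.nn as nn

class PatchMLPExpert(nn.Module):
    def __init__(self, seq_len, patch_size, d_model=64):
        super().__init__()
        self.patch_size = patch_size
        self.n_patches = seq_len // patch_size
        self.patch_proj = nn.Linear(patch_size, d_model)     # embed each patch
        self.mix = nn.Sequential(                            # MLP mixing along the patch axis
            nn.Linear(self.n_patches, self.n_patches), nn.GELU(),
            nn.Linear(self.n_patches, 1),
        )

    def forward(self, x):                                    # x: (batch, seq_len)
        p = x[:, : self.n_patches * self.patch_size]
        p = p.view(x.size(0), self.n_patches, self.patch_size)
        h = self.patch_proj(p)                               # (batch, n_patches, d_model)
        return self.mix(h.transpose(1, 2)).squeeze(-1)       # (batch, d_model)

class MultiScaleMoE(nn.Module):
    def __init__(self, seq_len, horizon, patch_sizes=(4, 8, 16), d_model=64):
        super().__init__()
        self.experts = nn.ModuleList(
            [PatchMLPExpert(seq_len, ps, d_model) for ps in patch_sizes])
        self.gate = nn.Linear(seq_len, len(patch_sizes))     # input-dependent gating
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x):                                    # x: (batch, seq_len)
        feats = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, d_model)
        w = torch.softmax(self.gate(x), dim=-1).unsqueeze(-1)      # (batch, n_experts, 1)
        return self.head((w * feats).sum(dim=1))             # (batch, horizon)

# e.g. MultiScaleMoE(seq_len=64, horizon=1)(torch.randn(8, 64)) -> tensor of shape (8, 1)
```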
Authors:Zae Myung Kim, Anand Ramachandran, Farideh Tavazoee, Joo-Kyung Kim, Oleg Rokhlenko, Dongyeop Kang
Abstract:
Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared at https://github.com/minnesotanlp/struct_align.
中文:结构对齐方法通过强化学习将大型语言模型与人类话语结构对齐,利用细粒度奖励提升文本连贯性,在文章生成和长文档摘要等任务中优于现有模型。
English: Structural Alignment enhances long-form text generation in LLMs by aligning them with human discourse structures through reinforcement learning, using token-level rewards to improve coherence and outperforming existing models in tasks like essay writing and summarization.
Authors:Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Abstract:
In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
中文摘要:SynWorld框架通过合成多步动作场景和执行蒙特卡洛树搜索探索,使基于大语言模型的智能体能够自主优化工作流程并增强动作理解,实验证明其在新环境中学习动作知识的有效性。
English Summary: SynWorld is a framework that enables LLM-based agents to autonomously explore novel environments by synthesizing scenarios and using Monte Carlo Tree Search to refine their action knowledge, proving effective in experimental evaluations.
Authors:Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Abstract:
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.
中文摘要:提出的KnowSelf范式通过情境感知使基于大语言模型的智能体能够自主调控知识运用,在不同任务中以最少的外部知识实现更优的规划效果。
English Summary: The proposed KnowSelf paradigm enables LLM-based agents to autonomously regulate knowledge usage through situational awareness, achieving superior planning performance with minimal external knowledge across various tasks.
Authors:Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
Abstract:
Multilingual speech translation (ST) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Chinese (in both Traditional and Simplified forms), together with the models. With 290,000 samples, our dataset is the largest medical machine translation (MT) dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most extensive analysis study in ST research to date, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence (seq2seq) comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST.
中文摘要:本研究推出了首个医疗领域大规模多语言语音翻译数据集MultiMed-ST,包含29万条五语言样本,并通过全面对比分析推动跨语言医疗交流的发展。
English Summary: This study introduces MultiMed-ST, the first large-scale multilingual speech translation dataset for the medical domain, featuring 290,000 samples across five languages and comprehensive comparative analyses to advance cross-lingual healthcare communication.
Authors:Makoto Takamoto, Daniel Oñoro-Rubio, Wiem Ben Rim, Takashi Maruyama, Bhushan Kotnis
Abstract:
Knowledge graph embedding (KGE) models encode the structural information of knowledge graphs to predict new links. Effective training of these models requires distinguishing between positive and negative samples with high precision. Although prior research has shown that improving the quality of negative samples can significantly enhance model accuracy, identifying high-quality negative samples remains a challenging problem. This paper theoretically investigates the condition under which negative samples lead to optimal KG embedding and identifies a sufficient condition for an effective negative sample distribution. Based on this theoretical foundation, we propose Embedding MUtation (EMU), a novel framework that generates negative samples satisfying this condition, in contrast to conventional methods that focus on identifying challenging negative samples within the training data. Importantly, the simplicity of EMU ensures seamless integration with existing KGE models and negative sampling methods. To evaluate its efficacy, we conducted comprehensive experiments across multiple datasets. The results consistently demonstrate significant improvements in link prediction performance across various KGE models and negative sampling methods. Notably, EMU enables performance improvements comparable to those achieved by models with embedding dimension five times larger. An implementation of the method and experiments are available at https://github.com/nec-research/EMU-KG.
中文: 本文提出EMU框架,基于理论条件生成知识图谱嵌入模型的最优负样本,显著提升了多种模型和数据集上的链接预测性能。
English: This paper introduces EMU, a framework that generates optimal negative samples for knowledge graph embedding models based on a theoretical condition, significantly improving link prediction performance across various models and datasets.
Authors:Guido Barducci, Ivan Rossi, Francesco Codicè, Cesare Rollo, Valeria Repetto, Corrado Pancotti, Virginia Iannibelli, Tiziana Sanavia, Piero Fariselli
Abstract:
Understanding how residue variations affect protein stability is crucial for designing functional proteins and deciphering the molecular mechanisms underlying disease-related mutations. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis, enabling, among other things, more accurate predictions of mutational effects. In this work, we introduce JanusDDG, a deep learning framework that leverages PLM-derived embeddings and a bidirectional cross-attention transformer architecture to predict $\Delta\Delta G$ of single- and multiple-residue mutations while simultaneously being constrained to respect fundamental thermodynamic properties, such as antisymmetry and transitivity. Unlike conventional self-attention, JanusDDG computes queries (Q) and values (V) as the difference between wild-type and mutant embeddings, while keys (K) alternate between the two. This cross-interleaved attention mechanism enables the model to capture mutation-induced perturbations while preserving essential contextual information. Experimental results show that JanusDDG achieves state-of-the-art performance in predicting $\Delta\Delta G$ from sequence alone, matching or exceeding the accuracy of structure-based methods for both single and multiple mutations. Code Availability: https://github.com/compbiomed-unito/JanusDDG
Chinese: JanusDDG是一种深度学习框架,利用蛋白质语言模型和双向交叉注意力变换器,在遵循热力学原理的同时准确预测单点和多点突变的ΔΔG,仅凭序列数据即达到最先进的性能水平。
English: JanusDDG is a deep learning framework that utilizes protein language models and a bidirectional cross-attention transformer to accurately predict ΔΔG for single and multiple mutations while adhering to thermodynamic principles, achieving state-of-the-art performance from sequence data alone.
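The JanusDDG abstract's description of its cross-interleaved attention is concrete enough to sketch a single step: queries and values come from the wild-type minus mutant embedding difference, while keys alternate between the two sequences (here, by layer parity). The single-head form, projection shapes, and the exact alternation rule are assumptions, not the paper's verified design.

```python
import torch

def cross_interleaved_attention(e_wt, e_mut, wq, wk, wv, layer_idx):
    """e_wt, e_mut: (batch, seq_len, d) embeddings; wq/wk/wv: (d, d) projection matrices."""
    diff = e_wt - e_mut                                  # mutation-induced perturbation signal
    q = diff @ wq                                        # queries from the WT-mutant difference
    v = diff @ wv                                        # values from the WT-mutant difference
    k_src = e_wt if layer_idx % 2 == 0 else e_mut        # keys alternate between WT and mutant
    k = k_src @ wk
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5), dim=-1)
    return attn @ v                                      # (batch, seq_len, d)
```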
Authors:Lifan Hu
Abstract:
This work investigates the inverse problem of generator recovery in matrix Lie groups from discretized trajectories. Let $G$ be a real matrix Lie group and $\mathfrak{g} = \text{Lie}(G)$ its corresponding Lie algebra. A smooth trajectory $\gamma(t)$ generated by a fixed Lie algebra element $\xi \in \mathfrak{g}$ follows the exponential flow $\gamma(t) = g_0 \cdot \exp(t\xi)$. The central task addressed in this work is the reconstruction of such a latent generator $\xi$ from a discretized sequence of poses $\{g_0, g_1, \dots, g_T\} \subset G$, sampled at uniform time intervals. This problem is formulated as a data-driven regression from normalized sequences of discrete Lie algebra increments $\log\left(g_{t}^{-1} g_{t+1}\right)$ to the constant generator $\xi \in \mathfrak{g}$. A feedforward neural network is trained to learn this mapping across several groups, including $\text{SE}(2)$, $\text{SE}(3)$, $\text{SO}(3)$, and $\text{SL}(2,\mathbb{R})$. It demonstrates strong empirical accuracy under both clean and noisy conditions, which validates the viability of data-driven recovery of Lie group generators using shallow neural architectures. The code is available in the Lie-RL GitHub repository: https://github.com/Anormalm/LieRL-on-Trajectories. Suggestions and collaborations are welcome.
本研究利用神经网络开发了一种数据驱动方法,能够从矩阵李群的离散化轨迹中准确恢复潜在生成器,并在不同条件下展现出鲁棒性能。
This study develops a data-driven approach using a neural network to accurately recover the latent generator from discretized trajectories in matrix Lie groups, demonstrating robust performance under various conditions.
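As a concrete companion to the problem statement above, the sketch below simulates a clean $\text{SO}(3)$ trajectory and recovers the constant generator by averaging the discrete increments $\log(g_t^{-1} g_{t+1})$. This classical averaging baseline is what the paper's feedforward network replaces (and makes robust to noise); the group, step size, and initial pose here are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import expm, logm

def hat_so3(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

# Ground-truth generator xi in so(3) and a uniformly sampled trajectory on SO(3).
xi = hat_so3([0.3, -0.1, 0.2])
dt, T = 0.1, 50
g0 = expm(hat_so3([0.05, 0.4, -0.2]))                      # arbitrary initial pose
poses = [g0 @ expm(t * dt * xi) for t in range(T + 1)]

# Discrete Lie-algebra increments log(g_t^{-1} g_{t+1}); for a constant generator each
# increment equals dt * xi, so averaging and dividing by dt recovers xi on clean data.
increments = [logm(np.linalg.inv(poses[t]) @ poses[t + 1]).real for t in range(T)]
xi_hat = np.mean(increments, axis=0) / dt

print(np.max(np.abs(xi_hat - xi)))                         # numerically ~0 on clean data
```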
Authors:Thomas Daniel, Malgorzata Olejniczak, Julien Tierny
Abstract:
This application paper investigates the stability of hydrogen bonds (H-bonds), as characterized by the Quantum Theory of Atoms in Molecules (QTAIM). First, we contribute a database of 4544 electron densities associated to four isomers of water hexamers (the so-called Ring, Book, Cage and Prism), generated by distorting their equilibrium geometry under various structural perturbations, modeling the natural dynamic behavior of molecular systems. Second, we present a new stability measure, called bond occurrence rate, associating each bond path present at equilibrium with its rate of occurrence within the input ensemble. We also provide an algorithm, called BondMatcher, for its automatic computation, based on a tailored, geometry-aware partial isomorphism estimation between the extremum graphs of the considered electron densities. Our new stability measure allows for the automatic identification of densities lacking H-bond paths, enabling further visual inspections. Specifically, the topological analysis enabled by our framework corroborates experimental observations and provides refined geometrical criteria for characterizing the disappearance of H-bond paths. Our electron density database and our C++ implementation are available at this address: https://github.com/thom-dani/BondMatcher.
中文摘要:本研究通过量子理论中的原子分子方法,提出键出现率指标及BondMatcher算法分析水六聚体中氢键稳定性,其拓扑分析验证了实验观测结果并为氢键路径消失提供了精确的几何判定标准。
English Summary: This study introduces a bond occurrence rate measure and BondMatcher algorithm to analyze hydrogen bond stability in water hexamers using QTAIM, with topological analysis confirming experimental observations and providing refined criteria for bond path disappearance.
Authors:Zihan Gu, Ruoyu Chen, Hua Zhang, Yue Hu, Xiaochun Cao
Abstract:
Grokking, referring to the abrupt improvement in test accuracy after extended overfitting, offers valuable insights into the mechanisms of model generalization. Existing research based on progress measures implies that grokking relies on understanding the optimization dynamics when the loss function is dominated solely by the weight decay term. However, we find that this optimization merely leads to token uniformity, which is not a sufficient condition for grokking. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations. Based on theoretical analysis and experimental validation, we present the following insights: (i) The weight decay term encourages uniformity across all tokens in the embedding space when it is minimized. (ii) The occurrence of grokking is jointly determined by the uniformity of the embedding space and the distribution of the training dataset. Building on these insights, we provide a unified perspective for understanding various previously proposed progress measures and introduce a novel, concise, and effective progress measure that could trace the changes in test loss more accurately. Finally, to demonstrate the versatility of our theoretical framework, we design a dedicated dataset to validate our theory on ResNet-18, successfully showcasing the occurrence of grokking. The code is released at https://github.com/Qihuai27/Grokking-Insight.
Chinese: 本研究揭示了Transformer在素数运算任务中顿悟现象的产生源于嵌入空间均匀性与训练数据分布的协同作用,并提出了新的进展度量指标,同时在ResNet-18上验证了理论框架的普适性。
English: This study reveals that grokking in Transformers during prime number operations arises from the combined effect of embedding space uniformity and training data distribution, leading to a new progress measure and validation on ResNet-18.
Authors:Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, Pengfei Liu
Abstract:
Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures while overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.
中文: DeepResearcher是一种通过真实网络环境中的强化学习来训练大语言模型成为深度研究代理的创新框架,其性能显著超越现有方法,并展现出规划验证、自我反思等新兴认知能力。
English: DeepResearcher is a novel framework that trains large language models as deep research agents through reinforcement learning in real-world web environments, significantly outperforming existing methods and demonstrating emergent cognitive behaviors for robust information gathering.
Authors:Haozhan Tang, Tianyi Zhang, Oliver Kroemer, Matthew Johnson-Roberson, Weiming Zhi
Abstract:
Robots operating in unstructured environments often require accurate and consistent object-level representations. This typically requires segmenting individual objects from the robot's surroundings. While recent large models such as Segment Anything (SAM) offer strong performance in 2D image segmentation, these advances do not translate directly to the physical 3D world, where they often over-segment objects and fail to produce consistent mask correspondences across views. In this paper, we present GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images of the environment without any depth information. GraphSeg constructs dual correspondence graphs: one from 2D pixel-level similarities and one from inferred 3D structure. We formulate segmentation as a problem of edge addition, then subsequent graph contraction, which merges multiple 2D masks into unified object-level segmentations. We can then leverage 3D foundation models to produce segmented 3D representations. GraphSeg achieves robust segmentation with significantly fewer images and greater accuracy than prior methods. We demonstrate state-of-the-art performance on tabletop scenes and show that GraphSeg enables improved performance on downstream robotic manipulation tasks. Code available at https://github.com/tomtang502/graphseg.git.
中文: GraphSeg框架通过构建双重对应图并利用3D基础模型,仅从稀疏的2D图像即可生成一致的3D物体分割,在机器人操作任务中实现了超越现有方法的准确性和性能表现。
English: GraphSeg is a framework that generates consistent 3D object segmentations from sparse 2D images without depth information by constructing dual correspondence graphs and leveraging 3D foundation models, achieving superior accuracy and performance in robotic manipulation tasks.
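The "edge addition then graph contraction" step in the GraphSeg abstract can be pictured with a few lines of union-find: every 2D mask (from any view) is a node, an edge is added whenever two masks are judged to correspond, and contracting the graph turns each connected component into one object-level segment. The correspondence test itself, which GraphSeg derives from 2D similarity and inferred 3D structure, is abstracted away here.

```python
def contract_mask_graph(n_masks, correspondence_edges):
    """Union-find over mask indices; returns a list mapping each mask to an object id."""
    parent = list(range(n_masks))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i

    for a, b in correspondence_edges:            # edge addition
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                      # contraction: merge the two components

    roots = [find(i) for i in range(n_masks)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

# e.g. 5 masks across two views where masks (0, 2) and (2, 4) correspond:
# contract_mask_graph(5, [(0, 2), (2, 4)]) -> [0, 1, 0, 2, 0]
```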
Authors:Rohit Agarwal, Aryan Dessai, Arif Ahmed Sekh, Krishna Agarwal, Alexander Horsch, Dilip K. Prasad
Abstract:
The setting of varying feature spaces in online learning, also known as haphazard inputs, has become prominent due to its applicability in many domains. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based models to be applicable for haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at https://github.com/Rohit102497/HaphazardInputsAsImages.
中文: 本文提出了一种与模型无关的方法,将变化特征空间实时转换为固定维度的图像表示,使得ResNet和ViT等先进视觉模型能够处理随机输入,并在四个数据集上验证了其有效性。
English: This paper introduces a model-agnostic method that converts varying feature spaces into fixed-dimension image representations, enabling the use of advanced vision models like ResNet and ViT for haphazard inputs and demonstrating effectiveness across four datasets.
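One simple way to realize the transformation described above is to reserve a fixed cell in an image grid for every feature name and render whichever features arrive at a given step as pixel intensities, leaving absent features blank. The grid layout and normalization below are illustrative assumptions, not necessarily the paper's exact encoding.

```python
import numpy as np

def features_to_image(features, feature_slots, grid=(16, 16)):
    """features: dict name -> value (keys vary per step); feature_slots: name -> fixed cell index."""
    img = np.zeros(grid, dtype=np.float32)       # missing features simply stay at 0
    for name, value in features.items():
        idx = feature_slots[name]
        img[divmod(idx, grid[1])] = value        # place the value in its reserved cell
    return img                                   # fixed-size input usable by ResNet/ViT

slots = {"temp": 0, "pressure": 1, "humidity": 2}
print(features_to_image({"temp": 0.7, "humidity": 0.2}, slots)[0, :3])  # [0.7 0.  0.2]
```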
Authors:Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
Abstract:
Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
中文摘要:后训练通过调整和扩展知识表征而不改变事实存储位置来增强大语言模型,同时揭示了有助于模型引导与可解释性的差异化真实性表达与拒绝机制。
English Summary: Post-training enhances large language models by adapting and developing knowledge representations without altering factual storage locations, while revealing distinct truthfulness and refusal mechanisms that aid in model steering and interpretability.
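For readers unfamiliar with direction-based analysis of the kind described above, the sketch below shows the common difference-of-means recipe for extracting a "truthfulness direction" from contrasting hidden states and applying it as an additive intervention. This is a standard technique consistent with the abstract's framing; the paper's exact probing and transfer procedure may differ.

```python
import numpy as np

def direction_from_contrast(h_truthful, h_untruthful):
    """Each input: (n_examples, d_hidden) hidden states from a chosen layer."""
    v = h_truthful.mean(axis=0) - h_untruthful.mean(axis=0)
    return v / np.linalg.norm(v)                 # unit-norm direction in hidden space

def steer(hidden_state, direction, alpha=5.0):
    """Add the direction to a hidden state during the forward pass to intervene on behavior."""
    return hidden_state + alpha * direction
```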
Authors:Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
Abstract:
Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-72B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding, which has wide applications in robotics. Project page with our video, code, and dataset: https://irvlutd.github.io/MultiGrounding
Authors:Sudong Wang, Yunjian Zhang, Yao Zhu, Jianing Li, Zizhe Wang, Yanwei Liu, Xiangyang Ji
Abstract:
Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we seek to investigate how multimodal knowledge evolves and eventually induces natural languages in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge from three levels, including single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution: the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms. Our codes are available at https://github.com/XIAO4579/Vlm-interpretability.
中文摘要:本研究揭示了大型视觉语言模型中多模态知识的演化轨迹,通过识别关键层和突变层将演化过程划分为三个阶段,为理解其内在机制提供了新视角。
English Summary: This study investigates the evolution of multimodal knowledge in Large Vision-Language Models, identifying critical and mutation layers that divide the process into three distinct stages and offering new insights into their internal mechanisms.
Authors:Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata
Abstract:
Given that interpretability and steerability are crucial to AI safety, Sparse Autoencoders (SAEs) have emerged as a tool to enhance them in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in vision representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Notably, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code is available at https://github.com/ExplainableML/sae-for-vlm.
Chinese: 稀疏自编码器(SAEs)应用于视觉语言模型(VLMs),通过增强神经元层面的单义性并直接引导模型输出,无需修改底层架构即可提升可解释性和控制能力。
English: Sparse Autoencoders (SAEs) are applied to Vision-Language Models (VLMs) to improve neuron-level monosemanticity and enable direct steering of model outputs, enhancing interpretability and control without modifying the underlying architecture.
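As a rough illustration of the kind of sparse autoencoder described above, here is a minimal PyTorch sketch with an overcomplete (wide) latent and an L1 sparsity penalty, trained on random activations standing in for vision-encoder features. The architecture, expansion factor, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE with an L1 sparsity penalty on the latent code."""

    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_latent = d_model * expansion              # wide latent, as the paper finds helpful
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))                 # non-negative, (hopefully) sparse code
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    recon = F.mse_loss(x_hat, x)                    # reconstruction term
    sparsity = z.abs().mean()                       # L1 sparsity term
    return recon + l1_coeff * sparsity

# Toy training step on random activations standing in for CLIP vision features.
sae = SparseAutoencoder(d_model=768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 768)
opt.zero_grad()
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
loss.backward()
opt.step()
print(float(loss))
```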
Authors:Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta
Abstract:
Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
Authors:Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
Abstract:
Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
中文: TaLoS提出了一种构建稀疏任务向量的方法,通过仅更新梯度敏感性低的参数子集来提升训练和推理效率,无需线性化即可在任务算术中超越现有方法。
English: TaLoS introduces a method for creating sparse task vectors that enhance training and inference efficiency by updating only a subset of parameters with low gradient sensitivity, outperforming existing approaches in task arithmetic without requiring linearization.
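A minimal sketch of the core idea follows: score parameters by accumulated gradient sensitivity on a few calibration batches, keep only the lowest-sensitivity fraction trainable, and zero out the gradients of everything else before each optimizer step. The scoring rule, keep ratio, and toy model are assumptions for illustration and may differ from TaLoS's actual procedure.

```python
import torch
import torch.nn as nn

def low_sensitivity_masks(model: nn.Module, loss_fn, data_iter, keep_ratio: float = 0.1):
    """Score parameters by accumulated |grad| on calibration batches and keep
    the lowest-sensitivity fraction for sparse fine-tuning (others stay frozen)."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_iter:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.abs()
    masks = {}
    for n, s in scores.items():
        k = max(1, int(keep_ratio * s.numel()))
        thresh = s.flatten().kthvalue(k).values      # k-th smallest sensitivity
        masks[n] = (s <= thresh).float()
    return masks

def apply_masks_to_grads(model: nn.Module, masks):
    """Zero gradients of frozen (high-sensitivity) parameters before optimizer.step()."""
    for n, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[n])

# Toy usage.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
masks = low_sensitivity_masks(model, nn.CrossEntropyLoss(), iter(data))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = data[0]
opt.zero_grad()
nn.CrossEntropyLoss()(model(x), y).backward()
apply_masks_to_grads(model, masks)                   # sparse update of low-sensitivity weights
opt.step()
```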
Authors:Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
Abstract:
Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
中文摘要:本研究为视觉语言模型提出了一个透明的强化学习框架,并通过实验验证了关键发现,如强化学习在泛化能力上优于监督微调,以及响应长度对性能的影响。
English Summary: This study introduces a transparent reinforcement learning framework for vision-language models, validated through experiments that reveal key insights like RL's superior generalization over supervised fine-tuning and the influence of response length on performance.
Authors:Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang
Abstract:
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure 1, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at https://github.com/AMAP-ML/GPG.
中文: 强化学习无需大量依赖监督微调即可提升大语言模型的推理能力,而提出的极简主义分组策略梯度方法通过省去复杂组件简化了训练过程,在各类任务中实现了更优性能。
English: Reinforcement Learning can enhance large language models' reasoning without heavy reliance on Supervised Fine-Tuning, and the proposed minimalist Group Policy Gradient method simplifies training by eliminating complex components, achieving superior performance across tasks.
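The following is a minimal sketch of a group-centered policy-gradient loss of the kind GPG describes: rewards are centered within each group of sampled responses and used directly as advantages, with no critic, reference model, or KL term. The tensor shapes and toy rewards are assumptions; the paper's exact bias corrections are not reproduced here.

```python
import torch

def group_policy_gradient_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Plain policy-gradient loss with group-centered advantages.

    logprobs: [num_prompts, group_size] summed log-probabilities of sampled responses
    rewards:  [num_prompts, group_size] scalar rewards for those responses
    The advantage is each reward minus its group mean; no critic, reference model,
    or KL penalty is used.
    """
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    return -(advantages.detach() * logprobs).mean()

# Toy usage with fake rollouts: 4 prompts, 8 sampled responses each.
torch.manual_seed(0)
logprobs = torch.randn(4, 8, requires_grad=True)
rewards = torch.randint(0, 2, (4, 8)).float()        # e.g., 1 if the answer is correct
loss = group_policy_gradient_loss(logprobs, rewards)
loss.backward()
print(float(loss), logprobs.grad.shape)
```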
Authors:Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
Abstract:
Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
中文摘要:ZClip是一种自适应梯度裁剪算法,通过基于z分数的异常检测动态调整阈值,有效防止大语言模型训练中的梯度不稳定和损失峰值问题,同时不影响模型收敛。
English Summary: ZClip is an adaptive gradient clipping algorithm that dynamically adjusts thresholds using z-score-based anomaly detection to prevent gradient instability and loss spikes in large language model training without hindering convergence.
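A small sketch of z-score-based adaptive clipping in PyTorch follows: it keeps an exponential moving estimate of the gradient-norm mean and variance and rescales outlier norms toward mean + z_thresh * std. Thresholds, warmup, and update rules here are illustrative assumptions and may differ from the released ZClip implementation.

```python
import torch

class ZScoreGradClipper:
    """Adaptive gradient clipping driven by a z-score over recent gradient norms."""

    def __init__(self, z_thresh: float = 2.5, ema: float = 0.97, warmup: int = 25, eps: float = 1e-6):
        self.z_thresh, self.ema, self.eps, self.warmup = z_thresh, ema, eps, warmup
        self.history, self.mean, self.var = [], 0.0, 0.0

    def step(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params])).item()

        if len(self.history) < self.warmup:              # only gather statistics at first
            self.history.append(norm)
            self.mean = sum(self.history) / len(self.history)
            self.var = sum((x - self.mean) ** 2 for x in self.history) / len(self.history)
            return norm

        std = (self.var + self.eps) ** 0.5
        z = (norm - self.mean) / (std + self.eps)
        if z > self.z_thresh:                            # statistical outlier: rescale gradients
            target = self.mean + self.z_thresh * std
            for p in params:
                p.grad.mul_(target / (norm + self.eps))
            norm = target

        delta = norm - self.mean                         # EMA update using the (clipped) norm
        self.mean += (1 - self.ema) * delta
        self.var = self.ema * (self.var + (1 - self.ema) * delta * delta)
        return norm

# Typical use in a training loop:
#   loss.backward(); clipper.step(model.parameters()); optimizer.step(); optimizer.zero_grad()
```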
Authors:Jingyi Wang, Duanfeng Chu, Zejian Deng, Liping Lu, Jinxiang Wang, Chen Sun
Abstract:
To address the challenge of insufficient interactivity and behavioral diversity in autonomous driving decision-making, this paper proposes a Cognitive Hierarchical Agent for Reasoning and Motion Stylization (CHARMS). By leveraging Level-k game theory, CHARMS captures human-like reasoning patterns through a two-stage training pipeline comprising reinforcement learning pretraining and supervised fine-tuning. This enables the resulting models to exhibit diverse and human-like behaviors, enhancing their decision-making capacity and interaction fidelity in complex traffic environments. Building upon this capability, we further develop a scenario generation framework that utilizes the Poisson cognitive hierarchy theory to control the distribution of vehicles with different driving styles through Poisson and binomial sampling. Experimental results demonstrate that CHARMS is capable of both making intelligent driving decisions as an ego vehicle and generating diverse, realistic driving scenarios as environment vehicles. The code for CHARMS is released at https://github.com/chuduanfeng/CHARMS.
中文摘要:本文提出CHARMS认知分层代理,利用Level-k博弈论通过两阶段训练实现类人推理与多样化行为,实验证明其既能作为主车做出智能驾驶决策,也能生成多样真实的环境车辆驾驶场景。
English Summary: This paper introduces CHARMS, a cognitive hierarchical agent that uses Level-k game theory to enhance autonomous driving decision-making with human-like reasoning and diverse behaviors, validated through experiments for both intelligent driving decisions and realistic scenario generation.
Authors:Ye Su, Hezhe Qiao, Di Wu, Yuwen Chen, Lin Chen
Abstract:
Imputation of multivariate time series (MTS) is particularly challenging since MTS data typically contain irregular patterns of missing values due to various factors such as instrument failures, interference from irrelevant data, and privacy regulations. Existing statistical methods and deep learning methods have shown promising results in time series imputation. In this paper, we propose a Temporal Gaussian Copula Model (TGC) for three-order MTS imputation. The key idea is to leverage the Gaussian Copula to explore the cross-variable and temporal relationships based on the latent Gaussian representation. Subsequently, we employ an Expectation-Maximization (EM) algorithm to improve robustness in managing data with varying missing rates. Comprehensive experiments were conducted on three real-world MTS datasets. The results demonstrate that our TGC substantially outperforms the state-of-the-art imputation methods. Additionally, the TGC model exhibits stronger robustness to the varying missing ratios in the test dataset. Our code is available at https://github.com/MVL-Lab/TGC-MTS.
Chinese: 本文提出了一种时序高斯Copula模型(TGC),通过捕捉跨变量和时间依赖性来有效填补多元时间序列中的缺失值,实验证明该模型在性能和鲁棒性上均优于现有方法。
English: The paper introduces a Temporal Gaussian Copula Model (TGC) that effectively imputes missing values in multivariate time series by capturing cross-variable and temporal dependencies, demonstrating superior performance and robustness in experiments compared to existing methods.
Authors:Xinyu Luo, Kecheng Chen, Pao-Sheng Vincent Sun, Chris Xing Tian, Arindam Basu, Haoliang Li
Abstract:
Spiking Neural Networks (SNNs), as a biologically plausible alternative to Artificial Neural Networks (ANNs), have demonstrated advantages in terms of energy efficiency, temporal processing, and biological plausibility. However, SNNs are highly sensitive to distribution shifts, which can significantly degrade their performance in real-world scenarios. Traditional test-time adaptation (TTA) methods designed for ANNs often fail to address the unique computational dynamics of SNNs, such as sparsity and temporal spiking behavior. To address these challenges, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method specifically designed for SNNs. SPACE leverages the inherent spike dynamics of SNNs to maximize the consistency of spike-behavior-based local feature maps across augmented versions of a single test sample, enabling robust adaptation without requiring source data. We evaluate SPACE on multiple datasets. Furthermore, SPACE exhibits robust generalization across diverse network architectures, consistently enhancing the performance of SNNs on CNNs, Transformer, and ConvLSTM architectures. Experimental results show that SPACE outperforms state-of-the-art ANN methods while maintaining lower computational cost, highlighting its effectiveness and robustness for SNNs in real-world settings. The code will be available at https://github.com/ethanxyluo/SPACE.
Chinese: 提出的SPACE方法是首个专为脉冲神经网络设计的无源单实例测试时自适应方法,利用脉冲动态特性增强对分布偏移的鲁棒性,在多种网络架构上优于现有方法且计算成本更低。
English: The proposed SPACE method is a novel source-free and single-instance test-time adaptation approach designed specifically for Spiking Neural Networks, leveraging their spike dynamics to enhance robustness against distribution shifts and outperforming existing methods across various architectures with lower computational cost.
Authors:Amit Rand, Hadi Ibrahim
Abstract:
Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to a substantial relative improvement of approximately 233% over random guessing (AUC = 0.5).
中文: MXA模块作为一种新型注意力机制,集成于EfficientViT架构中,通过捕捉局部细节和全局上下文,显著提升了多标签胸部X光分类性能,在CheXpert数据集上实现了0.85的AUC值。
English: The MXA block, a novel attention mechanism integrated into the EfficientViT architecture, significantly enhances multi-label chest X-ray classification by capturing both local details and global context, achieving a 0.85 AUC on the CheXpert dataset.
Authors:Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He
Abstract:
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
Chinese: 本研究提出了一种数据合成流程和UNO模型,以解决主题驱动图像生成中的数据扩展和主题扩展难题,在单主题与多主题生成中均实现了高度一致性和可控性。
English: This study introduces a data synthesis pipeline and the UNO model to overcome challenges in subject-driven image generation, achieving high consistency and controllability in both single- and multi-subject scenarios.
Authors:Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Lars Schimmelpfennig, Levi Kaster, Di Huang, Carlos Cruchaga, Guangfu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li
Abstract:
Complex cell signaling systems -- governed by varying protein abundances and interactions -- generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, hundreds of millions of single-cell omics data have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations -- such as biological functions, cellular locations, signaling pathways, related diseases, and drugs -- with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at https://github.com/FuhaiLiAiLab/OmniCellTOSG.
中文: OmniCellTOSG推出了首个细胞文本-组学信号图谱数据集,将人类可读的生物注释与定量基因数据相结合,通过图推理解码不同条件下的复杂细胞信号系统。
English: OmniCellTOSG introduces the first dataset of cell text-omic signaling graphs that integrates human-readable biological annotations with quantitative gene data, enabling graph reasoning to decode complex cell signaling systems across diverse conditions.
Authors:Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri
Abstract:
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
中文: 大型语言模型可通过自回归元调度与选择性旧数据回放实现高效更新,在通用网络数据上需依赖回放防止遗忘,而在特定领域则需求较低,能以2.6倍计算效率达到与完全重新训练相当的效能。
English: Large Language Models (LLMs) can be efficiently updated using autoregressive meta-schedules with selective replay of older data, achieving performance comparable to full retraining while reducing computational costs by 2.6 times, though the need for replay varies between generic web data and specialized domains.
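A toy sketch of fixed-ratio replay when continuing pretraining on a new data dump: each batch mixes a fixed fraction of examples sampled from older data with fresh examples from the new dump. The 10% replay ratio and the list-based data pools are assumptions for illustration, not the paper's actual data pipeline.

```python
import random

def mix_with_replay(new_docs, old_docs, replay_ratio: float = 0.1, batch_size: int = 32):
    """Yield batches where a fixed fraction of examples is replayed from older data.

    new_docs / old_docs: indexable pools of training documents (stand-ins here).
    replay_ratio: fraction of each batch drawn from old data (e.g., 10% replay).
    """
    n_replay = int(round(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    new_iter = iter(new_docs)
    while True:
        try:
            batch = [next(new_iter) for _ in range(n_new)]
        except StopIteration:
            return                                   # new dump exhausted
        batch += random.sample(old_docs, n_replay)   # fixed-ratio replay of old data
        random.shuffle(batch)
        yield batch

# Toy usage: 10% replay of "old" documents while training on a "new" dump.
new_docs = [f"new_doc_{i}" for i in range(100)]
old_docs = [f"old_doc_{i}" for i in range(1000)]
for batch in mix_with_replay(new_docs, old_docs, replay_ratio=0.1, batch_size=10):
    pass
print(len(batch), sum(d.startswith("old") for d in batch))
```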
Authors:Dohyun Kim, Sehwan Park, Geonhee Han, Seung Wook Kim, Paul Hongsuck Seo
Abstract:
Diffusion models generate high-quality images through progressive denoising but are computationally intensive due to large model sizes and repeated sampling. Knowledge distillation, which transfers knowledge from a complex teacher to a simpler student model, has been widely studied in recognition tasks, particularly for transferring concepts unseen during student training. However, its application to diffusion models remains underexplored, especially in enabling student models to generate concepts not covered by the training images. In this work, we propose Random Conditioning, a novel approach that pairs noised images with randomly selected text conditions to enable efficient, image-free knowledge distillation. By leveraging this technique, we show that the student can generate concepts unseen in the training images. When applied to conditional diffusion model distillation, our method allows the student to explore the condition space without generating condition-specific images, resulting in notable improvements in both generation quality and efficiency. This promotes resource-efficient deployment of generative diffusion models, broadening their accessibility for both research and real-world applications. Code, models, and datasets are available at https://dohyun-as.github.io/Random-Conditioning .
Authors:Boshi Wang, Huan Sun
Abstract:
Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we identify two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. We demonstrate that the skill of reversal unlocks a new kind of memory integration that enables models to solve large-scale arithmetic reasoning problems via parametric forward-chaining, outperforming frontier LLMs based on non-parametric memory and prolonged explicit reasoning.
Chinese: 大语言模型中的逆转诅咒源于Transformer在概念绑定上的局限,而基于JEPA的新型模型设计结合记忆层不仅突破了这一诅咒,还通过参数化前向链实现了卓越的算术推理能力。
English: The Reversal Curse in LLMs arises from transformers' limitations in conceptual binding, and a novel JEPA-based model design with memory layers overcomes this curse, enabling superior arithmetic reasoning through parametric forward-chaining.
Authors:Andrey Sidorenko, Michael Platzer, Mario Scriminaci, Paul Tiwald
Abstract:
Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy. The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques. The code of the framework is available at https://github.com/mostly-ai/mostlyai-qa.
ZH: 本研究提出一个综合框架,通过多维指标和基准测试策略量化合成数据的分布保真度与隐私保护效果,以评估其数据质量。
EN: This study introduces a comprehensive framework for evaluating synthetic data quality by quantifying distributional fidelity and privacy protection through multi-dimensional metrics and benchmarking strategies.
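One of the nearest-neighbor checks mentioned above can be sketched as follows: compare each synthetic record's distance to its closest training record against its distance to the closest holdout record; synthetic data that sits systematically closer to training rows than to unseen rows is a sign of memorization. The metric names and the Gaussian toy data below are illustrative, not the framework's API.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def distance_to_closest_record(synthetic: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Distance from each synthetic record to its nearest neighbor in a reference set."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    dist, _ = nn.kneighbors(synthetic)
    return dist[:, 0]

def holdout_privacy_check(synthetic, train, holdout):
    """Holdout-based check: synthetic data should not be systematically closer to the
    training records than to unseen holdout records."""
    d_train = distance_to_closest_record(synthetic, train)
    d_holdout = distance_to_closest_record(synthetic, holdout)
    return {"median_dcr_train": float(np.median(d_train)),
            "median_dcr_holdout": float(np.median(d_holdout)),
            "share_closer_to_train": float(np.mean(d_train < d_holdout))}  # ~0.5 is benign

# Toy usage with Gaussian data standing in for encoded tabular records.
rng = np.random.default_rng(0)
train, holdout = rng.normal(size=(500, 8)), rng.normal(size=(500, 8))
synthetic = rng.normal(size=(300, 8))
print(holdout_privacy_check(synthetic, train, holdout))
```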
Authors:Neville K. Kitson, Anthony C. Constantinou
Abstract:
Many Bayesian Network structure learning algorithms are unstable, with the learned graph sensitive to arbitrary dataset artifacts, such as the ordering of columns (i.e., variable order). PC-Stable attempts to address this issue for the widely-used PC algorithm, prompting researchers to use the "stable" version instead. However, this problem seems to have been overlooked for score-based algorithms. In this study, we show that some widely-used score-based algorithms, as well as hybrid and constraint-based algorithms, including PC-Stable, suffer from the same issue. We propose a novel solution for score-based greedy hill-climbing that eliminates instability by determining a stable node order, leading to consistent results regardless of variable ordering. Two implementations, HC-Stable and Tabu-Stable, are introduced. Tabu-Stable achieves the highest BIC scores across all networks, and the highest accuracy for categorical networks. These results highlight the importance of addressing instability in structure learning and provide a robust and practical approach for future applications. This extends the scope and impact of our previous work presented at Probabilistic Graphical Models 2024 by incorporating continuous variables. The implementation, along with usage instructions, is freely available on GitHub at https://github.com/causal-iq/discovery.
Chinese: 本研究揭示了包括PC-Stable在内的常用评分型、混合型和约束型贝叶斯网络算法均存在变量顺序敏感性问题,提出了HC-Stable和Tabu-Stable解决方案,其中Tabu-Stable在所有网络中取得了最优的BIC分数和分类网络准确率。
English: This study reveals that widely-used score-based, hybrid, and constraint-based Bayesian Network algorithms, including PC-Stable, suffer from instability due to variable ordering and proposes HC-Stable and Tabu-Stable solutions, with Tabu-Stable achieving top performance metrics.
Authors:Yongjun He, Roger Waleffe, Zhichao Han, Johnu George, Binhang Yuan, Zitao Zhang, Yinan Shan, Yang Zhao, Debojyoti Dutta, Theodoros Rekatsinas, Ce Zhang
Abstract:
Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.
中文: MLKV是一个高效、可扩展且可复用的数据存储框架,通过普及优化策略解决嵌入模型训练中的数据停滞和过时问题,其性能比工业级键值存储高出1.6至12.6倍。
English: MLKV is a scalable and reusable data storage framework that addresses data stall and staleness in embedding model training by democratizing optimizations and outperforming existing key-value stores by 1.6-12.6 times.
Authors:Khoa A. Tran, John V. Pearson, Nicola Waddell
Abstract:
Motivation: Building and iterating machine learning models is often a resource-intensive process. In biomedical research, scientific codebases can lack scalability and are not easily transferable to work beyond their intended purpose. xML-workFlow addresses this issue by providing a rapid, robust, and traceable end-to-end workflow that can be adapted to any ML project with minimal code rewriting.
Results: We show a practical, end-to-end workflow that integrates scikit-learn, MLflow, and SHAP. This template significantly reduces the time and effort required to build and iterate on ML models, addressing the common challenges of scalability and reproducibility in biomedical research. Adapting our template may save bioinformaticians time in development and enables biomedical researchers to deploy ML projects.
Availability and implementation: xML-workFlow is available at https://github.com/MedicalGenomicsLab/xML-workFlow.
中文摘要:xML-workFlow 提供了一种可扩展的端到端机器学习解决方案,通过整合 scikit-learn、MLflow 和 SHAP 工具并最小化代码修改,有效缩短生物医学研究中的开发时间并提升结果可复现性。
English Summary: xML-workFlow provides a scalable end-to-end machine learning solution that reduces development time and improves reproducibility in biomedical research by integrating scikit-learn, MLflow, and SHAP with minimal code adaptation.
Authors:Salim Khazem, Jeremy Fix, Cédric Pradalier
Abstract:
Deep learning models have achieved significant success in various image-related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources, making it suitable for real-time and resource-constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real-world scenarios. The code for the experiments of the paper is provided in https://github.com/salimkhazem/PolygoNet.
Chinese: 本文提出一种高效的深度学习方法,通过采用多边形图像表示来降低计算复杂度、防止过拟合,并能在保持与先进方法相当性能的同时,实现在边缘设备上的部署。
English: This paper introduces an efficient deep learning method that uses polygonal image representations to reduce computational complexity, prevent overfitting, and enable deployment on edge devices while maintaining performance comparable to state-of-the-art approaches.
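As a rough sketch of how a polygonal representation might be extracted with standard OpenCV primitives (contour detection plus Douglas-Peucker approximation), assuming a binary mask as input; the repository's actual preprocessing pipeline may differ in its point selection and normalization.

```python
import cv2
import numpy as np

def polygonal_representation(binary_image: np.ndarray, epsilon_frac: float = 0.01, max_points: int = 64):
    """Approximate the largest contour of a binary image by a polygon and return a
    fixed-length, normalized coordinate sequence usable as a compact network input."""
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    epsilon = epsilon_frac * cv2.arcLength(contour, True)
    poly = cv2.approxPolyDP(contour, epsilon, True).reshape(-1, 2).astype(np.float32)
    poly /= np.array(binary_image.shape[::-1], dtype=np.float32)   # normalize x, y to [0, 1]
    # Resample (by repetition or subsampling) to a fixed number of points for batching.
    idx = np.linspace(0, len(poly) - 1, max_points).round().astype(int)
    return poly[idx]

# Toy usage: a filled circle becomes a short list of (x, y) vertices.
img = np.zeros((128, 128), dtype=np.uint8)
cv2.circle(img, (64, 64), 40, 255, -1)
print(polygonal_representation(img).shape)          # (64, 2)
```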
Authors:Jose Gallego-Posada, Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien
Abstract:
Cooper is an open-source package for solving constrained optimization problems involving deep learning models. Cooper implements several Lagrangian-based first-order update schemes, making it easy to combine constrained optimization algorithms with high-level features of PyTorch such as automatic differentiation, and specialized deep learning architectures and optimizers. Although Cooper is specifically designed for deep learning applications where gradients are estimated based on mini-batches, it is suitable for general non-convex continuous constrained optimization. Cooper's source code is available at https://github.com/cooper-org/cooper.
中文: Cooper是一个用于深度学习约束优化的开源工具包,它将拉格朗日方法与PyTorch的自动微分和专业架构等特性相结合。
English: Cooper is an open-source toolkit for constrained optimization in deep learning, integrating Lagrangian methods with PyTorch's features like automatic differentiation and specialized architectures.
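To illustrate the Lagrangian-based first-order schemes such a package implements, here is a generic gradient descent-ascent loop on a toy constrained problem in plain PyTorch. It deliberately does not use Cooper's API; the toy objective, constraint, and learning rates are assumptions for illustration only.

```python
import torch

# Toy constrained problem: minimize ||w||^2 subject to 1 - sum(w) <= 0.
w = torch.zeros(5, requires_grad=True)
lmbda = torch.zeros(())                               # Lagrange multiplier, kept >= 0

opt = torch.optim.SGD([w], lr=0.1)
dual_lr = 0.1

for step in range(500):
    loss = (w ** 2).sum()
    violation = 1.0 - w.sum()                         # constraint g(w) <= 0, positive when violated
    lagrangian = loss + lmbda * violation

    opt.zero_grad()
    lagrangian.backward()
    opt.step()                                        # gradient descent on the primal variables

    with torch.no_grad():                             # gradient ascent on the multiplier
        lmbda = torch.clamp(lmbda + dual_lr * violation, min=0.0)

print(w.detach(), float(lmbda))                       # expect each w_i near 0.2
```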
Authors:Gregory M. Campbell, Gentian Muhaxheri, Leonardo Ferreira Guilhoto, Christian D. Santangelo, Paris Perdikaris, James Pikul, Mark Yim
Abstract:
Soft pneumatic actuators (SPA) made from elastomeric materials can provide large strain and large force. The behavior of locally strain-restricted hyperelastic materials under inflation has been investigated thoroughly for shape reconfiguration, but requires further investigation for trajectories involving external force. In this work we model force-pressure-height relationships for a concentrically strain-limited class of soft pneumatic actuators and demonstrate the use of this model to design SPA response for object lifting. We predict relationships under different loadings by solving energy minimization equations and verify this theory by using an automated test rig to collect rich data for n=22 Ecoflex 00-30 membranes. We collect this data using an active learning pipeline to efficiently model the design space. We show that this learned material model outperforms the theory-based model and naive curve-fitting approaches. We use our model to optimize membrane design for different lift tasks and compare this performance to other designs. These contributions represent a step towards understanding the natural response for this class of actuator and embodying intelligent lifts in a single-pressure input actuator system.
中文: 本研究针对同心应变限制型软气动执行器进行建模与优化,证明了在预测力-压力关系和提升执行器性能方面,学习得到的材料模型优于理论模型及简单曲线拟合方法。
English: This study models and optimizes concentrically strain-limited soft pneumatic actuators for object lifting, demonstrating that a learned material model outperforms theoretical and curve-fitting approaches in predicting force-pressure relationships and enhancing actuator design.
Authors:Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
Abstract:
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
Chinese: 在大语言模型测试时计算扩展中,自我一致性方法通过多数投票选择答案,而生成式奖励模型通过验证链评分,研究发现自我一致性在多数实际计算预算下效率更高,且最优推理策略更倾向于大力扩展解决方案生成。
English: Scaling test-time compute for large language models involves a trade-off between generating more solutions through Self-Consistency and using fewer solutions with Generative Reward Model verification, with findings showing Self-Consistency is more compute-efficient for most practical budgets and that optimal inference favors aggressive solution scaling.
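A minimal sketch of the Self-Consistency baseline discussed above: sample several solutions, extract their final answers, and return the majority vote. The `sample_solution` and `extract_answer` callables are hypothetical stand-ins for an LLM sampler and an answer parser.

```python
import random
from collections import Counter

def self_consistency(sample_solution, extract_answer, prompt: str, n_samples: int = 16):
    """Self-Consistency: sample several reasoning chains and return the majority answer.

    sample_solution(prompt) -> str  : one sampled chain-of-thought (assumed LLM call)
    extract_answer(solution) -> str : pulls the final answer out of a solution string
    """
    answers = [extract_answer(sample_solution(prompt)) for _ in range(n_samples)]
    (best, votes), = Counter(answers).most_common(1)
    return best, votes / n_samples

# Toy usage with a fake sampler standing in for an LLM.
def fake_sampler(prompt):
    return "... therefore the answer is " + random.choice(["42", "42", "42", "41"])

def fake_extract(solution):
    return solution.rsplit(" ", 1)[-1]

random.seed(0)
print(self_consistency(fake_sampler, fake_extract, "What is 6*7?", n_samples=16))
```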
Authors:Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
Abstract:
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S.
Chinese: Agent S2通过组合式框架结合混合定位技术和主动分层规划,解决了图形界面交互中的核心难题,在多项计算机使用基准测试中创下了最优性能记录。
English: Agent S2 introduces a compositional framework with Mixture-of-Grounding and Proactive Hierarchical Planning to overcome GUI interaction challenges, achieving state-of-the-art performance across multiple computer use benchmarks.
Authors:Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li
Abstract:
Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches, including manual rewriting, rule-based systems, and large language model (LLM)-based techniques, often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.
中文: CrackSQL是一种结合规则与大型语言模型的混合式SQL方言翻译系统,通过功能化查询分割和跨方言语法嵌入技术提升翻译准确性,同时支持多种部署模式以适应实际应用场景。
English: CrackSQL is a hybrid SQL dialect translation system that combines rule-based and LLM-based methods to enhance accuracy and reduce manual effort by segmenting complex queries and employing cross-dialect syntax alignment.
Authors:Xin Tong, Xuanhe Zhou, Bingsheng He, Guoliang Li, Zirui Tang, Wei Zhou, Fan Wu, Mian Lu, Yuqiang Chen
Abstract:
Feature management is essential for many online machine learning applications and can often become the performance bottleneck (e.g., taking up to 70% of the overall latency in a sales prediction service). Improper feature configurations (e.g., introducing too many irrelevant features) can severely undermine the model's generalization capabilities. However, managing online ML features is challenging due to (1) large-scale, complex raw data (e.g., the 2018 PHM dataset contains 17 tables and dozens to hundreds of columns), (2) the need for high-performance, consistent computation of interdependent features with complex patterns, and (3) the requirement for rapid updates and deployments to accommodate real-time data changes. In this demo, we present FeatInsight, a system that supports the entire feature lifecycle, including feature design, storage, visualization, computation, verification, and lineage management. FeatInsight (with OpenMLDB as the execution engine) has been deployed in over 100 real-world scenarios on 4Paradigm's Sage Studio platform, handling up to a trillion-dimensional feature space and enabling millisecond-level feature updates. We demonstrate how FeatInsight enhances feature design efficiency (e.g., for online product recommendation) and improves feature computation performance (e.g., for online fraud detection). The code is available at https://github.com/4paradigm/FeatInsight.
中文: 特征管理对在线机器学习至关重要却常成为性能瓶颈,而FeatInsight作为全周期特征管理系统,已在众多实际场景中部署,显著提升了特征设计效率和计算性能。
English: Feature management is critical in online machine learning but often becomes a performance bottleneck, and FeatInsight is a comprehensive system that enhances efficiency and performance across the entire feature lifecycle, deployed in numerous real-world scenarios.
Authors:Yang Yang, Xijie Xu, Yixun Zhou, Jie Zheng
Abstract:
Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.
Chinese: CellVTA通过将基于CNN的适配器与视觉Transformer结合,保留高分辨率空间细节,显著提升了数字病理学中的细胞实例分割性能,在基准数据集上取得了领先成果。
English: CellVTA enhances cell instance segmentation in digital pathology by integrating a CNN-based adapter with Vision Transformers to preserve high-resolution spatial details, achieving state-of-the-art results on benchmark datasets.
Authors:Owen Cook, Jake Vasilakes, Ian Roberts, Xingyi Song
Abstract:
Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework's efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft-label aggregation and sample weighting, and the other increasing the overall agreement among annotators through identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at https://github.com/MiniEggz/EffiARA and the webtool is publicly accessible at https://effiara.gate.ac.uk.
中文:EffiARA框架是首个支持完整文档级标注流程的综合解决方案,通过提升标注可靠性和效率得到验证,现已作为开源Python包及易用的网络工具发布。
English: The EffiARA framework is the first comprehensive solution supporting the entire document-level annotation pipeline, enhancing reliability and efficiency, as demonstrated in previous studies, and is now available as an open-source Python package with a user-friendly webtool.
Authors:Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao
Abstract:
Developing trustworthy AI systems requires attribution methods that identify the input regions that most influence a model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
中文: 本文提出LiMA这一高效黑盒归因方法,通过将输入区域重要性重构为子模优化问题,在识别影响AI决策的关键区域方面实现了更优的准确性和效率。
English: This paper introduces LiMA, an efficient black-box attribution method that reformulates input region importance as a submodular optimization problem, achieving superior accuracy and speed in identifying influential regions for AI decisions.
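A simplified sketch of greedy submodular subset selection for attribution: sub-regions are added one at a time by largest marginal gain of a set-scoring function. The concave toy scorer stands in for a model-based submodular function; LiMA's bidirectional search and attribution-boundary guarantees are not reproduced here.

```python
import numpy as np

def greedy_region_attribution(score_fn, num_regions: int, budget: int):
    """Greedy submodular-style selection of the most important input regions.

    score_fn(mask) -> float : importance of the unmasked region set, where mask is a
    boolean vector over sub-regions (e.g., model confidence in the target class when
    only those regions are kept visible).
    """
    selected = np.zeros(num_regions, dtype=bool)
    order, gains = [], []
    current = score_fn(selected)
    for _ in range(budget):
        best_gain, best_idx = -np.inf, None
        for i in np.flatnonzero(~selected):
            trial = selected.copy()
            trial[i] = True
            gain = score_fn(trial) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected[best_idx] = True
        current += best_gain
        order.append(best_idx)
        gains.append(best_gain)
    return order, gains

# Toy usage: a synthetic scorer where regions 2 and 5 matter most (diminishing returns).
weights = np.array([0.1, 0.2, 2.0, 0.1, 0.3, 1.5, 0.1, 0.2])
score = lambda m: float(np.sqrt(weights[m].sum()))    # concave => submodular-like gains
print(greedy_region_attribution(score, num_regions=8, budget=3))
```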
Authors:Thomas Bailie, Yun Sing Koh, S. Karthik Mukkavilli, Varvara Vetrova
Abstract:
Graphical forecasting models learn the structure of time series data via projecting onto a graph, with recent techniques capturing spatial-temporal associations between variables via edge weights. Hierarchical variants offer a distinct advantage by analysing the time series across multiple resolutions, making them particularly effective in tasks like global weather forecasting, where low-resolution variable interactions are significant. A critical challenge in hierarchical models is information loss during forward or backward passes through the hierarchy. We propose the Hierarchical Graph Flow (HiGFlow) network, which introduces a memory buffer variable of dynamic size to store previously seen information across variable resolutions. We theoretically show two key results: HiGFlow reduces smoothness when mapping onto new feature spaces in the hierarchy and non-strictly enhances the utility of message-passing by improving Weisfeiler-Lehman (WL) expressivity. Empirical results demonstrate that HiGFlow outperforms state-of-the-art baselines, including transformer models, by at least an average of 6.1% in MAE and 6.2% in RMSE. Code is available at https://github.com/TB862/HiGFlow.git.
Chinese: 分层图流网络通过引入动态内存缓冲区来缓解层次图形预测中的多分辨率信息损失,理论上提升了特征映射和消息传递表达能力,并在实验中以超过6%的误差指标优势超越了现有最优模型。
English: The Hierarchical Graph Flow (HiGFlow) network introduces a dynamic memory buffer to mitigate information loss across resolutions in hierarchical graphical forecasting, theoretically enhancing feature mapping and message-passing expressivity while empirically outperforming state-of-the-art models by over 6% in error metrics.
Authors:Muhammad Tahir, Shehroz S. Khan, James Davie, Soichiro Yamanaka, Ahmed Ashraf
Abstract:
In mammalian and vertebrate genomes, the promoter regions of the gene and their distal enhancers may be located millions of base-pairs from each other, while a promoter may not interact with the closest enhancer. Since base-pair proximity is not a good indicator of these interactions, there is considerable work toward developing methods for predicting Enhancer-Promoter Interactions (EPI). Several machine learning methods have reported increasingly higher accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets followed by model training. However, the aforementioned random splitting causes information leakage by assigning EP pairs from the same genomic region to both testing and training sets, leading to performance overestimation. In this paper we propose to use a more thorough training and testing paradigm i.e., Leave-one-chromosome-out (LOCO) cross-validation for EPI-prediction. We demonstrate that a deep learning algorithm, which gives higher accuracies when trained and tested on random-splitting setting, drops drastically in performance under LOCO setting, confirming overestimation of performance. We further propose a novel hybrid deep neural network for EPI-prediction that fuses k-mer features of the nucleotide sequence. We show that the hybrid architecture performs significantly better in the LOCO setting, demonstrating it can learn more generalizable aspects of EP interactions. With this paper we are also releasing the LOCO splitting-based EPI dataset. Research data is available in this public repository: https://github.com/malikmtahir/EPI
中文: 在哺乳动物和脊椎动物基因组中,增强子与启动子的相互作用无法通过距离可靠预测,尽管机器学习方法常显示高准确率,但随机数据集分割会导致性能高估,本文通过留一染色体交叉验证法和一种混合深度神经网络解决了此问题,提升了模型泛化能力。
English: In mammalian and vertebrate genomes, enhancer-promoter interactions (EPI) are not reliably predicted by proximity, and while machine learning methods often show high accuracy, random dataset splitting leads to performance overestimation, which is addressed through a Leave-one-chromosome-out (LOCO) cross-validation approach and a hybrid deep neural network that improves generalizability.
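Leave-one-chromosome-out evaluation can be set up directly with scikit-learn's LeaveOneGroupOut by using the chromosome of each EP pair as the group label, so no chromosome contributes to both training and test folds. The synthetic features and the logistic-regression classifier below are placeholders, not the paper's hybrid network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins for enhancer-promoter pair features, labels, and chromosome of origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = rng.integers(0, 2, size=600)
chrom = rng.choice([f"chr{i}" for i in range(1, 7)], size=600)

# Leave-one-chromosome-out: all pairs from one chromosome form the test fold,
# so the same genomic region never appears in both training and testing.
logo = LeaveOneGroupOut()
aucs = []
for train_idx, test_idx in logo.split(X, y, groups=chrom):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))
print({c: round(a, 3) for c, a in zip(sorted(set(chrom)), aucs)})
```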
Authors:Srinitish Srinivasan, Omkumar CU
Abstract:
While hyperbolic GNNs show promise for hierarchical data, they often have limited discriminative power compared to Euclidean counterparts or the WL test, due to non-injective aggregation. To address this expressivity gap, we propose the Lorentzian Graph Isomorphic Network (LGIN), a novel HGNN designed for enhanced discrimination within the Lorentzian model. LGIN introduces a new update rule that preserves the Lorentzian metric while effectively capturing richer structural information. This marks a significant step towards more expressive GNNs on Riemannian manifolds. Extensive evaluations across nine benchmark datasets demonstrate LGIN's superior performance, consistently outperforming or matching state-of-the-art hyperbolic and Euclidean baselines, showcasing its ability to capture complex graph structures. LGIN is the first to adapt principles of powerful, highly discriminative GNN architectures to a Riemannian manifold. The code for our paper can be found at https://github.com/Deceptrax123/LGIN
中文: 提出的洛伦兹图同构网络(LGIN)通过保留洛伦兹度量的新型更新规则,增强了双曲图神经网络的判别能力,在九个基准测试中展现出优越性能,能有效捕捉复杂图结构。
English: The proposed Lorentzian Graph Isomorphic Network (LGIN) enhances hyperbolic graph neural networks' discriminative power through a novel update rule that preserves the Lorentzian metric, demonstrating superior performance across nine benchmarks by effectively capturing complex graph structures.
Authors:Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Abstract:
Time series forecasting is an important application in various domains such as energy management, traffic planning, financial markets, meteorology, and medicine. However, real-world time series data often present intricate temporal variability and sharp fluctuations, which pose significant challenges for time series forecasting. Previous models that rely on 1D time series representations usually struggle with complex temporal variations. To address the limitations of 1D time series, this study introduces the Times2D method that transforms the 1D time series into 2D space. Times2D consists of three main parts: first, a Periodic Decomposition Block (PDB) that captures temporal variations within a period and between the same periods by converting the time series into a 2D tensor in the frequency domain. Second, the First and Second Derivative Heatmaps (FSDH) capture sharp changes and turning points, respectively. Finally, an Aggregation Forecasting Block (AFB) integrates the output tensors from PDB and FSDH for accurate forecasting. This 2D transformation enables the utilization of 2D convolutional operations to effectively capture long and short characteristics of the time series. Comprehensive experimental results across large-scale data in the literature demonstrate that the proposed Times2D model achieves state-of-the-art performance in both short-term and long-term forecasting. The code is available in this repository: https://github.com/Tims2D/Times2D.
中文: 本研究提出Times2D方法,通过将一维时间序列转换为二维空间,利用周期性分解和导数热图捕捉复杂时序模式,经二维卷积处理后在综合实验中实现了最先进的预测性能。
English: This study introduces Times2D, a novel method that transforms 1D time series into 2D space using periodic decomposition and derivative heatmaps to capture complex temporal patterns through 2D convolutions, achieving state-of-the-art forecasting performance in comprehensive experiments.
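The generic ingredient behind this kind of method is folding a 1D series into a 2D (cycles x phase) array using an FFT-detected dominant period, alongside derivative signals for sharp changes. The sketch below illustrates that generic idea in NumPy; it is not the paper's PDB/FSDH implementation, and the helper name is invented here.

```python
# Sketch: detect a dominant period via FFT, fold the 1D series into a 2D (cycles x period)
# array, and compute first/second finite differences, a rough analogue of the period
# decomposition and derivative-heatmap ideas (not the paper's exact blocks).
import numpy as np

def fold_to_2d(x: np.ndarray):
    n = len(x)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    spectrum[0] = 0.0                          # drop the DC component
    freq = int(np.argmax(spectrum))            # dominant frequency index
    period = max(2, n // max(freq, 1))         # corresponding period estimate
    n_cycles = n // period
    x2d = x[: n_cycles * period].reshape(n_cycles, period)   # rows = cycles, cols = phase
    return x2d, period

t = np.arange(1024)
series = np.sin(2 * np.pi * t / 32) + 0.1 * np.random.default_rng(0).normal(size=t.size)
x2d, period = fold_to_2d(series)
d1 = np.diff(series, n=1)                      # first derivative: sharp changes
d2 = np.diff(series, n=2)                      # second derivative: turning points
print(x2d.shape, period, d1.shape, d2.shape)
```

Once the series is in 2D form, standard 2D convolutions can mix information within a period (columns) and across periods (rows).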
Authors:J. V. S. Souza, C. B. Vieira, G. D. C. Cavalcanti, R. M. O. Cruz
Abstract:
In recent years, the rise of cyber threats has emphasized the need for robust malware detection systems, especially on mobile devices. Malware, which targets vulnerabilities in devices and user data, represents a substantial security risk. A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications. We assess monolithic classifiers and ensemble methods, focusing on dynamic selection algorithms, which have shown superior performance compared to traditional approaches. In contrast to balancing strategies performed on the whole dataset, we propose a balancing procedure that works individually for each classifier in the pool. Our empirical analysis demonstrates that the KNOP algorithm obtained the best results using a pool of Random Forest. Additionally, an instance hardness assessment revealed that balancing reduces the difficulty of the minority class and enhances the detection of the minority class (malware). The code used for the experiments is available at https://github.com/jvss2/Machine-Learning-Empirical-Evaluation.
中文: 本研究通过评估多种机器学习策略解决安卓恶意软件检测中的类别不平衡问题,发现采用随机森林池的KNOP算法效果最佳,且平衡技术能有效提升少数类(恶意软件)的检测性能。
English: This study tackles class imbalance in Android malware detection by evaluating machine learning strategies, finding that the KNOP algorithm with a Random Forest pool achieves optimal results and that balancing techniques improve minority class detection.
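To make the per-classifier balancing idea concrete, the sketch below trains each member of a pool on its own undersampled bootstrap instead of balancing the whole dataset once, then combines the pool by soft voting. The data is synthetic, and the combination step is a simple average rather than a dynamic selection rule such as KNOP (which libraries like deslib provide).

```python
# Sketch: a pool where each classifier is trained on its own balanced bootstrap
# (per-classifier undersampling of the majority class), then combined by soft voting.
# Feature/label arrays are synthetic stand-ins for Android app feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))
y = (rng.random(5000) < 0.05).astype(int)       # ~5% malware: heavy class imbalance

def balanced_bootstrap(X, y, rng):
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    maj_sample = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, maj_sample])
    rng.shuffle(idx)
    return X[idx], y[idx]

pool = []
for seed in range(10):
    Xb, yb = balanced_bootstrap(X, y, np.random.default_rng(seed))
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(Xb, yb)
    pool.append(clf)

X_test = rng.normal(size=(100, 50))
proba = np.mean([clf.predict_proba(X_test)[:, 1] for clf in pool], axis=0)
pred = (proba >= 0.5).astype(int)
print(pred[:10])
```

A dynamic selection method would replace the final averaging by choosing, per test instance, the pool members judged most competent in that region of the feature space.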
Authors:Huan Zhao, Yiming Liu, Jina Yao, Ling Xiong, Zexin Zhou, Zixing Zhang
Abstract:
Recent breakthroughs in single-cell technology have ushered in unparalleled opportunities to decode the molecular intricacy of complex biological systems, especially those linked to diseases unique to humans. However, these progressions have also introduced novel obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To effectively surmount this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two groundbreaking elements: First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting for common categories. Second, we introduce an innovative Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improves the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset: Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at https://github.com/AI4science-ym/HiCeller.
中文: 单细胞技术的最新进展带来了大规模长尾数据标注的挑战,Celler通过引入高斯膨胀损失函数和硬数据挖掘策略的生成预训练模型有效应对,并辅以全面的Celler-75数据集支持。
English: Recent advances in single-cell technology present challenges in annotating large-scale, long-tailed data, which Celler addresses through its generative pre-training model featuring Gaussian Inflation Loss and Hard Data Mining strategy, supported by the extensive Celler-75 dataset.
Authors:Yuyao Zhang, Jinghao Li, Yu-Wing Tai
Abstract:
Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) $\textit{structured generation}$ from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) $\textit{layered object integration}$, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$ for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.
中文: LayerCraft是一个利用大语言模型作为自主代理的模块化框架,通过思维链推理实现结构化、分层的图像生成与编辑,无需重新训练现有模型即可直观控制空间构图和对象一致性。
English: LayerCraft is a modular framework employing large language models as autonomous agents to enable structured, layered image generation and editing, offering intuitive control over spatial composition and object consistency through chain-of-thought reasoning and seamless integration without retraining existing models.
Authors:Yuping Wang, Xiangyu Huang, Xiaokang Sun, Mingxuan Yan, Shuo Xing, Zhengzhong Tu, Jiachen Li
Abstract:
We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images). UniOcc unifies the data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels and annotating innovative per-voxel flows. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth labels, enabling robust assessment on additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. Our data and code are available at https://uniocc.github.io/.
中文:UniOcc是一个统一的占用预测基准与工具包,整合了多源真实数据和仿真数据集,通过创新的三维流标注和无真值标签评估指标,证明大规模训练数据与显式流信息能显著提升占用预测性能。
English: UniOcc is a unified benchmark and toolkit for occupancy forecasting and prediction, integrating diverse real-world and simulated datasets with innovative 3D flow annotations and label-free evaluation metrics to significantly improve model performance through large-scale training data and explicit flow information.
Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
Abstract:
Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
Chinese: 强化学习的最新进展提升了多模态大语言模型的推理能力,SEED-Bench-R1基准测试验证了其有效性,但在复杂视觉推理任务中保持逻辑连贯性仍是待解决的挑战。
English: Recent advancements in reinforcement learning have enhanced multimodal large language models' reasoning capabilities, as demonstrated by the SEED-Bench-R1 benchmark, though challenges remain in maintaining logical coherence during complex visual reasoning tasks.
Authors:Patrick Knab, Sascha Marton, Udo Schlegel, Christian Bartelt
Abstract:
As neural networks become dominant in essential systems, Explainable Artificial Intelligence (XAI) plays a crucial role in fostering trust and detecting potential misbehavior of opaque models. LIME (Local Interpretable Model-agnostic Explanations) is among the most prominent model-agnostic approaches, generating explanations by approximating the behavior of black-box models around specific instances. Despite its popularity, LIME faces challenges related to fidelity, stability, and applicability to domain-specific problems. Numerous adaptations and enhancements have been proposed to address these issues, but the growing number of developments can be overwhelming, complicating efforts to navigate LIME-related research. To the best of our knowledge, this is the first survey to comprehensively explore and collect LIME's foundational concepts and known limitations. We categorize and compare its various enhancements, offering a structured taxonomy based on intermediate steps and key issues. Our analysis provides a holistic overview of advancements in LIME, guiding future research and helping practitioners identify suitable approaches. Additionally, we provide a continuously updated interactive website (https://patrick-knab.github.io/which-lime-to-trust/), offering a concise and accessible overview of the survey.
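For readers new to the method this survey covers, the following sketch shows basic usage of the lime package's tabular explainer around a scikit-learn classifier. Parameter choices (number of features, dataset) are illustrative defaults, not recommendations from the survey.

```python
# Basic LIME usage on a tabular classifier (illustrative defaults; requires the `lime` package).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
X, y = data.data, data.target
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    training_data=X,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
# Explain one instance: LIME fits a local surrogate model around this point.
exp = explainer.explain_instance(X[0], clf.predict_proba, num_features=5)
print(exp.as_list())          # top features with their local weights
```

Many of the enhancements surveyed here modify exactly these steps: how the neighborhood around the instance is sampled, how the surrogate is fit, and how stable the resulting weights are across runs.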
Authors:Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier
Abstract:
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link that the $\ell_2$-norm of the sparse feature vector can be approximated with the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
中文: 本文揭示了稀疏与稠密向量$\ell_2$范数间的理论关联,使得稀疏自编码器无需手动设定$\ell_0$超参数即可训练,并提出了新的评估方法和动态激活函数,在实验中优于当前最优的$k$-稀疏自编码器。
English: This paper establishes a theoretical connection between the $\ell_2$-norms of sparse and dense vectors, enabling sparse autoencoders to be trained without manually setting the $\ell_0$ hyperparameter, and introduces both a new evaluation method and a dynamic activation function that outperforms current $k$-sparse autoencoders.
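To illustrate the contrast the paper draws, the PyTorch sketch below implements a plain k-sparse autoencoder activation alongside a dynamic variant that keeps the largest activations until their l2 norm roughly matches the input embedding's norm. This is only a rough illustration of the "no fixed k" idea under that norm-matching assumption; it is not the paper's AFA/top-AFA formulation, and the class and method names are invented here.

```python
# Sketch: a sparse autoencoder with (a) fixed top-k activation and (b) a per-input dynamic k
# chosen so the retained activations' l2 norm roughly matches the input's l2 norm.
# Illustration of the idea only, not the paper's exact AFA/top-AFA definition.
from typing import Optional

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def topk_fixed(self, h: torch.Tensor, k: int) -> torch.Tensor:
        vals, idx = torch.topk(h, k, dim=-1)
        return torch.zeros_like(h).scatter(-1, idx, vals)

    def topk_norm_budget(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Keep the largest activations until their cumulative l2 norm reaches ||x||_2.
        sorted_h, idx = torch.sort(h.relu(), dim=-1, descending=True)
        cum_norm = torch.sqrt(torch.cumsum(sorted_h ** 2, dim=-1))
        budget = x.norm(dim=-1, keepdim=True)
        keep = cum_norm <= budget
        keep[..., 0] = True                      # always keep at least one feature
        kept = torch.where(keep, sorted_h, torch.zeros_like(sorted_h))
        return torch.zeros_like(h).scatter(-1, idx, kept)

    def forward(self, x: torch.Tensor, k: Optional[int] = None) -> torch.Tensor:
        h = self.enc(x)
        a = self.topk_fixed(h, k) if k is not None else self.topk_norm_budget(h, x)
        return self.dec(a)

sae = SparseAutoencoder(d_model=768, d_hidden=4096)
x = torch.randn(8, 768)
recon_fixed = sae(x, k=32)        # classic k-sparse autoencoder
recon_dynamic = sae(x)            # number of active features chosen per input
print(recon_fixed.shape, recon_dynamic.shape)
```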
Authors:Karim Radouane, Hanane Azzag, Mustapha lebbah
Abstract:
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: \url{https://github.com/rd20karim/MB-ORES}.
中文摘要:本研究提出了一种统一框架,将目标检测与视觉定位相结合应用于遥感图像,通过微调开放集检测器和任务感知架构,在基准数据集上实现了优于现有方法的性能,同时保持了传统检测功能。
English Summary: This study introduces a unified framework that combines object detection and visual grounding for remote sensing imagery, utilizing a fine-tuned open-set detector and a task-aware architecture to achieve state-of-the-art performance on benchmark datasets while preserving traditional detection capabilities.
Authors:Jianhao Li, Xianchao Xiu
Abstract:
Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a new hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, it feeds data samples into LLMs and directly calls traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy leverages the contextual understanding of LLMs and the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, even surpassing both LLM-only and traditional data-driven methods. Finally, we point out the limitations of its application in decision-making. Our code is available at https://github.com/xianchaoxiu/LLM4FS.
中文: 本文提出LLM4FS混合策略,将大语言模型与传统数据驱动方法相结合,通过整合语境理解能力和统计可靠性优势,在特征选择任务中实现卓越性能,同时指出了该方法在决策应用中的局限性。
English: This paper introduces LLM4FS, a hybrid strategy that combines large language models with traditional data-driven methods to enhance automated feature selection by leveraging their respective strengths in contextual understanding and statistical reliability, achieving superior performance despite noted limitations in decision-making applications.
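The traditional data-driven half that such a hybrid strategy calls is standard scikit-learn machinery: importance ranking from a random forest and forward sequential selection. The sketch below shows only that half on a stock dataset; the LLM ranking step and the way LLM4FS combines the two are not reproduced here.

```python
# Sketch of the traditional data-driven selectors a hybrid LLM strategy might call:
# random-forest importance ranking and forward sequential selection (scikit-learn).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True)

# 1) Importance ranking from a random forest.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
top_by_importance = np.argsort(rf.feature_importances_)[::-1][:10]

# 2) Forward sequential selection wrapped around the same estimator.
sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=10,
    direction="forward",
    cv=3,
)
sfs.fit(X, y)
top_by_sfs = np.flatnonzero(sfs.get_support())

print("importance-ranked:", sorted(top_by_importance.tolist()))
print("forward-selected: ", top_by_sfs.tolist())
```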
Authors:Dominik Schnaus, Nikita Araslanov, Daniel Cremers
Abstract:
The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e. without parallel data. We present the first feasibility study, and investigate conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can be indeed matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
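A toy version of the "blind" matching setup can be written as a quadratic assignment problem over within-modality pairwise-distance matrices and handed to SciPy's FAQ heuristic. In the sketch below, synthetic embeddings stand in for vision and language features, and the "language" space is a noisy rotation of the shuffled "vision" space; the paper's own solver is a novel heuristic not reproduced here.

```python
# Toy sketch: unsupervised ("blind") matching of two embedding spaces as a quadratic
# assignment problem over their pairwise-distance matrices, solved with SciPy's FAQ heuristic.
# Synthetic embeddings stand in for vision/language features.
import numpy as np
from scipy.optimize import quadratic_assignment
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n, d = 30, 16
vision = rng.normal(size=(n, d))
perm = rng.permutation(n)
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
# "Language" embeddings: a noisy rotation of the vision space, in shuffled order.
language = vision[perm] @ rotation + 0.01 * rng.normal(size=(n, d))

A = squareform(pdist(vision))      # within-modality distance structure
B = squareform(pdist(language))

# Maximizing trace(A P B P^T) minimizes the squared mismatch between the two structures.
res = quadratic_assignment(A, B, method="faq", options={"maximize": True})
recovered = res.col_ind
print("correctly matched:", int(np.sum(recovered == np.argsort(perm))), "of", n)
```

Because rotation preserves pairwise distances, the two distance matrices agree up to the unknown permutation, which is exactly the structure the QAP objective exploits.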
Authors:Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen
Abstract:
Key-Value cache (\texttt{KV} \texttt{cache}) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of \texttt{KV} \texttt{cache} to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream \texttt{KV} \texttt{cache} compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for \texttt{KV} \texttt{cache} compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative \texttt{KV} \texttt{cache} compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing \texttt{KV} \texttt{cache} can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing \texttt{KV} \texttt{cache} may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in \texttt{KV} \texttt{cache} compression when handling specific LLM tasks. Third, we provide tools to shed light on future \texttt{KV} \texttt{cache} compression studies and facilitate their practical deployment in production. They are open-sourced in \href{https://github.com/LLMkvsys/rethink-kv-compression}{https://github.com/LLMkvsys/rethink-kv-compression}.
中文: KV缓存压缩技术虽能降低大语言模型服务中的内存消耗,但在实际部署中存在吞吐量不足和延迟增加等问题,本文通过全面评估并开源工具以推动其未来发展。
English: KV cache compression reduces memory usage in LLM serving but faces practical deployment challenges, including suboptimal throughput and increased latency, prompting a comprehensive evaluation and open-source tools for future improvements.
Authors:Sebastian Springer, Andre Scaffidi, Maximilian Autenrieth, Gabriella Contardo, Alessandro Laio, Roberto Trotta, Heikki Haario
Abstract:
Detecting localized density differences in multivariate data is a crucial task in computational science. Such anomalies can indicate a critical system failure, lead to a groundbreaking scientific discovery, or reveal unexpected changes in data distribution. We introduce EagleEye, an anomaly detection method to compare two multivariate datasets with the aim of identifying local density anomalies, namely over- or under-densities affecting only localised regions of the feature space. Anomalies are detected by modelling, for each point, the ordered sequence of its neighbours' membership label as a coin-flipping process and monitoring deviations from the expected behaviour of such process. A unique advantage of our method is its ability to provide an accurate, entirely unsupervised estimate of the local signal purity. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets. In synthetic data, EagleEye accurately detects anomalies in multiple dimensions even when they affect a tiny fraction of the data. When applied to a challenging resonant anomaly detection benchmark task in simulated Large Hadron Collider data, EagleEye successfully identifies particle decay events present in just 0.3% of the dataset. In global temperature data, EagleEye uncovers previously unidentified, geographically localised changes in temperature fields that occurred in the most recent years. Thanks to its key advantages of conceptual simplicity, computational efficiency, trivial parallelisation, and scalability, EagleEye is widely applicable across many fields.
中文: EagleEye是一种无监督异常检测方法,通过将邻域成员标签序列建模为抛硬币过程来识别多元数据集中的局部密度差异,在合成和真实数据实验中均能有效检测细微异常。
English: EagleEye is an unsupervised anomaly detection method that identifies localized density differences in multivariate datasets by modeling neighbor membership sequences as a coin-flipping process, demonstrating high accuracy in detecting subtle anomalies across synthetic and real-world applications.
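The core intuition can be sketched as follows: pool a reference and a test dataset, look at each test point's nearest neighbours in the pooled data, and flag points whose share of test-set neighbours is improbable under a binomial null. EagleEye itself models the ordered sequence of neighbour labels; the version below collapses that sequence to a single count, so it is a simplification, not the method.

```python
# Sketch of the core idea: pool two datasets, inspect each point's k nearest neighbours,
# and flag points whose share of "test" neighbours is improbable under a binomial null.
# (EagleEye models the ordered neighbour-label sequence; this is a simplification.)
import numpy as np
from scipy.stats import binom
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 2))
test = rng.normal(size=(2000, 2))
test[:60] += np.array([3.0, 3.0])             # a small localized over-density ("anomaly")

X = np.vstack([reference, test])
labels = np.concatenate([np.zeros(len(reference)), np.ones(len(test))])  # 1 = test set

k = 40
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(test)                  # neighbours of each test point (incl. itself)
test_neighbour_counts = labels[idx[:, 1:]].sum(axis=1)

p_test = len(test) / len(X)                   # null probability that a neighbour is a test point
pvals = binom.sf(test_neighbour_counts - 1, k, p_test)   # P(count >= observed) under the null
flagged = np.flatnonzero(pvals < 1e-3)
print("flagged points:", len(flagged), "first indices:", flagged[:10].tolist())
```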
Authors:Nicolas Gillis, Margherita Porcelli, Giovanni Seraghiti
Abstract:
Nonlinear matrix decomposition (NMD) with the ReLU function, denoted ReLU-NMD, is the following problem: given a sparse, nonnegative matrix $X$ and a factorization rank $r$, identify a rank-$r$ matrix $\Theta$ such that $X\approx \max(0,\Theta)$. This decomposition finds application in data compression, matrix completion with entries missing not at random, and manifold learning. The standard ReLU-NMD model minimizes the least squares error, that is, $\|X - \max(0,\Theta)\|_F^2$. The corresponding optimization problem is nondifferentiable and highly nonconvex. This motivated Saul to propose an alternative model, Latent-ReLU-NMD, where a latent variable $Z$ is introduced and satisfies $\max(0,Z)=X$ while minimizing $\|Z - \Theta\|_F^2$ (``A nonlinear matrix decomposition for mining the zeros of sparse data'', SIAM J. Math. Data Sci., 2022). Our first contribution is to show that the two formulations may yield different low-rank solutions $\Theta$; in particular, we show that Latent-ReLU-NMD can be ill-posed when ReLU-NMD is not, meaning that there are instances in which the infimum of Latent-ReLU-NMD is not attained while that of ReLU-NMD is. We also consider another alternative model, called 3B-ReLU-NMD, which parameterizes $\Theta=WH$, where $W$ has $r$ columns and $H$ has $r$ rows, allowing one to get rid of the rank constraint in Latent-ReLU-NMD. Our second contribution is to prove the convergence of a block coordinate descent (BCD) applied to 3B-ReLU-NMD and referred to as BCD-NMD. Our third contribution is a novel extrapolated variant of BCD-NMD, dubbed eBCD-NMD, which we prove is also convergent under mild assumptions. We illustrate the significant acceleration effect of eBCD-NMD compared to BCD-NMD, and also show that eBCD-NMD performs well against the state of the art on synthetic and real-world data sets.
中文: 本文分析了基于ReLU的非线性矩阵分解,揭示了Latent-ReLU-NMD模型相比标准ReLU-NMD可能不适定,并提出了一种可证明收敛的外推块坐标下降法(eBCD-NMD),在合成和真实数据集上显著提升了性能。
English: This paper analyzes ReLU-based nonlinear matrix decomposition (NMD), revealing that the Latent-ReLU-NMD model can be ill-posed compared to standard ReLU-NMD, and introduces a provably convergent extrapolated block coordinate descent method (eBCD-NMD) that significantly accelerates performance on synthetic and real-world datasets.
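A minimal NumPy sketch of the latent-variable idea is given below: keep $Z = X$ on the support of $X$, cap $Z$ at zero elsewhere using the current $\Theta$, then refit a rank-$r$ $\Theta$ by truncated SVD. This is the simple alternating baseline often used for such models, not the paper's BCD-NMD or eBCD-NMD, and the function name is invented here.

```python
# Minimal alternating sketch of the latent-variable idea behind Latent-ReLU-NMD:
# keep Z = X where X > 0, set Z = min(0, Theta) elsewhere, then refit a rank-r Theta
# by truncated SVD. Simple baseline only, not the paper's BCD-NMD / eBCD-NMD.
import numpy as np

def relu_nmd_naive(X: np.ndarray, r: int, n_iter: int = 100) -> np.ndarray:
    mask = X > 0
    # Initialize Theta with a rank-r truncated SVD of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Theta = U[:, :r] * s[:r] @ Vt[:r]
    for _ in range(n_iter):
        Z = np.where(mask, X, np.minimum(0.0, Theta))     # latent update: max(0, Z) = X
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)  # rank-r refit of Theta
        Theta = U[:, :r] * s[:r] @ Vt[:r]
    return Theta

rng = np.random.default_rng(0)
W_true = rng.normal(size=(100, 5))
H_true = rng.normal(size=(5, 80))
X = np.maximum(0.0, W_true @ H_true)                      # sparse nonnegative data

Theta = relu_nmd_naive(X, r=5)
rel_err = np.linalg.norm(X - np.maximum(0.0, Theta)) / np.linalg.norm(X)
print(f"relative ReLU-NMD error: {rel_err:.4f}")
```

The 3B-ReLU-NMD parameterization replaces the truncated SVD step with separate updates of $W$ and $H$, which is what the convergence analysis of BCD-NMD addresses.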
Authors:Fabian L. Thiemann, Thiago Reschützegger, Massimiliano Esposito, Tseden Taddese, Juan D. Olarte-Plata, Fausto Martelli
Abstract:
Molecular dynamics (MD) simulations play a crucial role in scientific research. Yet their computational cost often limits the timescales and system sizes that can be explored. Most data-driven efforts have been focused on reducing the computational cost of accurate interatomic forces required for solving the equations of motion. Despite their success, however, these machine learning interatomic potentials (MLIPs) are still bound to small time-steps. In this work, we introduce TrajCast, a transferable and data-efficient framework based on autoregressive equivariant message passing networks that directly updates atomic positions and velocities lifting the constraints imposed by traditional numerical integration. We benchmark our framework across various systems, including a small molecule, crystalline material, and bulk liquid, demonstrating excellent agreement with reference MD simulations for structural, dynamical, and energetic properties. Depending on the system, TrajCast allows for forecast intervals up to $30\times$ larger than traditional MD time-steps, generating over 15 ns of trajectory data per day for a solid with more than 4,000 atoms. By enabling efficient large-scale simulations over extended timescales, TrajCast can accelerate materials discovery and explore physical phenomena beyond the reach of traditional simulations and experiments. An open-source implementation of TrajCast is accessible under https://github.com/IBM/trajcast.
中文摘要:TrajCast采用自回归等变消息传递网络直接更新原子位置和速度,相比传统分子动力学方法将时间步长提升高达30倍,在保持精度的同时实现了更高效的大尺度材料模拟。
English Summary: TrajCast is a novel framework using autoregressive equivariant networks to directly update atomic positions and velocities, enabling simulations with time-steps up to 30 times larger than traditional molecular dynamics while maintaining accuracy across various material systems.
Authors:Anirudh Satheesh, Keenan Powell
Abstract:
Traffic congestion in modern cities is exacerbated by the limitations of traditional fixed-time traffic signal systems, which fail to adapt to dynamic traffic patterns. Adaptive Traffic Signal Control (ATSC) algorithms have emerged as a solution by dynamically adjusting signal timing based on real-time traffic conditions. However, the main limitation of such methods is that they are not transferable to environments under real-world constraints, such as balancing efficiency, minimizing collisions, and ensuring fairness across intersections. In this paper, we view the ATSC problem as a constrained multi-agent reinforcement learning (MARL) problem and propose a novel algorithm named Multi-Agent Proximal Policy Optimization with Lagrange Cost Estimator (MAPPO-LCE) to produce effective traffic signal control policies. Our approach integrates the Lagrange multipliers method to balance rewards and constraints, with a cost estimator for stable adjustment. We also introduce three constraints on the traffic network: GreenTime, GreenSkip, and PhaseSkip, which penalize traffic policies that do not conform to real-world scenarios. Our experimental results on three real-world datasets demonstrate that MAPPO-LCE outperforms three baseline MARL algorithms across all environments and traffic constraints (improving on MAPPO by 12.60%, IPPO by 10.29%, and QTRAN by 13.10%). Our results show that constrained MARL is a valuable tool for traffic planners to deploy scalable and efficient ATSC methods in real-world traffic networks. We provide code at https://github.com/Asatheesh6561/MAPPO-LCE.
中文: 本文提出MAPPO-LCE算法,通过约束多智能体强化学习动态调整交通信号,在平衡现实约束的同时显著优于现有方法,为实际交通网络提供了可扩展的解决方案。
English: The paper proposes MAPPO-LCE, a constrained multi-agent reinforcement learning algorithm that dynamically adjusts traffic signals while balancing real-world constraints, demonstrating superior performance over existing methods in real-world traffic networks.
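The Lagrangian mechanics underlying this kind of constrained RL can be sketched in a few lines: the multiplier grows whenever the average constraint cost exceeds its limit, and the policy optimizes reward minus the multiplier times cost. The toy loop below uses random numbers in place of rollouts; the PPO machinery, the cost estimator, and the GreenTime/GreenSkip/PhaseSkip constraints are all omitted.

```python
# Toy sketch of the Lagrangian update behind constrained policy optimization:
# dual ascent on the multiplier, primal objective = reward - lambda * cost.
import numpy as np

rng = np.random.default_rng(0)
cost_limit = 0.2                   # allowed average constraint violation
lmbda, lr_lambda = 0.0, 0.05

for iteration in range(50):
    # Stand-ins for a batch of rollouts: per-step rewards and constraint costs.
    rewards = rng.normal(loc=1.0, scale=0.3, size=256)
    costs = rng.random(256) * max(0.1, 0.6 - 0.01 * iteration)   # costs shrink over training

    shaped_return = rewards.mean() - lmbda * costs.mean()        # what the policy would maximize
    # Dual ascent on the multiplier, projected to stay nonnegative.
    lmbda = max(0.0, lmbda + lr_lambda * (costs.mean() - cost_limit))

print(f"final lambda = {lmbda:.3f}, shaped return = {shaped_return:.3f}, "
      f"last avg cost = {costs.mean():.3f}, limit = {cost_limit}")
```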
Authors:Maximilian Augustin, Yannic Neuhaus, Matthias Hein
Abstract:
Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the ``natural image manifold'' to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.
Authors:Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan
Abstract:
Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, with little attention given to ULD as an independent area of study. This survey fills that gap by offering a comprehensive review of ULD, examining unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations and practical applications. We compare and contrast different ULD approaches, analyzing their strengths, limitations, and trade-offs related to unlearnability, imperceptibility, efficiency and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility with model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.
中文摘要:本综述首次将不可学习数据(ULD)作为独立领域进行全面探讨,系统梳理了其生成方法、评估指标及应用场景,并剖析了现有技术的局限性及未来研究方向,以推动机器学习数据保护能力的发展。
English Summary: This survey provides a comprehensive review of unlearnable data (ULD) as an independent defense technique, covering generation methods, benchmarks, and applications while analyzing trade-offs and future directions to enhance data protection in machine learning.
Authors:Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi
Abstract:
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
Authors:Ivan Anokhin, Rishav Rishav, Matthew Riemer, Stephen Chung, Irina Rish, Samira Ebrahimi Kahou
Abstract:
Real-time reinforcement learning (RL) introduces several challenges. First, policies are constrained to a fixed number of actions per second due to hardware limitations. Second, the environment may change while the network is still computing an action, leading to observational delay. The first issue can partly be addressed with pipelining, leading to higher throughput and potentially better policies. However, the second issue remains: if each neuron operates in parallel with an execution time of $\tau$, an $N$-layer feed-forward network experiences observation delay of $\tau N$. Reducing the number of layers can decrease this delay, but at the cost of the network's expressivity. In this work, we explore the trade-off between minimizing delay and the network's expressivity. We present a theoretically motivated solution that leverages temporal skip connections combined with history-augmented observations. We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across various neuron execution times, reinforcement learning algorithms, and environments, including four Mujoco tasks and all MinAtar games. Moreover, we demonstrate parallel neuron computation can accelerate inference by 6-350% on standard hardware. Our investigation into temporal skip connections and parallel computations paves the way for more efficient RL agents in real-time settings.
中文摘要:本研究通过引入时间跳跃连接和历史增强观测的方法,解决了实时强化学习中观测延迟与网络表达能力之间的权衡问题,该方案在不同环境中均表现出色,并通过并行神经元计算实现了显著的推理加速。
English Summary: This study addresses the trade-off between observational delay and network expressivity in real-time reinforcement learning by proposing a solution using temporal skip connections and history-augmented observations, which achieves robust performance across various settings and enables significant inference acceleration through parallel neuron computation.
Authors:Haiduo Huang, Yadong Zhang, Pengju Ren
Abstract:
Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We also observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived ``child'' layers generated from a shared ``parent'' convolutional kernel through an adapter.
To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves state-of-the-art accuracy-efficiency balance among dynamic convolution variants. Our codes are available at https://github.com/haiduo/KernelDNA.
中文: KernelDNA提出了一种轻量级卷积插件,通过跨层权重共享和基于适配器的调制实现动态核适应,在不改变标准卷积结构的情况下达到了最优的精度-效率平衡。
English: KernelDNA introduces a lightweight convolution plug-in that enables dynamic kernel adaptation through cross-layer weight sharing and adapter-based modulation, achieving state-of-the-art accuracy-efficiency balance without altering standard convolution structures.
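The weight-sharing idea can be illustrated with a small PyTorch sketch: one shared "parent" Conv2d kernel, and each "child" layer applies a lightweight static per-channel adapter plus an input-dependent gate before a standard convolution. The adapter and routing design here are illustrative assumptions; KernelDNA's exact modules may differ, and the class name is invented.

```python
# Sketch of the parent/child weight-sharing idea: one shared parent conv kernel, each child
# layer applying a static per-channel adapter and an input-dependent gate before a standard
# convolution. Illustrative only; the exact adapter/routing design in KernelDNA may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChildConv(nn.Module):
    def __init__(self, parent_weight: torch.Tensor):
        super().__init__()
        self.parent_weight = parent_weight                      # shared parameter from the parent layer
        out_ch, in_ch = parent_weight.shape[:2]
        self.static_scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))   # adapter: static modulation
        self.router = nn.Sequential(                            # adapter: dynamic, input-dependent gate
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, out_ch), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.parent_weight * self.static_scale              # modulate the shared kernel
        y = F.conv2d(x, w, padding=self.parent_weight.shape[-1] // 2)
        gate = self.router(x).unsqueeze(-1).unsqueeze(-1)       # per-channel dynamic gate
        return y * gate

parent = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
child_a = ChildConv(parent.weight)
child_b = ChildConv(parent.weight)                              # second child reuses the same kernel

x = torch.randn(2, 64, 32, 32)
print(child_a(x).shape, child_b(x).shape)
```

Because the convolution itself stays standard, the dynamic behavior comes entirely from cheap per-channel modulation rather than from ensembling multiple kernels.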
Authors:Björn Möller, Lucas Görnhardt, Tim Fingscheidt
Abstract:
Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representative power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). For many real-world SR applications, the availability of high-quality HR training images is not given, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only low-resolution (LR) images for model training. For a 4x super-resolution, this effectively reduces the amount of available training data to 6.25% of the HR image pixels, which puts the employment of a data-hungry transformer model into question. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods addressing the unsupervised SISR LR-only benchmark. We adopt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and prove its effectiveness both for a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far: CNN-based) LR-only SISR methods. The code is available on GitHub: https://github.com/ifnspaml/SuperResolutionMultiscaleTraining.
中文: 本研究提出了一种轻量级视觉变换器及多尺度训练方法,用于无监督单图像超分辨率任务,在仅使用低分辨率图像训练的条件下超越了现有基于卷积神经网络的方法性能。
English: This study introduces a lightweight vision transformer with a multi-scale training method for unsupervised single-image super-resolution, achieving superior performance on LR-only benchmarks compared to existing CNN-based approaches.
Authors:Reza Esfandiarpoor, George Zerveas, Ruochen Zhang, Macton Mgonzo, Carsten Eickhoff, Stephen H. Bach
Abstract:
Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
中文摘要:本研究提出了一种新方法,通过使用大语言模型生成具有分级相关性的全合成文档,结合列表式Wasserstein损失函数训练密集检索器,该方法显著优于传统对比学习方法,在达到与真实标注数据训练相当性能的同时,对分布偏移表现出更强的鲁棒性。
English Summary: This study introduces a novel approach to training dense retrievers by generating fully synthetic documents with graduated relevance levels using LLMs, which, combined with a list-wise Wasserstein loss, significantly outperforms traditional contrastive learning methods and matches the performance of training on real labeled data while offering greater robustness to distribution shifts.
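The list-wise objective can be sketched as a differentiable 1D Wasserstein distance between the retriever's score distribution over a candidate list and a target distribution built from graded relevance labels, computed via cumulative-distribution differences. This is an illustration under those assumptions; the paper's precise loss and the LLM-generated documents are not reproduced, and the function name is invented.

```python
# Sketch: a list-wise 1D Wasserstein loss between predicted scores over a candidate list and
# a target distribution from graded relevance labels (CDF-difference form). Illustrative only.
import torch
import torch.nn.functional as F

def listwise_wasserstein(scores: torch.Tensor, relevance: torch.Tensor) -> torch.Tensor:
    """scores, relevance: (batch, list_size); higher relevance = more relevant."""
    pred = F.softmax(scores, dim=-1)
    target = relevance / relevance.sum(dim=-1, keepdim=True)
    # 1D Wasserstein-1 over list positions = l1 distance between the two CDFs.
    cdf_diff = torch.cumsum(pred, dim=-1) - torch.cumsum(target, dim=-1)
    return cdf_diff.abs().sum(dim=-1).mean()

# Example: 4 synthetic documents per query with graded relevance 3 > 2 > 1 > 0.
scores = torch.randn(8, 4, requires_grad=True)          # retriever similarities (query vs docs)
relevance = torch.tensor([[3.0, 2.0, 1.0, 0.0]]).repeat(8, 1)
loss = listwise_wasserstein(scores, relevance)
loss.backward()
print(float(loss), scores.grad.shape)
```

Unlike InfoNCE with a single positive, every document in the list contributes to the gradient in proportion to how far the predicted mass deviates from its graded relevance.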
Authors:Zewen Liu, Xiaoda Wang, Bohan Wang, Zijie Huang, Carl Yang, Wei Jin
Abstract:
Graph Neural Networks (GNNs) and differential equations (DEs) are two rapidly advancing areas of research that have shown remarkable synergy in recent years. GNNs have emerged as powerful tools for learning on graph-structured data, while differential equations provide a principled framework for modeling continuous dynamics across time and space. The intersection of these fields has led to innovative approaches that leverage the strengths of both, enabling applications in physics-informed learning, spatiotemporal modeling, and scientific computing. This survey aims to provide a comprehensive overview of the burgeoning research at the intersection of GNNs and DEs. We categorize existing methods, discuss their underlying principles, and highlight their applications across domains such as molecular modeling, traffic prediction, and epidemic spreading. Furthermore, we identify open challenges and outline future research directions to advance this interdisciplinary field. A comprehensive paper list is provided at https://github.com/Emory-Melody/Awesome-Graph-NDEs. This survey serves as a resource for researchers and practitioners seeking to understand and contribute to the fusion of GNNs and DEs.
中文摘要:本综述系统探讨图神经网络与微分方程的协同融合,分类研究方法与应用领域,并为这一交叉学科指明未来发展方向。
English Summary: This survey comprehensively explores the synergistic integration of Graph Neural Networks and differential equations, categorizing methods and applications while identifying future research directions for this interdisciplinary field.
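A minimal example of the kind of continuous dynamics these methods build on is heat diffusion on a graph, dX/dt = -L X, integrated with explicit Euler steps. Graph neural DEs make the right-hand side learnable; the fixed-Laplacian version below is only meant to show the continuous-dynamics ingredient.

```python
# Minimal example of continuous dynamics on a graph, the ingredient that graph neural DEs
# make learnable: fixed heat diffusion dX/dt = -L X, integrated with explicit Euler steps.
import numpy as np

# A small undirected graph: 4 nodes in a path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                   # combinatorial graph Laplacian

X = np.array([[1.0], [0.0], [0.0], [0.0]])  # initial node features (all "heat" on node 0)
dt, steps = 0.05, 200
for _ in range(steps):
    X = X + dt * (-L @ X)                   # Euler step of dX/dt = -L X

print(X.ravel())                            # features diffuse toward the graph average (0.25)
```

Replacing -L X with a learned, attention-weighted message function and the Euler loop with an adaptive ODE solver yields the GRAND-style models discussed in the survey.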
Authors:Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken
Abstract:
Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/CodeARC
中文: CodeARC提出了一个交互式评估框架,用于归纳程序合成,智能体通过隐藏目标函数的反馈迭代优化方案,实验表明最佳模型成功率仅达52.7%,而微调可带来显著性能提升。
English: CodeARC introduces an interactive evaluation framework for inductive program synthesis, where agents iteratively refine functions using feedback from a hidden target, with experiments showing top-performing models achieving 52.7% success and fine-tuning yielding significant improvements.
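The interactive protocol can be illustrated with a toy loop: the agent queries a hidden target function on fresh inputs, proposes a candidate, and a differential-testing oracle searches for a counterexample that becomes feedback for the next attempt. In the sketch below, the LLM synthesis step is replaced by a fixed list of hypothetical candidates; it is not the benchmark's harness.

```python
# Toy sketch of the interactive loop: query a hidden target, propose a candidate, and let a
# differential-testing oracle search for counterexamples. The LLM synthesis step is replaced
# by a fixed list of hypothetical candidate programs.
import random

def hidden_target(x: int) -> int:            # the agent can only query this as a black box
    return 3 * x + 1

candidates = [
    lambda x: x + 1,
    lambda x: 2 * x + 1,
    lambda x: 3 * x + 1,
]

def differential_test(candidate, oracle, trials: int = 200):
    """Return a counterexample input, or None if none was found."""
    rng = random.Random(0)
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        if candidate(x) != oracle(x):
            return x
    return None

observations = []                             # (input, output) feedback gathered so far
for i, candidate in enumerate(candidates):
    counterexample = differential_test(candidate, hidden_target)
    if counterexample is None:
        print(f"candidate {i} is consistent with the oracle on all sampled inputs")
        break
    observations.append((counterexample, hidden_target(counterexample)))
    print(f"candidate {i} refuted by input {counterexample}; feedback recorded")
```

The key difference from static programming-by-example benchmarks is that the counterexamples are generated interactively, so the agent can self-correct instead of being scored once against a fixed test set.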
Authors:Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee
Abstract:
We introduce SupertonicTTS, a novel text-to-speech (TTS) system designed for efficient and streamlined speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. The TTS pipeline is further simplified by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we propose context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment with minimal memory and I/O overhead. Experimental results demonstrate that SupertonicTTS delivers performance comparable to contemporary zero-shot TTS models with only 44M parameters, while significantly reducing architectural complexity and computational cost. Audio samples are available at: https://supertonictts.github.io/.
Authors:Paul Caillon, Erwan Fagnou, Alexandre Allauzen
Abstract:
Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers at comparable parameter budgets. However, the recursive gradient computation with the backpropagation through time (or BPTT) algorithm remains the major computational bottleneck. In this work, we propose a novel method that replaces BPTT with a fixed gradient feedback mechanism, yielding an efficient approximation of the exact gradient propagation based on the assumption of time stationarity. Our approach leverages state-space model (SSM) principles to define a structured feedback matrix that directly propagates gradients from future time steps. This formulation bypasses the need for recursive gradient backpropagation, significantly reducing training overhead while preserving the network's ability to capture long-term dependencies. The experiments on language modeling benchmarks exhibit competitive perplexity scores, while significantly reducing the training costs. These promising results suggest that designing a feedback method like an SSM can fully exploit the efficiency advantages of RNNs for many practical applications.
中文: 本文提出一种基于状态空间模型的固定梯度反馈方法,替代循环神经网络中计算成本高的时间反向传播,以降低训练成本的同时保持竞争力。
English: This paper introduces a fixed gradient feedback method based on state-space model principles to replace the computationally expensive backpropagation through time in recurrent neural networks, achieving competitive performance with reduced training costs.
Authors:Yuyang Liang, Yankai Chen, Yixiang Fang, Laks V. S. Lakshmanan, Chenhao Ma
Abstract:
Electronic Health Records (EHR) have become a valuable resource for a wide range of predictive tasks in healthcare. However, existing approaches have largely focused on inter-visit event predictions, overlooking the importance of intra-visit nowcasting, which provides prompt clinical insights during an ongoing patient visit. To address this gap, we introduce the task of laboratory measurement prediction within a hospital visit. We focus on laboratory data, which has remained underexplored in previous work. We propose TRACE, a Transformer-based model designed for clinical event nowcasting by encoding patient trajectories. TRACE effectively handles long sequences and captures temporal dependencies through a novel timestamp embedding that integrates decay properties and periodic patterns of data. Additionally, we introduce a smoothed mask for denoising, improving the robustness of the model. Experiments on two large-scale electronic health record datasets demonstrate that the proposed model significantly outperforms previous methods, highlighting its potential for improving patient care through more accurate laboratory measurement nowcasting. The code is available at https://github.com/Amehi/TRACE.
中文摘要:该研究提出TRACE模型,一种基于Transformer的临床事件即时预测方法,通过处理长序列数据和整合时间衰减特性与周期模式的新型时间戳嵌入,显著提升了医院就诊期间实验室测量的预测准确性。
English Summary: The study introduces TRACE, a Transformer-based model for intra-visit laboratory measurement nowcasting in Electronic Health Records, which outperforms existing methods by handling long sequences and capturing temporal dependencies through innovative timestamp embeddings and denoising techniques.
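One way to combine decay and periodicity in a timestamp embedding is to concatenate learnable exponential-decay terms (recency) with learnable sinusoidal features (e.g., circadian rhythms). The PyTorch sketch below shows that general flavor; the actual TRACE embedding may be parameterized differently, and the class name is invented here.

```python
# Sketch: a timestamp embedding mixing learnable exponential-decay terms with sinusoidal
# (periodic) features, the general flavor of encoding both decay and periodicity.
# The actual TRACE embedding may be parameterized differently.
import torch
import torch.nn as nn

class DecayPeriodicTimeEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        self.decay_rates = nn.Parameter(torch.rand(half))        # learnable decay speeds
        self.freqs = nn.Parameter(torch.rand(half) * 2.0)        # learnable periodic frequencies
        self.phase = nn.Parameter(torch.zeros(half))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """t: (batch, seq_len) elapsed time, e.g., hours since admission."""
        t = t.unsqueeze(-1)                                       # (batch, seq_len, 1)
        decay = torch.exp(-torch.abs(self.decay_rates) * t)       # recency: older events fade
        periodic = torch.sin(self.freqs * t + self.phase)         # e.g., circadian patterns
        return torch.cat([decay, periodic], dim=-1)               # (batch, seq_len, dim)

emb = DecayPeriodicTimeEmbedding(dim=32)
timestamps = torch.cumsum(torch.rand(4, 20) * 3.0, dim=-1)        # irregular intra-visit timeline
print(emb(timestamps).shape)                                      # torch.Size([4, 20, 32])
```

The resulting embedding is added to (or concatenated with) each clinical event's representation so the Transformer can weigh measurements by both how recent they are and where they fall in daily cycles.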
Authors:Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, Kees Joost Batenburg
Abstract:
There is great interest in agentic LLMs, large language models that act as agents. We review the growing body of work in this area and provide a research agenda. Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi-agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics and financial market analysis. Meanwhile, self-reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs may provide a solution for the problem of LLMs running out of training data: inference-time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets. We note that there is risk associated with LLM assistants taking action in the real world, while agentic LLMs are also likely to benefit society.
Authors:Beibei Wang, Boyue Cui, Shiqu Chen, Xuan Wang, Yadong Wang, Junyi Li
Abstract:
Motivation: In recent years, protein function prediction has broken through the bottleneck of sequence features, significantly improving prediction accuracy using high-precision protein structures predicted by AlphaFold2. While single-species protein function prediction methods have achieved remarkable success, multi-species protein function prediction methods are still in the stage of using PPI networks and sequence features. Providing effective cross-species label propagation for species with sparse protein annotations remains a challenging issue. To address this problem, we propose the MSNGO model, which integrates structural features and network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi-species protein function prediction. Results: We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein-level structural features. After incorporating the sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely on sequence features and PPI networks. Availability: https://github.com/blingbell/MSNGO.
中文:MSNGO模型通过整合AlphaFold2预测的结构特征与序列数据,显著提升了跨物种蛋白质功能预测的准确性,优于以往依赖序列特征和蛋白质相互作用网络的方法。
English: The MSNGO model enhances multi-species protein function prediction by integrating structural features from AlphaFold2-derived contact maps with sequence data, achieving superior accuracy over traditional sequence-based and PPI network methods.
Authors:Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Abstract:
State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
中文:Quamba2是一种适用于状态空间模型的多位宽量化框架,通过离线排序聚类和权重重排技术,在实现4倍内存压缩和显著加速的同时,仅造成1.6%的精度损失,有效支持不同部署场景的需求。
English: Quamba2 is a versatile quantization framework for State Space Models that supports multiple bit-width configurations to reduce memory usage and accelerate computation while maintaining accuracy, achieving significant speed-ups and memory savings with minimal performance loss.
Authors:Nina Weng, Aasa Feragen, Siavash Bigdeli
Abstract:
Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce \emph{Patronus}, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at \href{https://github.com/nina-weng/patronus}{https://github.com/nina-weng/patronus}.
中文: Patronus是一种可解释的扩散模型,它将原型网络集成到DDPM中,通过展示学习到的原型及其对生成过程的影响来提高透明度,同时支持图像操控和检测捷径学习,且无需任何标注。
English: Patronus is an interpretable diffusion model that integrates a prototypical network into DDPMs to enhance transparency by revealing learned prototypes and their influence on image generation, while enabling manipulation and detecting shortcut learning without requiring annotations.
Authors:Yuying Duan, Gelei Xu, Yiyu Shi, Michael Lemmon
Abstract:
With the emerging application of Federated Learning (FL) in finance, hiring and healthcare, FL models are regulated to be fair, preventing disparities with respect to legally protected attributes such as race or gender. Two concepts of fairness are important in FL: global and local fairness. Global fairness addresses the disparity across the entire population and local fairness is concerned with the disparity within each client. Prior fair FL frameworks have improved either global or local fairness without considering both. Furthermore, while the majority of studies on fair FL focuses on binary settings, many real-world applications are multi-class problems. This paper proposes a framework that investigates the minimum accuracy lost for enforcing a specified level of global and local fairness in multi-class FL settings. Our framework leads to a simple post-processing algorithm that derives fair outcome predictors from the Bayesian optimal score functions. Experimental results show that our algorithm outperforms the current state of the art (SOTA) with regard to the accuracy-fairness tradeoffs, computational and communication costs. Codes are available at: https://github.com/papersubmission678/The-cost-of-local-and-global-fairness-in-FL.
中文摘要:本文提出一个联邦学习框架,在保证全局与局部公平性的同时最小化多分类场景中的精度损失,其提出的后处理算法在公平性-精度权衡及计算效率方面均优于现有方法。
English Summary: This paper introduces a framework for enforcing both global and local fairness in multi-class federated learning with minimal accuracy loss, featuring a post-processing algorithm that outperforms current methods in fairness-accuracy tradeoffs and efficiency.
Authors:Gongzhu Yin, Hongli Zhang, Yi Luo, Yuchen Yang, Kun Lu, Chao Meng
Abstract:
Temporal Knowledge Graph (TKG) forecasting is crucial for predicting future events using historical data. With the surge of Large Language Models (LLMs), recent studies have begun exploring their integration into TKG forecasting and achieved some success. However, they still face limitations such as limited input length, inefficient output generation, and resource-intensive refinement, which undermine their performance and practical applicability. To address these limitations, we introduce SPARK, a Sequence-level Proxy-Adapting framework for Refining LLMs in TKG forecasting. Inspired by inference-time algorithms adopted in controlling generation, SPARK offers a cost-effective, plug-and-play solution through two key innovations: (1) Beam Sequence-Level Generation, which reframes TKG forecasting as a top-K sequence-level generation task, using beam search for efficiently generating next-entity distribution in a single forward pass. (2) TKG Adapter for Refinement, which employs traditional TKG models as trainable proxy adapters to leverage global graph information and refine LLM outputs, overcoming both the input length and the resource-intensive fine-tuning problems. Experiments across diverse datasets validate SPARK's forecasting performance, robust generalization capabilities, and high efficiency. We release source codes at https://github.com/yin-gz/SPARK.
Chinese Summary: SPARK提出了一种经济高效的即插即用框架,通过集成束序列级生成和TKG适配器,有效解决了大型语言模型在时序知识图谱预测中的输入限制和效率低下等问题。
English Summary: SPARK introduces a cost-effective, plug-and-play framework that enhances TKG forecasting by integrating beam sequence-level generation and TKG adapters to overcome limitations of LLMs like input constraints and inefficiency.
Authors:Xu Yang, Rui Wang, Kaiwen Li, Wenhua Li, Tao Zhang, Fujun He
Abstract:
The landscape of optimization problems has become increasingly complex, necessitating the development of advanced optimization techniques. Meta-Black-Box Optimization (MetaBBO), which involves refining the optimization algorithms themselves via meta-learning, has emerged as a promising approach. Recognizing the limitations in existing platforms, we present PlatMetaX, a novel MATLAB platform for MetaBBO with reinforcement learning. PlatMetaX integrates the strengths of MetaBox and PlatEMO, offering a comprehensive framework for developing, evaluating, and comparing optimization algorithms. The platform is designed to handle a wide range of optimization problems, from single-objective to multi-objective, and is equipped with a rich set of baseline algorithms and evaluation metrics. We demonstrate the utility of PlatMetaX through extensive experiments and provide insights into its design and implementation. PlatMetaX is available at: \href{https://github.com/Yxxx616/PlatMetaX}{https://github.com/Yxxx616/PlatMetaX}.
Chinese Summary: PlatMetaX是一个新颖的MATLAB平台,它整合了MetaBox和PlatEMO的优势,为基于强化学习的元黑盒优化算法开发、评估和比较提供了全面框架,并通过大量实验验证了其实用性。
English Summary: PlatMetaX is a novel MATLAB platform that integrates MetaBox and PlatEMO to provide a comprehensive framework for developing, evaluating, and comparing meta-black-box optimization algorithms using reinforcement learning, demonstrated through extensive experiments.
Authors:Belinda Z. Li, Been Kim, Zi Wang
Abstract:
Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict ``not sure.'' This highlights the need for deeper investigation into models' information acquisition capabilities.
中文: 最新研究提出QuestBench基准,用于评估大语言模型在信息不全的推理任务中识别最小必要问题的能力,发现即使最先进的模型在逻辑和规划问题上表现不佳,尽管它们在明确定义的任务中表现出色。
English: Recent research introduces QuestBench, a benchmark evaluating LLMs' ability to identify minimal necessary questions in underspecified reasoning tasks, revealing that even state-of-the-art models struggle with logic and planning problems despite excelling in well-defined scenarios.
Authors:Jing Li, Hao Sun
Abstract:
Neural networks have emerged as a powerful paradigm for tasks in high energy physics, yet their opaque training process renders them a black box. In contrast, the traditional cut flow method offers simplicity and interpretability but requires extensive manual tuning to identify optimal cut boundaries. To merge the strengths of both approaches, we propose the Learnable Cut Flow (LCF), a neural network that transforms the traditional cut selection into a fully differentiable, data-driven process. LCF implements two cut strategies, parallel (observable distributions are treated independently) and sequential (prior cuts shape subsequent ones), to flexibly determine optimal boundaries. Building on this strategy, we introduce the Learnable Importance, a metric that quantifies feature importance and adjusts each feature's contribution to the loss accordingly, offering model-driven insights unlike ad-hoc metrics. To ensure differentiability, a modified loss function replaces hard cuts with mask operations, preserving data shape throughout the training process. LCF is tested on six varied mock datasets and a realistic diboson vs. QCD dataset. Results demonstrate that LCF (1) accurately learns cut boundaries across typical feature distributions in both parallel and sequential strategies, (2) assigns higher importance to discriminative features with minimal overlap, (3) handles redundant or correlated features robustly, and (4) performs effectively in real-world scenarios. On the diboson dataset, LCF initially underperforms boosted decision trees and multilayer perceptrons when using all observables. LCF bridges the gap between the traditional cut flow method and modern black-box neural networks, delivering actionable insights into the training process and feature importance. Source code and experimental data are available at https://github.com/Star9daisy/learnable-cut-flow.
Chinese: 可学习截断流(LCF)方法将传统的截断选择转化为可微分的数据驱动过程,结合了截断流的可解释性与神经网络的高效性,并通过实际测试提供了对特征重要性的深入理解。
English: The Learnable Cut Flow (LCF) method transforms traditional cut selection into a differentiable, data-driven process, combining the interpretability of cut flows with neural network efficiency and providing insights into feature importance through real-world testing.
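The key trick described above, replacing hard cuts with differentiable mask operations, can be illustrated with a soft sigmoid cut. The learnable threshold/direction parameterization and the product-of-masks event selection below are assumptions for illustration, not the exact LCF loss.

```python
# Minimal sketch of a differentiable "cut" as a soft mask.
import torch

class SoftCut(torch.nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.threshold = torch.nn.Parameter(torch.zeros(n_features))  # learnable cut boundaries
        self.direction = torch.nn.Parameter(torch.ones(n_features))   # sign of each cut
        self.sharpness = 10.0                                          # controls how hard the cut is

    def forward(self, x):
        # sigmoid relaxation of "x passes the cut"; shape (batch, n_features)
        mask = torch.sigmoid(self.sharpness * self.direction * (x - self.threshold))
        return mask.prod(dim=1)  # an event passes only if it passes every cut

cut = SoftCut(n_features=3)
x = torch.randn(128, 3)
y = torch.randint(0, 2, (128,)).float()
loss = torch.nn.functional.binary_cross_entropy(cut(x).clamp(1e-6, 1 - 1e-6), y)
loss.backward()  # gradients flow back to the cut boundaries
```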
Authors:Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Abstract:
Residential electricity demand forecasting is critical for efficient energy management and grid stability. Accurate predictions enable utility companies to optimize planning and operations. However, real-world residential electricity demand data often exhibit intricate temporal variability, including multiple seasonalities, periodicities, and abrupt fluctuations, which pose significant challenges for forecasting models. Previous models that rely on statistical methods, recurrent, convolutional neural networks, and transformers often struggle to capture these intricate temporal dynamics. To address these challenges, we propose the Seasonal-Periodic Decomposition Network (SPDNet), a novel deep learning framework consisting of two main modules. The first is the Seasonal-Trend Decomposition Module (STDM), which decomposes the input data into trend, seasonal, and residual components. The second is the Periodical Decomposition Module (PDM), which employs the Fast Fourier Transform to identify the dominant periods. For each dominant period, 1D input data is reshaped into a 2D tensor, where rows represent periods and columns correspond to frequencies. The 2D representations are then processed through three submodules: a 1D convolution to capture sharp fluctuations, a transformer-based encoder to model global patterns, and a 2D convolution to capture interactions between periods. Extensive experiments conducted on real-world residential electricity load data demonstrate that SPDNet outperforms traditional and advanced models in both forecasting accuracy and computational efficiency. The code is available in this repository: https://github.com/Tims2D/SPDNet.
中文: 提出的季节性周期分解网络(SPDNet)通过将数据分解为季节性、趋势和周期分量,有效应对住宅电力需求的复杂时序变化,在预测精度和计算效率上均优于现有模型。
English: The proposed Seasonal-Periodic Decomposition Network (SPDNet) effectively addresses complex temporal variations in residential electricity demand by decomposing data into seasonal, trend, and periodic components, outperforming existing models in accuracy and efficiency.
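A minimal sketch of the FFT-based dominant-period detection and the 1D-to-2D reshape described for the Periodical Decomposition Module; averaging the amplitude spectrum over the batch and truncating the remainder are simplifying assumptions.

```python
# Minimal sketch of dominant-period detection and the 1D -> 2D reshape.
import torch

def dominant_periods(x, k=2):
    """x: (batch, length). Return the k dominant periods from the amplitude spectrum."""
    amp = torch.fft.rfft(x, dim=-1).abs().mean(dim=0)   # average spectrum over the batch
    amp[0] = 0                                           # drop the DC component
    top_freqs = torch.topk(amp, k).indices
    return [x.shape[-1] // int(f) for f in top_freqs]

def to_2d(x, period):
    """Reshape (batch, length) into (batch, n_periods, period), truncating the remainder."""
    b, t = x.shape
    n = t // period
    return x[:, : n * period].reshape(b, n, period)

x = torch.sin(torch.arange(96).float() * 2 * torch.pi / 24).repeat(4, 1)
for p in dominant_periods(x):
    print(p, to_2d(x, p).shape)  # rows are periods, columns are positions within a period
```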
Authors:Diego Coello de Portugal Mecke, Haya Alyoussef, Maximilian Stubbemann, Ilia Koloiarov, Tom Hanika, Lars Schmidt-Thieme
Abstract:
Recently, Large Language Models (LLMs) have become very widespread and are used to solve a wide variety of tasks. To successfully handle these tasks, LLMs require longer training times and larger model sizes. This makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Previous methods require a retraining phase after pruning to maintain the original model's performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda's work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis is extended to cases where Wanda is no longer optimal, leading to the development of a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that depending on the training conditions, Wanda's optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/
Chinese: 本研究在Wanda剪枝方法的基础上,从理论上解释了其有效性,并提出基于输入标准差的新方法STADE,该方法具有更广泛的适用性,并通过在Llama和OPT等模型上的实验验证了理论发现。
English: This study builds on the Wanda pruning method by providing a theoretical justification for its effectiveness and introduces STADE, a new pruning approach based on input standard deviation that offers broader applicability and is validated through experiments on models like Llama and OPT.
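The Wanda score weights |W| by a per-input-channel activation statistic; the sketch below contrasts the L2-norm statistic with a standard-deviation statistic in the spirit of STADE. The exact STADE criterion is an assumption here; see the linked repository for the actual formulation.

```python
# Minimal sketch of Wanda-style pruning scores and a standard-deviation variant.
import torch

def prune_by_score(weight, x_calib, sparsity=0.5, use_std=False):
    """weight: (out, in); x_calib: (tokens, in) calibration inputs."""
    if use_std:
        stat = x_calib.std(dim=0)            # per-input-channel standard deviation (STADE-like)
    else:
        stat = x_calib.norm(p=2, dim=0)      # Wanda: per-input-channel L2 norm
    score = weight.abs() * stat.unsqueeze(0) # importance of each individual weight
    k = int(weight.numel() * sparsity)
    thresh = score.flatten().kthvalue(k).values
    mask = score > thresh                    # keep the highest-scoring weights
    return weight * mask, mask

w = torch.randn(8, 16)
x = torch.randn(256, 16)
w_pruned, mask = prune_by_score(w, x, sparsity=0.5)
```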
Authors:Dongping Liao, Xitong Gao, Yabo Xu, Chengzhong Xu
Abstract:
The increasing emphasis on privacy and data security has driven the adoption of federated learning, a decentralized approach to train machine learning models without sharing raw data. Prompt learning, which fine-tunes prompt embeddings of pretrained models, offers significant advantages in federated settings by reducing computational costs and communication overheads while leveraging the strong performance and generalization capabilities of vision-language models such as CLIP. This paper addresses the intersection of federated learning and prompt learning, particularly for vision-language models. In this work, we introduce a comprehensive framework, named FLIP, to evaluate federated prompt learning algorithms. FLIP assesses the performance of 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings demonstrate that prompt learning maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption. This work highlights the effectiveness of federated prompt learning in environments characterized by data scarcity, unseen classes, and cross-domain distributional shifts. We open-source the code for all implemented algorithms in FLIP to facilitate further research in this domain.
中文: 联邦提示学习通过FLIP框架展示了其在保护隐私的同时,利用预训练视觉语言模型以最少资源消耗实现高效训练,并在多种数据场景下保持强大的泛化能力。
English: Federated prompt learning, exemplified by the FLIP framework, enables efficient and privacy-preserving model training by leveraging pre-trained vision-language models with minimal resource usage while maintaining robust generalization across diverse data scenarios.
Authors:Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong
Abstract:
Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks at test time via entropy minimization. Our analysis demonstrates that this method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and numbers of tasks, reducing the performance gap to the individual fine-tuned models to nearly 1%.
中文摘要:AdaRank是一种自适应模型融合框架,通过熵最小化动态选择任务向量的最优奇异分量,有效减少任务间干扰,以接近最优性能的表现将性能差距缩小至约1%。
English Summary: AdaRank is an adaptive model merging framework that dynamically selects optimal singular components from task vectors through entropy minimization, effectively reducing cross-task interference and achieving near state-of-the-art performance with minimal performance gap.
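A minimal sketch of merging task vectors through their singular directions. A naive top-k truncation stands in for AdaRank's test-time, entropy-driven rank selection, which is not reproduced here.

```python
# Minimal sketch of SVD-truncated task-vector merging (naive rank choice, illustrative only).
import torch

def truncated_task_vector(finetuned, base, k):
    delta = finetuned - base                               # task vector for one fine-tuned model
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]        # keep only k singular directions

base = torch.randn(32, 32)
finetuned = [base + 0.1 * torch.randn(32, 32) for _ in range(3)]
merged = base + sum(truncated_task_vector(f, base, k=8) for f in finetuned) / len(finetuned)
```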
Authors:Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han
Abstract:
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce the landscape of thoughts, the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a model that predicts the property they observe. We showcase this advantage by adapting our tool to a lightweight verifier that evaluates the correctness of reasoning paths. Empirically, this verifier boosts the accuracy of reasoning as well as the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
Chinese Summary: 该研究提出了“思维景观”可视化工具,通过将大语言模型的推理路径转化为二维特征向量图,有效区分模型强弱与答案正误,并可通过适配验证器提升推理准确性和测试扩展效果。
English Summary: The study introduces "landscape of thoughts," a visualization tool that enables users to analyze LLM reasoning paths by mapping them as feature vectors in 2D plots, revealing performance patterns and enhancing model evaluation through an adaptable verifier.
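A minimal sketch of the state featurization: each reasoning state is represented by its distances to the answer-choice embeddings and projected with t-SNE. The random embeddings are stand-ins; in the tool itself the states and choices would come from the model under inspection.

```python
# Minimal sketch: distance-to-choices features for reasoning states, projected with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
choices = rng.normal(size=(4, 64))   # embeddings of the answer choices (assumed given)
states = rng.normal(size=(30, 64))   # embeddings of states along sampled reasoning paths

# feature vector of each state = its distance to every answer choice
features = np.linalg.norm(states[:, None, :] - choices[None, :, :], axis=-1)  # (30, 4)
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)
print(coords.shape)  # (30, 2) points ready to plot as a "landscape"
```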
Authors:Chung-En Sun, Ge Yan, Tsui-Wei Weng
Abstract:
Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce \textbf{\textit{ThinkEdit}}, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 4%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to remove the short reasoning direction. With changes to only 0.2% of the model's parameters, \textbf{\textit{ThinkEdit}} effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+6.39%), along with an overall improvement across multiple math benchmarks (+3.34%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at: https://github.com/Trustworthy-ML-Lab/ThinkEdit
中文: 近期研究发现思维链增强的大语言模型中推理过短会降低性能,为此提出的ThinkEdit方法通过选择性编辑少量注意力头权重,有效减少过短推理并显著提升数学任务准确率。
English: Recent research identifies that overly short reasoning in chain-of-thought augmented LLMs impairs performance, leading to the development of ThinkEdit, a weight-editing method that selectively modifies a small subset of attention heads to effectively reduce this issue and improve accuracy on mathematical tasks.
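The weight edit itself amounts to projecting a direction out of the selected heads' output projections. Below is a minimal sketch under the assumption that the short-reasoning direction and the affected heads have already been identified.

```python
# Minimal sketch of removing a direction from an attention head's output projection.
import torch

def remove_direction(w_out, direction):
    """w_out: (d_model, d_head) projection of one head; direction: (d_model,) vector."""
    d = direction / direction.norm()
    return w_out - torch.outer(d, d @ w_out)   # project out the component along d

w = torch.randn(512, 64)
short_dir = torch.randn(512)                   # assumed "short reasoning" direction
w_edited = remove_direction(w, short_dir)
print(((short_dir / short_dir.norm()) @ w_edited).abs().max())  # ~0 after the edit
```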
Authors:Heng Zhang, Gokhan Solak, Arash Ajoudani
Abstract:
Ensuring safety in reinforcement learning (RL)-based robotic systems is a critical challenge, especially in contact-rich tasks within unstructured environments. While the state-of-the-art safe RL approaches mitigate risks through safe exploration or high-level recovery mechanisms, they often overlook low-level execution safety, where reflexive responses to potential hazards are crucial. Similarly, variable impedance control (VIC) enhances safety by adjusting the robot's mechanical response, yet lacks a systematic way to adapt parameters, such as stiffness and damping throughout the task. In this paper, we propose Bresa, a Bio-inspired Reflexive Hierarchical Safe RL method inspired by biological reflexes. Our method decouples task learning from safety learning, incorporating a safety critic network that evaluates action risks and operates at a higher frequency than the task solver. Unlike existing recovery-based methods, our safety critic functions at a low-level control layer, allowing real-time intervention when unsafe conditions arise. The task-solving RL policy, running at a lower frequency, focuses on high-level planning (decision-making), while the safety critic ensures instantaneous safety corrections. We validate Bresa on multiple tasks including a contact-rich robotic task, demonstrating its reflexive ability to enhance safety, and adaptability in unforeseen dynamic environments. Our results show that Bresa outperforms the baseline, providing a robust and reflexive safety mechanism that bridges the gap between high-level planning and low-level execution. Real-world experiments and supplementary material are available at project website https://jack-sherman01.github.io/Bresa.
Authors:Taufiq Ahmed, Abhishek Kumar, Constantino Álvarez Casado, Anlan Zhang, Tuomo Hänninen, Lauri Loven, Miguel Bordallo López, Sasu Tarkoma
Abstract:
Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22\% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring. The code is available at: https://github.com/futurians/E-IRFS.
中文: 本文提出的E-IRFS方法通过指数加权采样策略,在线性方法RFS和IRFS基础上显著提升长尾分布中稀有类别的检测性能,在无人机应急监测任务中实现22%的性能提升。
English: This paper introduces E-IRFS, an exponentially weighted sampling method that enhances rare class detection in imbalanced datasets by outperforming linear approaches like RFS and IRFS, achieving a 22% improvement in emergency monitoring scenarios.
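A minimal sketch comparing an IRFS-style repeat factor with an exponentially rescaled variant over the geometric mean of image and instance frequencies. The exponential form and the constants t and alpha are assumptions based on the abstract, not the paper's exact formula.

```python
# Minimal sketch of linear vs. exponentially weighted repeat factors (illustrative constants).
import math

def repeat_factors(img_freq, inst_freq, t=0.001, alpha=1.0):
    f = math.sqrt(img_freq * inst_freq)                                # geometric mean of frequencies
    linear = max(1.0, math.sqrt(t / f))                                # IRFS-style repeat factor
    exponential = max(1.0, math.exp(alpha * (math.sqrt(t / f) - 1.0))) # assumed exponential rescaling
    return linear, exponential

for f_img, f_inst in [(0.5, 0.4), (0.01, 0.005), (0.0005, 0.0002)]:
    lin, exp_ = repeat_factors(f_img, f_inst)
    print(f"img={f_img}, inst={f_inst}: linear={lin:.2f}, exponential={exp_:.2f}")
```

The point of the comparison is that the exponential form separates rare classes from frequent ones more aggressively than the linear square-root rule.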
Authors:Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang
Abstract:
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and make it difficult to guarantee coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
Chinese Summary: 本研究提出了新型多模态文档摘要基准(M-DocSum-Bench),用于评估大视觉语言模型对图文交织文档的理解能力,发现现有模型在复杂多模态场景中难以保持连贯性和准确整合信息。
English Summary: This study introduces a novel Multimodal Document Summarization Benchmark (M-DocSum-Bench) to evaluate Large Vision-Language Models' ability to understand interleaved image-text documents, revealing that current models struggle with coherence and information integration in complex multimodal contexts.
Authors:Ivan Diaz, Florin Scherer, Yanik Berli, Roland Wiest, Helly Hammer, Robert Hoepner, Alejandro Leon Betancourt, Piotr Radojewski, Richard McKinley
Abstract:
In the setting of clinical imaging, differences between vendors, hospitals and sequences can yield highly inhomogeneous imaging data. In MRI in particular, voxel dimension, slice spacing and acquisition plane can vary substantially. For clinical applications, therefore, algorithms must be trained to handle data with various voxel resolutions. The usual strategy to deal with heterogeneity of resolution is harmonization: resampling imaging data to a common (usually isovoxel) resolution. This can lead to loss of fidelity arising from interpolation artifacts out-of-plane and downsampling in-plane. We present in this paper a network architecture designed to learn directly from spatially heterogeneous data, without resampling: a segmentation network based on the e3nn framework that leverages a spherical harmonic, rather than voxel-grid, parameterization of convolutional kernels, with a fixed physical radius. Networks based on these kernels can be resampled to their input voxel dimensions. We trained and tested our network on a publicly available dataset assembled from three centres, and on an in-house dataset of Multiple Sclerosis cases with a high degree of spatial inhomogeneity. We compared our approach to a standard U-Net with two strategies for handling inhomogeneous data: training directly on the data without resampling, and resampling to a common resolution of 1mm isovoxels. We show that our network is able to learn from various combinations of voxel sizes and outperforms classical U-Nets on 2D testing cases and most 3D testing cases. This shows an ability to generalize well when tested on image resolutions not seen during training. Our code can be found at: http://github.com/SCAN-NRAD/e3nn_U-Net.
中文摘要:本文提出了一种新型神经网络架构,可直接处理空间异构的MRI数据而无需重采样,利用球谐卷积核在不同体素分辨率下表现优于传统方法。
English Summary: This paper introduces a novel neural network architecture that directly learns from spatially heterogeneous MRI data without resampling, using spherical harmonic convolutional kernels to outperform traditional methods in handling diverse voxel resolutions.
Authors:Yizhen Luo, Jiashuo Wang, Siqi Fan, Zaiqing Nie
Abstract:
Structural biology relies on accurate three-dimensional biomolecular structures to advance our understanding of biological functions, disease mechanisms, and therapeutics. While recent advances in deep learning have enabled the development of all-atom foundation models for molecular modeling and generation, existing approaches face challenges in generalization due to the multi-modal nature of atomic data and the lack of comprehensive analysis of training and sampling strategies. To address these limitations, we propose PharMolixFM, a unified framework for constructing all-atom foundation models based on multi-modal generative techniques. Our framework includes three variants using state-of-the-art multi-modal generative models. By formulating molecular tasks as a generalized denoising process with task-specific priors, PharMolixFM achieves robust performance across various structural biology applications. Experimental results demonstrate that PharMolixFM-Diff achieves competitive prediction accuracy in protein-small-molecule docking (83.9% vs. 90.2% RMSD < 2Å, given pocket) with significantly improved inference speed. Moreover, we explore the empirical inference scaling law by introducing more sampling repeats or steps. Our code and model are available at https://github.com/PharMolix/OpenBioMed.
中文:PharMolixFM提出了一种统一的多模态生成框架,解决了全原子分子建模中的泛化难题,在蛋白质-小分子对接中实现了具有竞争力的精度和更快的推理速度。
English: PharMolixFM introduces a unified multi-modal generative framework that overcomes generalization challenges in all-atom molecular modeling, achieving competitive accuracy in protein-small-molecule docking with enhanced inference speed.
Authors:Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
Abstract:
Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.
中文: 本文提出测试时视觉上下文调优(VICT)方法,通过单一样本自适应和循环一致性损失重构原始任务提示,有效提升了视觉上下文学习模型在未知领域的泛化能力。
English: This paper introduces test-time Visual In-Context Tuning (VICT), a method that enhances the generalizability of visual in-context learning models to unseen domains by adapting them with a single test sample and using cycle consistency loss to recover original task prompts.
Authors:David Yifan Yao, Albert J. Zhai, Shenlong Wang
Abstract:
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
Authors:Yongce Li, Chung-En Sun, Tsui-Wei Weng
Abstract:
Large Language Models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via \textit{intervention} and \textit{abstention} respectively: \texttt{Neuron Adjust} and \texttt{Key Space Detection}. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, \texttt{Key Space Detection} achieves over 80\% relative performance drop on the forgetting skill and less than 10\% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at https://github.com/Trustworthy-ML-Lab/effective_skill_unlearning
中文: 本文提出了两种轻量级、无需训练的大语言模型技能遗忘方法,能有效消除特定技能,同时保持模型的整体能力和通用知识。
English: This paper introduces two lightweight, training-free methods for skill unlearning in large language models, which effectively eliminate specific skills while preserving overall capabilities and general knowledge.
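A minimal sketch of the abstention idea behind Key Space Detection: fit a hypercube around the FFL keys of queries that trigger the skill to be unlearned, and abstain when a new query's key falls inside it. The bounding-box fit and margin below are illustrative assumptions; the paper's detection rule may differ in detail.

```python
# Minimal sketch of hypercube-based abstention in a key space.
import torch

def fit_hypercube(skill_keys, margin=0.05):
    lo, hi = skill_keys.min(dim=0).values, skill_keys.max(dim=0).values
    pad = margin * (hi - lo)
    return lo - pad, hi + pad                     # axis-aligned box around the skill's keys

def should_abstain(query_key, lo, hi):
    return bool(((query_key >= lo) & (query_key <= hi)).all())

skill_keys = torch.randn(200, 16) + 3.0           # keys of queries triggering the skill to unlearn
lo, hi = fit_hypercube(skill_keys)
print(should_abstain(torch.randn(16) + 3.0, lo, hi))  # likely True -> abstain
print(should_abstain(torch.randn(16) - 3.0, lo, hi))  # False -> answer normally
```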
Authors:Yassir Lairgi
Abstract:
The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (The code and datasets are available at https://github.com/lairgiyassir/manazel) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
Chinese: Manazel项目通过机器学习分析13年新月可见性数据,优化了ODEH准则,以98.83%的预测准确率提升了摩洛哥伊斯兰历月初判定的精确度。
English: The Manazel project enhances the accuracy of determining Hijri months in Morocco by applying machine learning to 13 years of lunar data, achieving 98.83% predictive accuracy through refined visibility criteria.
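A minimal sketch of the classification setup, logistic regression on the Arc of Vision (ARCV) and crescent width (W). The data below are synthetic with a toy visibility rule; the real observations live in the linked repository.

```python
# Minimal sketch of crescent-visibility classification with logistic regression (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
arcv = rng.uniform(0, 12, 300)                   # Arc of Vision in degrees (synthetic)
width = rng.uniform(0, 1.2, 300)                 # crescent width (synthetic)
visible = (arcv + 8 * width > 10).astype(int)    # toy visibility rule, not the ODEH criterion

X = np.column_stack([arcv, width])
X_tr, X_te, y_tr, y_te = train_test_split(X, visible, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"accuracy: {clf.score(X_te, y_te):.3f}")
```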
Authors:Hyunjun Lee, Hyunsoo Lee, Sookwan Han
Abstract:
There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often produce suboptimal results when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused: modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification.
Authors:Yuwei Yin, EunJeong Hwang, Giuseppe Carenini
Abstract:
Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs' generation and reasoning abilities with cognitive notions.
中文: 本文提出“有意图对话”方法,通过显式生成意图来增强大语言模型的推理能力和生成质量,实验和人工评估均证实其有效性。
English: This paper proposes Speaking with Intent (SWI) for large language models, where explicit intent generation enhances reasoning and output quality, as validated by experiments and human evaluation.
Authors:David P. Hofmeyr
Abstract:
A novel and intuitive nearest neighbours based clustering algorithm is introduced, in which a cluster is defined in terms of an equilibrium condition which balances its size and cohesiveness. The formulation of the equilibrium condition allows for a quantification of the strength of alignment of each point to a cluster, with these cluster alignment strengths leading naturally to a model selection criterion which renders the proposed approach fully automatable. The algorithm is simple to implement and computationally efficient, and produces clustering solutions of extremely high quality in comparison with relevant benchmarks from the literature. R code to implement the approach is available from https://github.com/DavidHofmeyr/NNEC.
中文摘要:一种基于最近邻的新聚类算法通过平衡簇的大小和内聚性来定义簇,提供自动化模型选择并生成高质量结果,附有可用的R代码。
English Summary: A new clustering algorithm using nearest neighbors defines clusters by balancing size and cohesiveness, offering automated model selection and high-quality results with available R code.
Authors:Erik Wallin, Fredrik Kahl, Lars Hammarstrand
Abstract:
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at https://github.com/walline/prohoc.
中文摘要:本文提出了一种在类别层次结构中检测并分类分布外样本的框架,能够将未知样本归类至层次结构的内部节点,同时保持对已知类别的叶节点准确预测,并在多个数据集上验证了方法的有效性。
English Summary: This paper introduces a framework that detects out-of-distribution (OOD) samples by classifying them into appropriate internal nodes of a class hierarchy, while maintaining accurate leaf-level predictions for in-distribution data, demonstrating effectiveness across multiple datasets.
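A minimal sketch of hierarchy-aware prediction: aggregate leaf probabilities to internal nodes and fall back to the deepest sufficiently confident node for suspected OOD inputs. The thresholding rule is a simple stand-in for the paper's probabilistic model over the class hierarchy.

```python
# Minimal sketch of predicting OOD samples into internal nodes of a class hierarchy.
import numpy as np

hierarchy = {"animal": ["dog", "cat"], "vehicle": ["car", "truck"]}  # toy two-level hierarchy
leaves = ["dog", "cat", "car", "truck"]

def hierarchical_predict(leaf_probs, threshold=0.7):
    best_leaf = leaves[int(np.argmax(leaf_probs))]
    if leaf_probs.max() >= threshold:
        return best_leaf                                   # confident: predict the ID leaf
    for parent, children in hierarchy.items():
        node_prob = sum(leaf_probs[leaves.index(c)] for c in children)
        if best_leaf in children and node_prob >= threshold:
            return parent                                  # OOD w.r.t. leaves, but a known parent
    return "root"                                          # nothing confident: most generic node

print(hierarchical_predict(np.array([0.9, 0.05, 0.03, 0.02])))   # -> dog
print(hierarchical_predict(np.array([0.45, 0.4, 0.1, 0.05])))    # -> animal
```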
Authors:Haoming Xu, Shuxun Wang, Yanqiu Zhao, Yi Zhong, Ziyan Jiang, Ningyuan Zhao, Shumin Deng, Huajun Chen, Ningyu Zhang
Abstract:
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
中文: ZJUKLAB团队针对SemEval-2025任务四提出了基于模型融合的遗忘系统,在26支队伍中荣获第二名,其方法在有效消除敏感内容的同时,揭示了当前评估指标的不足,并呼吁建立更全面的评估体系。
English: The ZJUKLAB team introduced a model merging-based unlearning system for SemEval-2025 Task 4, achieving second place by effectively removing sensitive content while highlighting limitations in current evaluation metrics and calling for improved assessment methods.
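A simplified sketch of TIES-style merging of task vectors (trim by magnitude, elect a sign per parameter, then average the agreeing entries). The density and sign-election rule below are assumptions, and the unlearning pipeline built around the merge is not shown.

```python
# Minimal sketch of TIES-style merging: trim, elect sign, disjoint mean.
import torch

def ties_merge(deltas, density=0.2):
    trimmed = []
    for d in deltas:                                         # 1) trim: keep top-density magnitudes
        k = max(1, int(density * d.numel()))
        thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    elected = torch.sign(stacked.sum(dim=0))                 # 2) elect a sign per parameter
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts             # 3) average only agreeing entries

base = torch.randn(16, 16)
task_vectors = [torch.randn(16, 16) * 0.1, torch.randn(16, 16) * 0.1]  # two specialized deltas
merged = base + ties_merge(task_vectors)
```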
Authors:Yusong Hu, Zichen Liang, Fei Yang, Qibin Hou, Xialei Liu, Ming-Ming Cheng
Abstract:
Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at https://github.com/Ethanhuhuhu/KAC.
中文: 本文提出基于KAN结构的Kolmogorov-Arnold分类器(KAC),通过采用样条函数和径向基函数增强稳定性与兼容性,在持续学习中替代线性分类器,在多项基准测试中均表现出性能提升和鲁棒性优势。
English: This paper introduces the Kolmogorov-Arnold Classifier (KAC), a novel classifier based on KAN structure that replaces linear classifiers in continual learning, demonstrating improved performance and robustness across various benchmarks through enhanced stability and compatibility with spline and radial basis functions.
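A minimal sketch of a classifier head built from per-feature radial basis functions, in the spirit of replacing a linear classifier with a KAN-style one. The center placement, bandwidth, and aggregation are illustrative assumptions rather than the exact KAC design.

```python
# Minimal sketch of an RBF-based (KAN-style) classifier head.
import torch

class RBFClassifier(torch.nn.Module):
    def __init__(self, in_dim, n_classes, n_centers=8):
        super().__init__()
        # a fixed grid of RBF centers per input feature
        self.centers = torch.nn.Parameter(torch.linspace(-2, 2, n_centers).repeat(in_dim, 1))
        self.gamma = 4.0
        self.weight = torch.nn.Parameter(torch.zeros(n_classes, in_dim, n_centers))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        # phi[b, d, c] = exp(-gamma * (x[b, d] - center[d, c])^2)
        phi = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.centers) ** 2)
        return torch.einsum("bdc,kdc->bk", phi, self.weight)  # class logits

clf = RBFClassifier(in_dim=64, n_classes=10)
print(clf(torch.randn(4, 64)).shape)  # torch.Size([4, 10])
```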
Authors:Ooha Lakkadi Reddy
Abstract:
This thesis employs a hybrid CNN-Transformer architecture, alongside a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (0.635) than to the Bronze Age Proto-Cuneiform (0.102) or Proto-Elamite (0.078).
Contrary to expectations, when measured through direct script-to-script embedding comparisons, the Indus script maps closer to Tibetan-Yi Corridor scripts with a mean cosine similarity of 0.930 (CI: [0.917, 0.942]) than to contemporaneous West Asian signaries, which recorded mean similarities of 0.887 (CI: [0.863, 0.911]) and 0.855 (CI: [0.818, 0.891]). Across dimensionality reduction and clustering methods, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts.
These computational findings align with observed pictorial parallels in numeral systems, gender markers, and iconographic elements. Archaeological evidence of contact networks along the ancient Shu-Shendu road, coinciding with the Indus Civilization's decline, provides a plausible transmission pathway. While alternate explanations cannot be ruled out, the specificity and consistency of similarities suggest more complex cultural transmission networks between South and East Asia than previously recognized.
中文: 本研究采用混合CNN-Transformer模型,发现印度河文字与藏彝走廊文字在视觉形态上存在显著相似性,其关联强度远超同期西亚文字,暗示了沿古代通道可能存在文化传播网络。
English: This study uses a hybrid CNN-Transformer model to reveal that the Indus Valley script shares significantly stronger visual and structural similarities with Tibetan-Yi Corridor scripts than with contemporaneous West Asian scripts, suggesting potential historical cultural transmission along ancient pathways.
Authors:Caspar Meijer, Jiyue Huang, Shreshtha Sharma, Elena Lazovik, Lydia Y. Chen
Abstract:
Federated learning (FL) for time series forecasting (TSF) enables clients with privacy-sensitive time series (TS) data to collaboratively learn accurate forecasting models, for example, in energy load prediction. Unfortunately, privacy risks in FL persist, as servers can potentially reconstruct clients' training data through gradient inversion attacks (GIA). Although GIA is demonstrated for image classification tasks, little is known about time series regression tasks. In this paper, we first conduct an extensive empirical study on inverting TS data across 4 TSF models and 4 datasets, identifying the unique challenges of reconstructing both observations and targets of TS data. We then propose TS-Inverse, a novel GIA that improves the inversion of TS data by (i) learning a gradient inversion model that outputs quantile predictions, (ii) a unique loss function that incorporates periodicity and trend regularization, and (iii) regularization according to the quantile predictions. Our evaluations demonstrate a remarkable performance of TS-Inverse, achieving at least a 2x-10x improvement in terms of the sMAPE metric over existing GIA methods on TS data. Code repository: https://github.com/Capsar/ts-inverse
Chinese: 针对时间序列预测中的联邦学习隐私风险,本研究提出TS-Inverse方法,通过引入分位数预测和正则化技术,显著提升了梯度反演攻击对时序数据的重构精度,相比现有方法性能提升2-10倍。
English: Federated learning for time series forecasting faces privacy risks from gradient inversion attacks, which this study addresses by proposing TS-Inverse, a novel method that significantly improves data reconstruction accuracy by incorporating quantile predictions and regularization techniques.
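A toy sketch of gradient inversion on a tiny linear forecaster with a simple periodicity regularizer, to illustrate the attack surface. TS-Inverse's learned quantile-prediction inversion model and full loss are not reproduced here.

```python
# Toy gradient-inversion sketch for a linear time-series forecaster with periodicity regularization.
import torch

model = torch.nn.Linear(24, 6)                        # forecast 6 steps from 24 observations
x_true = torch.sin(torch.arange(24).float() * 2 * torch.pi / 12).unsqueeze(0)
y_true = torch.sin(torch.arange(24, 30).float() * 2 * torch.pi / 12).unsqueeze(0)
true_grads = torch.autograd.grad(
    torch.nn.functional.mse_loss(model(x_true), y_true), list(model.parameters()))

x_hat = torch.randn_like(x_true, requires_grad=True)  # the attacker's guess of the client data
y_hat = torch.randn_like(y_true, requires_grad=True)
opt = torch.optim.Adam([x_hat, y_hat], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    grads = torch.autograd.grad(
        torch.nn.functional.mse_loss(model(x_hat), y_hat), list(model.parameters()),
        create_graph=True)
    grad_loss = sum(((g - tg) ** 2).sum() for g, tg in zip(grads, true_grads))
    period_reg = ((x_hat[:, 12:] - x_hat[:, :-12]) ** 2).mean()   # encourage 12-step periodicity
    (grad_loss + 0.1 * period_reg).backward()
    opt.step()
```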
Authors:Venkat Adithya Amula, Sunayana Samavedam, Saurabh Saini, Avani Gupta, Narayanan P J
Abstract:
Deep learning models are susceptible to {\em backdoor attacks} involving malicious attackers perturbing a small subset of training data with a {\em trigger} to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger. This is done using a novel sanitization loss in a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: https://venkatadithya9.github.io/pgbd.github.io/.
Authors:Syed Ariff Syed Hesham, Yun Liu, Guolei Sun, Henghui Ding, Jing Yang, Ender Konukoglu, Xue Geng, Xudong Jiang
Abstract:
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at https://github.com/Ashesham/TV3S.git.
Chinese: TV3S架构通过引入时序视频状态空间共享模型和选择性门控机制,有效跨帧传播信息以增强视频语义分割,在基准数据集上实现了优越性能,同时平衡了准确性与效率。
English: The TV3S architecture introduces a temporal video state space sharing model with a selective gating mechanism that enhances video semantic segmentation by efficiently propagating information across frames, achieving superior performance on benchmark datasets while balancing accuracy and efficiency.
Authors:Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Abstract:
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) Catastrophic forgetting of previously learned tasks, leading to error accumulation in the server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper), where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at: https://github.com/jinlab-imvr/FedDAH.
中文: 联邦持续学习在动态医疗场景下面临灾难性遗忘和优化偏差的挑战,FedDAH方法通过动态分配超网络和自适应模型重校准,有效提升了跨客户端协同学习能力。
English: Federated continual learning (FCL) faces challenges of catastrophic forgetting and biased optimization in dynamic medical scenarios, which the proposed FedDAH method addresses through a dynamic allocation hypernetwork and adaptive model recalibration to enhance collaborative learning across clients.
Authors:Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Abstract:
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an ''Aha moment'', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
中文摘要:本研究分析了R1-Zero训练的核心组件,发现GRPO存在优化偏差并提出无偏的Dr. GRPO方法,在保持推理性能的同时提升效率,最终通过极简配方使7B模型在AIME 2024上达到43.3%的最新最优准确率。
English Summary: This study analyzes R1-Zero training components and identifies optimization bias in GRPO, introducing Dr. GRPO to improve efficiency while achieving state-of-the-art 43.3% accuracy on AIME 2024 with a minimalist 7B model recipe.
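A minimal sketch contrasting GRPO-style group-normalized advantages and per-response length normalization with a variant that drops both, in the spirit of Dr. GRPO. The exact objective terms in the paper may differ; this only illustrates where the length-related bias enters.

```python
# Minimal sketch of GRPO-style vs. "unbiased" advantage/loss computation for a sampled group.
import torch

def grpo_advantages(rewards, eps=1e-6):
    return (rewards - rewards.mean()) / (rewards.std() + eps)   # group mean/std normalization

def dr_grpo_advantages(rewards):
    return rewards - rewards.mean()                             # no std normalization

def policy_loss(logprobs, lengths, adv, normalize_by_length):
    # logprobs: summed token log-probs per sampled response; lengths: token counts
    per_resp = -adv * logprobs
    if normalize_by_length:
        per_resp = per_resp / lengths        # per-response length normalization (the discussed bias)
    return per_resp.mean()

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logprobs = torch.tensor([-12.0, -30.0, -8.0, -15.0])
lengths = torch.tensor([60.0, 150.0, 40.0, 75.0])
print(policy_loss(logprobs, lengths, grpo_advantages(rewards), True))     # GRPO-style
print(policy_loss(logprobs, lengths, dr_grpo_advantages(rewards), False)) # Dr. GRPO-style
```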
Authors:Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang
Abstract:
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html
Chinese: 本文提出了零样本音视频编辑任务,构建了AvED-Bench评估数据集,并开发了AvED框架来实现无需训练的跨模态同步编辑。
English: This paper introduces zero-shot audio-video editing, creates the AvED-Bench evaluation dataset, and proposes the AvED framework to achieve synchronized cross-modal edits without training.
Authors:Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at https://github.com/zjunlp/EasyEdit/blob/main/examples/ADSEdit.md.
中文摘要:本研究提出知识编辑方法和ADS-Edit数据集,通过针对性修正模型行为来解决自动驾驶中交通知识误解与复杂路况问题,无需完整重新训练即可提升多模态大模型性能。
English Summary: This study introduces Knowledge Editing and the ADS-Edit dataset to enhance Large Multimodal Models for autonomous driving by addressing traffic knowledge gaps and complex road scenarios without full retraining.
Authors:Masane Fuchi, Tomohiro Takagi
Abstract:
Score-based or diffusion models generate high-quality tabular data, surpassing GAN-based and VAE-based models. However, these methods require substantial training time. In this paper, we introduce RecTable, which uses rectified flow modeling, as applied in text-to-image and text-to-video generation. RecTable features a simple architecture consisting of a few stacked gated linear unit blocks. Additionally, our training strategies are also simple, incorporating a mixed-type noise distribution and a logit-normal timestep distribution. Our experiments demonstrate that RecTable achieves competitive performance compared to several state-of-the-art diffusion and score-based models while reducing the required training time. Our code is available at https://github.com/fmp453/rectable.
Chinese: RecTable采用简化的整流流模型,具有简单的架构和训练策略,在表格数据生成中实现了与最先进的扩散和基于分数的模型相竞争的性能,同时显著减少了训练时间。
English: RecTable introduces a simplified rectified flow model with a straightforward architecture and training strategies, achieving competitive performance in tabular data generation while significantly reducing training time compared to state-of-the-art diffusion and score-based models.
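Rectified-flow training for tabular rows can be illustrated in a few lines: sample a logit-normal timestep, linearly interpolate between noise and data, and regress the model output onto the straight-line velocity. This is a generic sketch under the assumptions stated in the abstract (logit-normal timesteps; the mixed-type noise is simplified here to Gaussian), and `velocity_net` is a hypothetical stand-in for RecTable's gated-linear-unit stack.

```python
import torch

def rectified_flow_step(velocity_net, x1, optimizer):
    """One rectified-flow training step on a batch of tabular rows x1.
    velocity_net takes (x_t, t) and predicts the velocity x1 - x0."""
    x0 = torch.randn_like(x1)                       # simplified Gaussian noise source
    t = torch.sigmoid(torch.randn(x1.size(0), 1))   # logit-normal timestep in (0, 1)
    xt = (1 - t) * x0 + t * x1                      # straight-line interpolation
    target_velocity = x1 - x0                       # constant velocity along the line
    pred = velocity_net(xt, t)
    loss = torch.mean((pred - target_velocity) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```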
Authors:Yankai Chen, Taotao Wang, Yixiang Fang, Yunyu Xiao
Abstract:
Node importance estimation, a classical problem in network analysis, underpins various web applications. Previous methods either exploit intrinsic topological characteristics, e.g., graph centrality, or leverage additional information, e.g., data heterogeneity, for node feature enhancement. However, these methods follow the supervised learning setting, overlooking the fact that ground-truth node-importance data are usually partially labeled in practice. In this work, we propose the first semi-supervised node importance estimation framework, i.e., EASING, to improve learning quality for unlabeled data in heterogeneous graphs. Different from previous approaches, EASING explicitly captures uncertainty to reflect the confidence of model predictions. To jointly estimate the importance values and uncertainties, EASING incorporates DJE, a deep encoder-decoder neural architecture. DJE introduces distribution modeling for graph nodes, where the distribution representations derive both importance and uncertainty estimates. Additionally, DJE facilitates effective pseudo-label generation for the unlabeled data to enrich the training samples. Based on labeled and pseudo-labeled data, EASING develops effective semi-supervised heteroscedastic learning with varying node uncertainty regularization. Extensive experiments on three real-world datasets highlight the superior performance of EASING compared to competing methods. Codes are available via https://github.com/yankai-chen/EASING.
Chinese: 本文提出EASING,首个半监督节点重要性评估框架,通过深度编码器-解码器架构对异质图中未标记数据进行分布建模和伪标签生成,联合估计节点重要性及预测不确定性。
English: This paper introduces EASING, a semi-supervised framework that estimates node importance and uncertainty in heterogeneous graphs using a deep encoder-decoder architecture to handle partially labeled data through distribution modeling and pseudo-label generation.
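The heteroscedastic learning mentioned above can be grounded with a standard Gaussian negative log-likelihood in which the network predicts both an importance value and its variance, so low-confidence (high-variance) nodes contribute less to the error term. This is a generic illustration of uncertainty-weighted regression, not EASING's exact DJE objective.

```python
import torch

def heteroscedastic_nll(pred_mean, pred_logvar, target):
    """Gaussian NLL: nodes predicted with high variance are down-weighted
    in the squared-error term, which is how uncertainty can regularize
    learning on noisy or pseudo-labeled importance values."""
    inv_var = torch.exp(-pred_logvar)
    return torch.mean(0.5 * inv_var * (pred_mean - target) ** 2 + 0.5 * pred_logvar)

mean = torch.tensor([0.8, 0.2, 0.5], requires_grad=True)
logvar = torch.tensor([0.0, 2.0, -1.0], requires_grad=True)  # second node is uncertain
target = torch.tensor([1.0, 0.9, 0.4])
loss = heteroscedastic_nll(mean, logvar, target)
loss.backward()
```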
Authors:Gongzhu Yin, Hongli Zhang, Yuchen Yang, Yi Luo
Abstract:
N-ary relational facts represent semantic correlations among more than two entities. While recent studies have developed link prediction (LP) methods to infer missing relations for knowledge graphs (KGs) containing n-ary relational facts, they are generally limited to transductive settings. Fully inductive settings, where predictions are made on previously unseen entities, remain a significant challenge. As existing methods are mainly entity embedding-based, they struggle to capture entity-independent logical rules. To fill in this gap, we propose an n-ary subgraph reasoning framework for fully inductive link prediction (ILP) on n-ary relational facts. This framework reasons over local subgraphs and has a strong inductive inference ability to capture n-ary patterns. Specifically, we introduce a novel graph structure, the n-ary semantic hypergraph, to facilitate subgraph extraction. Moreover, we develop a subgraph aggregating network, NS-HART, to effectively mine complex semantic correlations within subgraphs. Theoretically, we provide a thorough analysis from the score function optimization perspective to shed light on NS-HART's effectiveness for n-ary ILP tasks. Empirically, we conduct extensive experiments on a series of inductive benchmarks, including transfer reasoning (with and without entity features) and pairwise subgraph reasoning. The results highlight the superiority of the n-ary subgraph reasoning framework and the exceptional inductive ability of NS-HART. The source code of this paper has been made publicly available at https://github.com/yin-gz/Nary-Inductive-SubGraph.
中文: 本文提出了一种用于n元关系事实全归纳链接预测的n元子图推理框架,通过创新的n元语义超图和NS-HART网络有效挖掘复杂语义关联,展现出卓越的归纳推理能力。
English: This paper introduces an n-ary subgraph reasoning framework for fully inductive link prediction on n-ary relational facts, which utilizes a novel n-ary semantic hypergraph and the NS-HART network to effectively capture complex semantic patterns and demonstrate strong inductive capabilities.
Authors:Trung Duc Ha, Sidney Bender
Abstract:
Counterfactual explanations have been successfully applied to create human-interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and revealing spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel-space counterfactuals are more sparse, while latent-space counterfactuals are of higher quality and allow bigger semantic changes.
中文摘要:针对图像回归任务,本文采用基于扩散的像素空间和潜在空间模型生成反事实解释,虽能提供对模型决策的可解释性洞察并揭示伪相关性,但在稀疏性和质量方面相比分类任务更具挑战性。
English Summary: Counterfactual explanations for image regression tasks are generated using diffusion-based models in pixel and latent spaces, producing realistic insights into model decisions while revealing challenges in achieving sparsity and quality compared to classification tasks.
Authors:Carlos Gomes, Benedikt Blumenstiel, Joao Lucas de Sousa Almeida, Pedro Henrique de Oliveira, Paolo Fraccaro, Francesc Marti Escofet, Daniela Szwarcman, Naomi Simumba, Romeo Kienzler, Bianca Zadrozny
Abstract:
TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data. It integrates domain-specific data modules, pre-defined tasks, and a modular model factory that pairs any backbone with diverse decoder heads. These components allow researchers and practitioners to fine-tune supported models in a no-code fashion by simply editing a training configuration. By consolidating best practices for model development and incorporating the automated hyperparameter optimization extension Iterate, TerraTorch reduces the expertise and time required to fine-tune or benchmark models on new Earth Observation use cases. Furthermore, TerraTorch directly integrates with GEO-Bench, allowing for systematic and reproducible benchmarking of Geospatial Foundation Models. TerraTorch is open sourced under Apache 2.0, available at https://github.com/IBM/terratorch, and can be installed via pip install terratorch.
中文: TerraTorch 是一个基于 PyTorch Lightning 的工具包,专为卫星、气象和气候数据设计,用于微调和基准测试地理空间基础模型,支持无代码定制和高效模型优化。
English: TerraTorch is a PyTorch Lightning-based toolkit designed for fine-tuning and benchmarking geospatial foundation models using satellite, weather, and climate data, enabling no-code customization and efficient model optimization.
Authors:Henrik Christiansen, Takashi Maruyama, Federico Errica, Viktor Zaverkin, Makoto Takamoto, Francesco Alesiani
Abstract:
We present an end-to-end differentiable molecular simulation framework (DIMOS) for molecular dynamics and Monte Carlo simulations. DIMOS easily integrates machine-learning-based interatomic potentials and implements classical force fields including particle-mesh Ewald electrostatics. Thanks to its modularity, both classical and machine-learning-based approaches can be easily combined into a hybrid description of the system (ML/MM). By supporting key molecular dynamics features such as efficient neighborlists and constraint algorithms for larger time steps, the framework bridges the gap between hand-optimized simulation engines and the flexibility of a PyTorch implementation. The superior performance and high versatility are probed in different benchmarks and applications, with speed-up factors of up to $170\times$. The advantage of differentiability is demonstrated by an end-to-end optimization of the proposal distribution in a Markov Chain Monte Carlo simulation based on Hamiltonian Monte Carlo. Using these optimized simulation parameters, a $3\times$ acceleration is observed in comparison to ad-hoc chosen simulation parameters. The code is available at https://github.com/nec-research/DIMOS.
中文: DIMOS是一个端到端可微分分子模拟框架,它融合了机器学习势与经典力场,实现了高达170倍的加速,并通过优化蒙特卡洛模拟参数展示了3倍的效率提升。
English: DIMOS is an end-to-end differentiable molecular simulation framework that integrates machine-learning potentials with classical force fields, achieving up to 170x speed-up and demonstrating a 3x acceleration through optimized parameters in Monte Carlo simulations.
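The value of an end-to-end differentiable simulator is that forces and downstream observables can be obtained by automatic differentiation of the potential energy. The toy sketch below shows the idea for a Lennard-Jones pair potential in PyTorch; it is not DIMOS code, just an illustration of differentiating an energy with respect to positions (and, in the same way, with respect to simulation parameters).

```python
import torch

def lennard_jones_energy(positions, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a small particle cluster (no cutoff, no PBC)."""
    diff = positions.unsqueeze(0) - positions.unsqueeze(1)          # (N, N, 3) pair vectors
    dist = diff.norm(dim=-1) + torch.eye(positions.size(0)) * 1e9   # mask self-pairs
    sr6 = (sigma / dist) ** 6
    # each unordered pair is counted twice in the double sum, hence the factor 4/2 = 2
    return 2.0 * (epsilon * (sr6 ** 2 - sr6)).sum()

positions = torch.randn(8, 3, requires_grad=True)
energy = lennard_jones_energy(positions)
forces = -torch.autograd.grad(energy, positions)[0]  # forces via automatic differentiation
```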
Authors:Jinghui Yuan, Fangyuan Xie, Feiping Nie, Xuelong Li
Abstract:
The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several methods of Retraction, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold is a generalization of the doubly stochastic manifold, and optimization on it is much faster than existing methods on the doubly stochastic manifold, which have a complexity of \( \mathcal{O}(n^3) \), whereas RIM manifold optimization is \( \mathcal{O}(n) \) and often yields better results. We conducted extensive experiments with millions of variables, including image denoising, to support our conclusions. Applying the RIM manifold to Ratio Cut, we provide a rigorous convergence proof and achieve clustering results that outperform state-of-the-art methods. Our code is available \href{https://github.com/Yuan-Jinghui/Riemannian-Optimization-on-Relaxed-Indicator-Matrix-Manifold}{here}.
中文摘要:作者提出了一种松弛指示器矩阵流形(RIM流形),实现了线性复杂度的黎曼优化,在聚类和图像去噪应用中超越了现有方法的性能。
English Summary: The authors introduce a relaxed indicator matrix manifold (RIM manifold) that enables efficient Riemannian optimization with linear complexity, outperforming existing methods in clustering and image denoising applications.
Authors:Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Abstract:
Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.
中文: VPO是一个基于无害性、准确性和有益性三原则的提示优化框架,能忠实保留用户意图并显著提升生成视频的安全性和质量。
English: VPO is a principled framework that optimizes text prompts for video generation by ensuring harmlessness, accuracy, and helpfulness, significantly improving safety, alignment, and video quality across models.
Authors:Haoran Zheng, Renchi Yang, Jianliang Xu
Abstract:
Given a graph $G$ and a seed node $v_s$, the objective of local graph clustering (LGC) is to identify a subgraph $C_s \in G$ (a.k.a. local cluster) surrounding $v_s$ in time roughly linear with the size of $C_s$. This approach yields personalized clusters without needing to access the entire graph, which makes it highly suitable for numerous applications involving large graphs. However, most existing solutions merely rely on the topological connectivity between nodes in $G$, rendering them vulnerable to missing or noisy links that are commonly present in real-world graphs.
To address this issue, this paper resorts to leveraging the complementary nature of graph topology and node attributes to enhance local clustering quality. To effectively exploit the attribute information, we first formulate the LGC as an estimation of the bidirectional diffusion distribution (BDD), which is specialized for capturing the multi-hop affinity between nodes in the presence of attributes. Furthermore, we propose LACA, an efficient and effective approach for LGC that achieves superb empirical performance on multiple real datasets while maintaining strong locality. The core components of LACA include (i) a fast and theoretically-grounded preprocessing technique for node attributes, (ii) an adaptive algorithm for diffusing any vectors over $G$ with rigorous theoretical guarantees and expedited convergence, and (iii) an effective three-step scheme for BDD approximation. Extensive experiments, comparing 17 competitors on 8 real datasets, show that LACA outperforms all competitors in terms of result quality measured against ground truth local clusters, while also being up to orders of magnitude faster. The code is available at https://github.com/HaoranZ99/alac.
中文: 本文提出LACA方法,通过结合图拓扑与节点属性来提升局部图聚类质量,在真实数据集上展现出卓越的聚类效果和运算效率。
English: This paper introduces LACA, a local graph clustering method that integrates both graph topology and node attributes to improve clustering quality and efficiency, demonstrating superior performance and speed on real datasets.
Authors:Yingdong Shi, Changming Li, Yifan Wang, Yongxiang Zhao, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Abstract:
Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.
Authors:Chenwei Zhang, Khanh Dao Duc
Abstract:
Enhancing cryogenic electron microscopy (cryo-EM) 3D density maps at intermediate resolution (4-8 Å) is crucial in protein structure determination. Recent advances in deep learning have led to the development of automated approaches for enhancing experimental cryo-EM density maps. Yet, these methods are not optimized for intermediate-resolution maps and rely on map density features alone. To address this, we propose CryoSAMU, a novel method designed to enhance 3D cryo-EM density maps of protein structures using structure-aware multimodal U-Nets and trained on curated intermediate-resolution density maps. We comprehensively evaluate CryoSAMU across various metrics and demonstrate its competitive performance compared to state-of-the-art methods. Notably, CryoSAMU achieves significantly faster processing speed, showing promise for future practical applications. Our code is available at https://github.com/chenwei-zhang/CryoSAMU.
中文: CryoSAMU是一种新型深度学习方法,通过结构感知多模态U-Net增强中分辨率冷冻电镜密度图,相比现有方法具有更优性能和显著更快的处理速度。
English: CryoSAMU is a novel deep learning method that enhances intermediate-resolution cryo-EM density maps using structure-aware multimodal U-Nets, achieving competitive performance and significantly faster processing speeds compared to existing approaches.
Authors:Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Abstract:
Time series classification is usually regarded as a distinct task from tabular data classification due to the importance of temporal information. However, in this paper, by performing permutation tests that disrupt temporal information on the UCR time series classification archive, the most widely used benchmark for time series classification, we identify a significant proportion of datasets where temporal information has little to no impact on classification. Many of these datasets are tabular in nature or rely mainly on tabular features, leading to potentially biased evaluations of time series classifiers focused on temporal information. To address this, we propose UCR Augmented, a benchmark based on the UCR time series classification archive designed to evaluate classifiers' ability to extract and utilize temporal information. Testing classifiers from seven categories on this benchmark revealed notable shifts in performance rankings. Some previously overlooked approaches perform well, while others see their performance decline significantly when temporal information is crucial. UCR Augmented provides a more robust framework for assessing time series classifiers, ensuring fairer evaluations. Our code is available at https://github.com/YunruiZhang/Revisit-Time-Series-Classification-Benchmark.
中文: 本研究揭示UCR时间序列数据集中许多数据集无需依赖时序信息,提出UCR Augmented基准以公平评估分类器的时序特征提取能力,并观察到性能排名的显著变化。
English: This study reveals that temporal information is non-essential for many UCR time series datasets, proposing UCR Augmented to fairly evaluate classifiers' temporal feature utilization and observing significant performance ranking changes.
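The permutation test described above is easy to reproduce in spirit: shuffle the time axis of every series with one fixed permutation, retrain the same classifier, and compare accuracies; if the gap is negligible, temporal order carries little class information for that dataset. The sketch below uses a ridge classifier on synthetic data as a stand-in for a UCR dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for a UCR dataset: 200 series of length 100, 2 classes.
X = rng.normal(size=(200, 100)) + np.linspace(0, 1, 100) * rng.integers(0, 2, size=(200, 1))
y = (X.mean(axis=1) > 0.25).astype(int)

def accuracy(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    return RidgeClassifier().fit(Xtr, ytr).score(Xte, yte)

perm = rng.permutation(X.shape[1])         # one fixed permutation of the time axis
acc_original = accuracy(X, y)
acc_permuted = accuracy(X[:, perm], y)     # temporal order destroyed
print(acc_original, acc_permuted)          # a small gap suggests temporal info is unused
```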
Authors:Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen
Abstract:
We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions.
LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
中文: LogQuant是一种创新的2位KV缓存量化技术,可在保持高性能的同时大幅降低大语言模型推理的内存占用,在同等压缩率下将复杂任务的准确率提升40%至200%,并提高吞吐量25%。
English: LogQuant is an innovative 2-bit KV Cache quantization method that significantly reduces memory usage in LLM inference while maintaining high performance, achieving up to 25% throughput improvement and 40-200% accuracy gains on complex tasks.
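The abstract only states that a log-based filter decides which cache entries stay at higher precision. One hedged reading is that token positions are retained increasingly sparsely the further they lie from the end of the context; the sketch below implements that reading (log-spaced position selection plus a naive 2-bit quantizer for the rest). The selection rule and function names are assumptions, not LogQuant's actual code.

```python
import numpy as np

def log_spaced_keep_indices(seq_len, num_keep):
    """Keep recent positions densely and older positions at log-spaced gaps (assumed rule)."""
    dists = np.unique(np.geomspace(1, seq_len, num=num_keep).astype(int))  # distances from newest token
    return np.clip(seq_len - dists, 0, seq_len - 1)

def quantize_2bit(x):
    """Naive symmetric 2-bit quantization of a KV slice (4 levels)."""
    scale = np.abs(x).max() / 1.5 + 1e-8
    return np.clip(np.round(x / scale), -2, 1) * scale

kv = np.random.randn(1024, 64)               # (tokens, head_dim) for one head
keep = log_spaced_keep_indices(1024, 128)    # positions retained at full precision
compressed = quantize_2bit(kv)
compressed[keep] = kv[keep]                  # splice the kept entries back in
```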
Authors:Mengqi Lou, Kabir Aladin Verchand, Sara Fridovich-Keil, Ashwin Pananjady
Abstract:
We consider the problem of signal reconstruction for computed tomography (CT) under a nonlinear forward model that accounts for exponential signal attenuation, a polychromatic X-ray source, general measurement noise (e.g. Poisson shot noise), and observations acquired over multiple wavelength windows. We develop a simple iterative algorithm for single-material reconstruction, which we call EXACT (EXtragradient Algorithm for Computed Tomography), based on formulating our estimate as the fixed point of a monotone variational inequality. We prove guarantees on the statistical and computational performance of EXACT under practical assumptions on the measurement process. We also consider a recently introduced variant of this model with Gaussian measurements, and present sample and iteration complexity bounds for EXACT that improve upon those of existing algorithms. We apply our EXACT algorithm to a CT phantom image recovery task and show that it often requires fewer X-ray projection exposures, lower source intensity, and less computation time to achieve similar reconstruction quality to existing methods.
中文摘要:EXACT算法针对非线性前向模型的计算机断层扫描重建问题,相比现有方法能以更少的X射线投影曝光、更低辐射剂量和更短计算时间实现相近重建质量。
English Summary: The EXACT algorithm is developed for computed tomography reconstruction under a nonlinear forward model, offering improved statistical and computational performance with fewer exposures and less computation time compared to existing methods.
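The "extragradient" in EXACT refers to a classical two-step scheme for monotone variational inequalities: take a provisional step using the operator at the current point, then a corrected step using the operator evaluated at the provisional point. The sketch below shows the generic iteration, not the paper's CT-specific operator.

```python
import numpy as np

def extragradient(F, x0, step=0.1, iters=200):
    """Generic extragradient iteration for finding x with F(x) = 0
    (a solution of the variational inequality defined by a monotone F)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_half = x - step * F(x)      # prediction step
        x = x - step * F(x_half)      # correction step uses F at the predicted point
    return x

# Example: F is the gradient of a simple quadratic, so the solution is x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
F = lambda x: A @ (x - b)
print(extragradient(F, np.zeros(2)))  # converges toward b
```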
Authors:Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Abstract:
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
中文摘要:SuperFlow++ 是一种新颖的激光雷达表示学习框架,通过整合连续激光雷达-摄像头对的时空线索,在11个数据集上超越现有最优方法,为自动驾驶感知建立了新基准。
English Summary: SuperFlow++ is a novel LiDAR representation learning framework that integrates spatiotemporal cues using consecutive LiDAR-camera pairs, outperforming state-of-the-art methods across 11 datasets while establishing new benchmarks for autonomous driving perception.
Authors:Aaron Serianni, Tyler Zhu, Olga Russakovsky, Vikram V. Ramaswamy
Abstract:
Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
Chinese: Attention-IoU指标通过注意力图揭示模型内部表征中的偏见并识别导致偏见的图像特征,在Waterbirds和CelebA数据集上的验证表明,它能发现超出准确率差异的相关性和数据标签中未体现的潜在混杂变量。
English: The Attention-IoU metric uses attention maps to uncover biases in a model's internal representations and identify image features causing them, validated on datasets like Waterbirds and CelebA to reveal correlations and confounding variables beyond accuracy disparities.
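A soft intersection-over-union between two attention heatmaps (for example, the map for a target attribute and the map for a protected attribute or an object mask) captures the spirit of the metric: high overlap suggests the model attends to the same image regions for both. The sketch below is an illustrative soft-IoU, not the paper's exact score definitions.

```python
import numpy as np

def soft_iou(map_a, map_b, eps=1e-8):
    """Soft IoU between two non-negative attention maps of the same shape."""
    a = map_a / (map_a.sum() + eps)          # normalize to comparable mass
    b = map_b / (map_b.sum() + eps)
    inter = np.minimum(a, b).sum()
    union = np.maximum(a, b).sum()
    return inter / (union + eps)

attr_attention = np.random.rand(14, 14)       # attention for a target attribute
protected_attention = np.random.rand(14, 14)  # attention for a protected attribute
print(soft_iou(attr_attention, protected_attention))
```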
Authors:Matthew Greenig, Haowen Zhao, Vladimir Radenkovic, Aubin Ramon, Pietro Sormanni
Abstract:
Designing antibody sequences to better resemble those observed in natural human repertoires is a key challenge in biologics development. We introduce IgCraft: a multi-purpose model for paired human antibody sequence generation, built on Bayesian Flow Networks. IgCraft presents one of the first unified generative modeling frameworks capable of addressing multiple antibody sequence design tasks with a single model, including unconditional sampling, sequence inpainting, inverse folding, and CDR motif scaffolding. Our approach achieves competitive results across the full spectrum of these tasks while constraining generation to the space of human antibody sequences, exhibiting particular strengths in CDR motif scaffolding (grafting) where we achieve state-of-the-art performance in terms of humanness and preservation of structural properties. By integrating previously separate tasks into a single scalable generative model, IgCraft provides a versatile platform for sampling human antibody sequences under a variety of contexts relevant to antibody discovery and engineering. Model code and weights are publicly available at https://github.com/mgreenig/IgCraft.
中文: IgCraft 提出了一种基于贝叶斯流网络的多功能生成模型,能够统一执行多种抗体序列设计任务,尤其在CDR基序支架方面表现出色,确保了序列的人源化和结构特性的保留。
English: IgCraft introduces a unified generative model using Bayesian Flow Networks to design human-like antibody sequences, excelling in tasks like CDR motif scaffolding while ensuring structural integrity and naturalness.
Authors:Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
Abstract:
Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
Chinese: PAVE是一种灵活的框架,通过为预训练的视频大语言模型添加轻量级适配器,使其能够处理包含音频和3D线索等辅助信号的任务,在显著提升性能的同时仅增加极少的计算开销。
English: PAVE is a flexible framework that enhances pre-trained Video LLMs by adding lightweight adapters to handle tasks with side-channel signals like audio and 3D cues, significantly improving performance with minimal computational overhead.
Authors:Vladan Stojnić, Yannis Kalantidis, Jiří Matas, Giorgos Tolias
Abstract:
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
Chinese: LPOSS+ 提出了一种无需训练的开集词汇语义分割方法,通过像素级标签传播优化视觉语言模型的预测结果,并利用视觉模型捕捉模态内关系,在不依赖窗口处理的情况下实现了最先进的性能。
English: LPOSS+ introduces a training-free method for open-vocabulary semantic segmentation by refining Vision-and-Language Model predictions through pixel-level label propagation and leveraging Vision Models to capture intra-modal relationships, achieving state-of-the-art accuracy without window-based processing.
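Label propagation over patch affinities is the core operation: build a similarity graph from vision-model patch features, symmetrically normalize it, and solve the standard closed-form propagation so each patch's class scores are smoothed by its neighbors. Below is a minimal sketch of that propagation step, not the full LPOSS+ pipeline; the variable names are illustrative.

```python
import numpy as np

def propagate_labels(features, initial_scores, alpha=0.9, knn=8):
    """Standard label propagation: scores F solve (I - alpha * S) F = (1 - alpha) * Y."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T
    # sparsify to k nearest neighbours (skip self at index 0) and symmetrize
    idx = np.argsort(-sim, axis=1)[:, 1:knn + 1]
    W = np.zeros_like(sim)
    np.put_along_axis(W, idx, np.take_along_axis(sim, idx, axis=1), axis=1)
    W = np.maximum(W, W.T)
    d = W.sum(axis=1) + 1e-8
    S = W / np.sqrt(d[:, None] * d[None, :])          # symmetric normalization
    n = S.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * initial_scores)

patch_feats = np.random.randn(196, 64)     # e.g., 14x14 patches from a vision model
vlm_scores = np.random.rand(196, 5)        # initial per-patch class scores from a VLM
refined = propagate_labels(patch_feats, vlm_scores)
```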
Authors:Jan Kohút, Martin Dočekal, Michal Hradiš, Marek Vaško
Abstract:
Manual digitization of bibliographic metadata is time-consuming and labor-intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including Llama 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are available at: https://github.com/DCGM/biblio-dataset
Chinese: BiblioPage数据集通过提供2000份带注释的专著扉页,解决了书目元数据提取领域专用资源匮乏的问题,评估显示基于Transformer的OCR和视觉大模型最高F1值达67%,为文档理解任务提供了基准。
English: The BiblioPage dataset addresses the lack of dedicated resources for bibliographic metadata extraction by providing 2,000 annotated monograph title pages, with evaluations showing transformer-based OCR and visual language models achieving up to 67 F1 score, serving as a benchmark for document understanding tasks.
Authors:Mohammad Daffa Robani, Paul Saves, Pramudita Satria Palar, Lavi Rizki Zuhal, Joseph Morlier
Abstract:
Surrogate models are of high interest for many engineering applications, serving as cheap-to-evaluate, time-efficient approximations of black-box functions that help engineers and practitioners make decisions and understand complex systems. As such, the need for explainability methods is rising and many studies have been performed to facilitate knowledge discovery from surrogate models. To respond to these enquiries, this paper introduces SMT-EX, an enhancement of the open-source Python Surrogate Modeling Toolbox (SMT) that integrates explainability techniques into a state-of-the-art surrogate modelling framework. More precisely, SMT-EX includes three key explainability methods: Shapley Additive Explanations, Partial Dependence Plots, and Individual Conditional Expectations. A dedicated explainability dependency for SMT has been developed for this purpose, which can be easily activated once the surrogate model is built, offering a user-friendly and efficient tool for swift insight extraction. The effectiveness of SMT-EX is showcased through two test cases. The first case is a 10-variable wing weight problem with purely continuous variables and the second one is a 3-variable mixed-categorical cantilever beam bending problem. Relying on SMT-EX analyses for these problems, we demonstrate its versatility in addressing a diverse range of problem characteristics. SMT-Explainability is freely available on GitHub: https://github.com/SMTorg/smt-explainability.
中文: 本文介绍了SMT-EX,作为Python代理建模工具箱的增强版本,它集成了三种关键可解释性方法,通过用户友好的分析工具帮助工程师理解复杂系统。
English: This paper introduces SMT-EX, an enhanced version of the Python Surrogate Modeling Toolbox that integrates three key explainability methods to help engineers understand complex systems through user-friendly analysis tools.
Authors:Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng
Abstract:
Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need of real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.
Authors:Yuxuan Hu, Xiaodong Chen, Cuiping Li, Hong Chen, Jing Zhang
Abstract:
Large Language Models (LLMs) excel in diverse applications but suffer inefficiency due to massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) due to activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers to additional dimensions kept in full precision while quantizing the remaining components to 4-bit. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% ~ 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at \href{https://github.com/hyx1999/Quad}{this repository}.
中文摘要:QUAD框架利用奇异值分解抑制激活异常值,通过参数高效微调实现中型大语言模型的高效4位量化,同时保持高精度。
English Summary: The QUAD framework uses Singular Value Decomposition to suppress activation outliers, enabling efficient 4-bit quantization of medium-sized LLMs while maintaining high accuracy through parameter-efficient fine-tuning.
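A hedged sketch of the rotation-plus-mixed-precision idea: estimate principal activation directions from calibration data, rotate activations into that basis so large-magnitude components concentrate in a few leading dimensions, keep those dimensions in full precision, and 4-bit-quantize the rest. The variable names and the simple symmetric quantizer below are illustrative, not QUAD's implementation.

```python
import numpy as np

def calibrate_rotation(calib_acts):
    """Orthogonal basis from calibration activations (rows = tokens, cols = channels)."""
    _, _, vt = np.linalg.svd(calib_acts, full_matrices=False)
    return vt.T                                    # columns are singular directions, P^T P = I

def quantize_4bit(x):
    scale = np.abs(x).max() / 7 + 1e-8             # symmetric int4: levels in [-8, 7]
    return np.clip(np.round(x / scale), -8, 7) * scale

def quad_like_transform(acts, P, num_outlier_dims=8):
    rotated = acts @ P                             # rotate into the calibrated basis
    rotated[:, num_outlier_dims:] = quantize_4bit(rotated[:, num_outlier_dims:])
    return rotated                                 # leading dims stay full precision

calib = np.random.randn(512, 128)
P = calibrate_rotation(calib)
acts = np.random.randn(32, 128)
mixed_precision_acts = quad_like_transform(acts, P)
```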
Authors:Songyi Gao, Zuolin Tu, Rong-Jun Qin, Yi-Hao Sun, Xiong-Hui Chen, Yang Yu
Abstract:
Offline reinforcement learning (RL) aims to learn from historical data without requiring (costly) access to the environment. To facilitate offline RL research, we previously introduced NeoRL, which highlighted that datasets from real-world tasks are often conservative and limited. With years of experience applying offline RL to various domains, we have identified additional real-world challenges. These include extremely conservative data distributions produced by deployed control systems, delayed action effects caused by high-latency transitions, external factors arising from the uncontrollable variance of transitions, and global safety constraints that are difficult to evaluate during the decision-making process. These challenges are underrepresented in previous benchmarks but frequently occur in real-world tasks. To address this, we constructed the extended Near Real-World Offline RL Benchmark (NeoRL-2), which consists of 7 datasets from 7 simulated tasks along with their corresponding evaluation simulators. Benchmarking results from state-of-the-art offline RL approaches demonstrate that current methods often struggle to outperform the data-collection behavior policy, highlighting the need for more effective methods. We hope NeoRL-2 will accelerate the development of reinforcement learning algorithms for real-world applications. The benchmark project page is available at https://github.com/polixir/NeoRL2.
中文摘要:NeoRL-2作为扩展基准,针对现实世界中保守数据分布和延迟动作效应等挑战而构建,现有方法常难以超越原始行为策略,旨在推动实用强化学习算法的发展。
English Summary: NeoRL-2 is an extended benchmark addressing real-world offline RL challenges like conservative data distributions and delayed action effects, where current methods often fail to surpass the original behavior policies, aiming to advance practical algorithm development.
Authors:Zhen Zhang, Ignavier Ng, Dong Gong, Yuhang Liu, Mingming Gong, Biwei Huang, Kun Zhang, Anton van den Hengel, Javen Qinfeng Shi
Abstract:
Recovering the underlying Directed Acyclic Graph (DAG) structures from observational data presents a formidable challenge, partly due to the combinatorial nature of the DAG-constrained optimization problem. Recently, researchers have identified gradient vanishing as one of the primary obstacles in differentiable DAG learning and have proposed several DAG constraints to mitigate this issue. By developing the necessary theory to establish a connection between analytic functions and DAG constraints, we demonstrate that analytic functions from the set $\{f(x) = c_0 + \sum_{i=1}^{\infty}c_ix^i | \forall i > 0, c_i > 0; r = \lim_{i\rightarrow \infty}c_{i}/c_{i+1} > 0\}$ can be employed to formulate effective DAG constraints. Furthermore, we establish that this set of functions is closed under several functional operators, including differentiation, summation, and multiplication. Consequently, these operators can be leveraged to create novel DAG constraints based on existing ones. Using these properties, we design a series of DAG constraints and develop an efficient algorithm to evaluate them. Experiments in various settings demonstrate that our DAG constraints outperform previous state-of-the-art comparators. Our implementation is available at https://github.com/zzhang1987/AnalyticDAGLearning.
中文: 本研究提出了一类新的解析函数来构建有效的有向无环图约束,以解决可微分DAG学习中的梯度消失问题,并通过实验验证和高效评估算法证明了其优于现有方法的性能。
English: This study introduces a novel class of analytic functions to formulate effective DAG constraints that address gradient vanishing in differentiable DAG learning, demonstrating superior performance over existing methods through experimental validation and an efficient evaluation algorithm.
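The exponential series $f(x) = \sum_i x^i / i!$ has all-positive coefficients with $c_i / c_{i+1} = i + 1$, so it fits the spirit of this family and recovers the well-known acyclicity constraint $h(A) = \mathrm{tr}(\exp(A \circ A)) - d$, which is zero exactly when the weighted adjacency matrix $A$ encodes a DAG. The snippet below is a small numerical check of that classical instance, not the paper's new constraints.

```python
import numpy as np
from scipy.linalg import expm

def dag_constraint_exp(A):
    """h(A) = tr(exp(A * A)) - d, where * is the elementwise (Hadamard) product;
    h(A) = 0 iff the directed graph of A is acyclic."""
    d = A.shape[0]
    return np.trace(expm(A * A)) - d

acyclic = np.array([[0.0, 1.5, 0.0],
                    [0.0, 0.0, 2.0],
                    [0.0, 0.0, 0.0]])   # 1 -> 2 -> 3, no cycles
cyclic = acyclic.copy()
cyclic[2, 0] = 0.7                       # adds the edge 3 -> 1, closing a cycle
print(dag_constraint_exp(acyclic))       # ~0
print(dag_constraint_exp(cyclic))        # > 0
```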
Authors:Jianren Wang, Yifan Su, Abhinav Gupta, Deepak Pathak
Abstract:
On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.
Authors:Haoliang Shang, Hanyu Wu, Guangyao Zhai, Boyang Sun, Fangjinhua Wang, Federico Tombari, Marc Pollefeys
Abstract:
Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs -- whether by adding nodes or modifying edges -- remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node's relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.
中文: SG-Tailor是一种自回归模型,能预测场景图中节点间无冲突的关系,通过节点添加和边修改实现连贯的图操作,并在性能上大幅超越现有方法。
English: SG-Tailor is an autoregressive model that predicts conflict-free relationships between nodes in scene graphs, enabling coherent graph manipulation through node addition and edge modification while outperforming other methods significantly.
Authors:Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan
Abstract:
World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.
中文摘要:AdaWorld是一种创新的世界模型,通过自监督方式从视频中提取潜在动作,能够在有限数据和微调下高效适应新环境。
English Summary: AdaWorld is an innovative world model that learns latent actions from videos in a self-supervised way, enabling efficient adaptation to new environments with limited data and fine-tuning.
Authors:Moussa Kassem Sbeyti, Nadja Klein, Azarm Nowzad, Fikret Sivrikaya, Sahin Albayrak
Abstract:
Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four building blocks that can be seamlessly integrated into an SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in up to 6% increase in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code is available at https://mos-ks.github.io/publications.
Authors:Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Abstract:
Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on the assumption of high per-token cosine similarity across layers, which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than the state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
Chinese: xKV是一种后训练方法,通过奇异值分解将多层KV缓存压缩至共享低秩子空间,在长上下文基准测试中实现6.8倍压缩率提升,同时准确率提高2.7%。
English: xKV is a post-training method that uses Singular Value Decomposition to compress the KV-Cache across multiple layers, achieving up to 6.8x higher compression than existing techniques while improving accuracy by 2.7% on long-context benchmarks.
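The stated observation (dominant singular vectors aligned across layers) suggests a simple construction: stack the K (or V) matrices of a group of layers, take an SVD to obtain a shared right-singular basis, and store only each layer's low-rank coefficients in that basis. The sketch below illustrates that construction; it is a reading of the abstract, not the released xKV code.

```python
import torch

def shared_subspace_compress(kv_layers, rank):
    """kv_layers: list of (tokens, head_dim) tensors from a group of layers."""
    stacked = torch.cat(kv_layers, dim=0)                  # (num_layers * tokens, head_dim)
    _, _, vh = torch.linalg.svd(stacked, full_matrices=False)
    basis = vh[:rank].T                                    # shared (head_dim, rank) subspace
    coeffs = [kv @ basis for kv in kv_layers]              # per-layer low-rank coefficients
    return basis, coeffs

def reconstruct(basis, coeffs):
    return [c @ basis.T for c in coeffs]

layers = [torch.randn(256, 128) for _ in range(4)]         # toy KV slices for 4 grouped layers
basis, coeffs = shared_subspace_compress(layers, rank=32)
approx = reconstruct(basis, coeffs)                        # low-rank approximation per layer
```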
Authors:Yuhang Wang, Hanwei Guo, Sizhe Wang, Long Qian, Xuguang Lan
Abstract:
Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at https://github.com/wertyuilife2/bmpc.
Chinese: 自举模型预测控制(BMPC)通过模仿MPC专家策略并结合基于模型的时序差分学习,在复杂任务中显著提升了数据效率和性能表现。
English: Bootstrapped Model Predictive Control (BMPC) enhances continuous control by integrating policy imitation of an MPC expert with model-based TD-learning, achieving superior data efficiency and performance on complex tasks.
Authors:Daniel Lepe-Soltero, Thierry Artières, Anaïs Baudot, Paul Villoutreix
Abstract:
An important objective in computational biology is the efficient integration of multi-omics data. The task of integration comes with challenges: multi-omics data are most often unpaired (requiring diagonal integration), partially labeled with information about biological conditions, and in some situations such as rare diseases, only very small datasets are available. We present MODIS, a semi-supervised framework designed to account for these particular challenges. To address the challenge of very small datasets, we propose to exploit information contained in larger multi-omics databases by training our model on a large reference database and a small target dataset simultaneously, effectively turning the problem of transfer learning into a problem of learning with class imbalance. MODIS performs diagonal integration on unpaired samples, leveraging class-labels to align modalities despite class imbalance and data scarcity. The architecture combines multiple variational auto-encoders, a class classifier and an adversarially trained modality classifier. To ensure training stability, we adapted a regularized relativistic GAN loss to this setting. We first validate MODIS on a synthetic dataset to assess the level of supervision needed for accurate alignment and to quantify the impact of class imbalance on predictive performance. We then apply our approach to the large public TCGA database, considering between 10 and 34 classes (cancer types and normal tissue). MODIS demonstrates high prediction accuracy, robust performance with limited supervision, and stability to class imbalance. These results position MODIS as a promising solution for challenging integration scenarios, particularly diagonal integration with a small number of samples, typical of rare diseases studies. The code is available at https://github.com/VILLOUTREIXLab/MODIS.
中文: MODIS是一个半监督框架,能有效整合未配对的多组学数据,通过利用大型参考数据库和小型目标数据集,在类别不平衡和数据稀缺的情况下仍能实现稳健的对角整合性能。
English: MODIS is a semi-supervised framework that effectively integrates unpaired multi-omics data by leveraging large reference databases and small target datasets, demonstrating robust performance in diagonal integration despite class imbalance and data scarcity.
Authors:Nathan Darjana, Ryo Fujii, Hideo Saito, Hiroki Kajita
Abstract:
Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.
Chinese: EgoSurgery-HTS数据集为开放手术视频中的手术工具和手部提供了像素级标注,显著提升了分割精度,有助于深入分析手术操作。
English: The EgoSurgery-HTS dataset provides pixel-level annotations for surgical tools and hands in egocentric open-surgery videos, significantly improving segmentation accuracy and enabling detailed analysis of surgical actions.
Authors:Chengxiang Huang, Yake Wei, Zequn Yang, Di Hu
Abstract:
Sensory training during the early ages is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage is also important for the multimodal learning process, where dataset information is rapidly acquired. We refer to this stage as the prime learning window. However, based on our observation, this prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (InfoReg), a method designed to balance information acquisition among modalities. Specifically, InfoReg slows down the information acquisition process of information-sufficient modalities during the prime learning window, which could promote information acquisition of information-insufficient modalities. This regulation enables a more balanced learning process and improves the overall performance of the multimodal network. Experiments show that InfoReg outperforms related multimodal imbalanced methods across various datasets, achieving superior model performance. The code is available at https://github.com/GeWu-Lab/InfoReg_CVPR2025.
Chinese: 本研究提出InfoReg方法,通过在关键学习窗口减缓信息充足模态的学习速度来促进信息不足模态的信息获取,从而平衡多模态学习过程并提升整体性能。
English: The study introduces InfoReg, a method that balances multimodal learning by slowing dominant modalities during the prime learning window to enhance weaker ones, improving overall performance across datasets.
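The core mechanism described above, slowing an information-sufficient modality during an early window, can be pictured as gradient damping on that modality's encoder. Below is a minimal, hypothetical sketch of this idea in PyTorch; the modality names, window length, and damping factor are illustrative, and InfoReg's actual criterion for detecting the prime learning window and measuring information acquisition is defined in the paper, not here.

```python
# Minimal, hypothetical sketch of per-modality regulation via gradient damping.
# The names (audio/video) and the fixed window/factor are illustrative only;
# InfoReg's actual criterion for the prime learning window is not reproduced.
import torch


class GradientScale(torch.autograd.Function):
    """Identity in the forward pass; scales gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None


def regulate(features: torch.Tensor, epoch: int, window: int = 10, factor: float = 0.3):
    """Slow down learning for an information-sufficient modality inside the window."""
    return GradientScale.apply(features, factor if epoch < window else 1.0)


# Inside a training step (audio assumed to be the dominant modality):
#   audio_feat = regulate(audio_encoder(audio), epoch)
#   video_feat = video_encoder(video)
#   loss = criterion(fusion_head(audio_feat, video_feat), labels)
```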
Authors:Xudong Mou, Rui Wang, Bo Li, Tianyu Wo, Jie Sun, Hui Wang, Xudong Liu
Abstract:
The accumulation of time-series signals and the absence of labels make time-series Anomaly Detection (AD) a self-supervised task of deep learning. Methods based on normality assumptions face the following three limitations: (1) A single assumption could hardly characterize the whole normality or lead to some deviation. (2) Some assumptions may go against the principle of AD. (3) Their basic assumption is that the training data is uncontaminated (free of anomalies), which is unrealistic in practice, leading to a decline in robustness. This paper proposes a novel robust approach, RoCA, which is the first to address all of the above three challenges, as far as we are aware. It fuses the separated assumptions of one-class classification and contrastive learning in a single training process to characterize a more complete so-called normality. Additionally, it monitors the training data and computes a carefully designed anomaly score throughout the training process. This score helps identify latent anomalies, which are then used to define the classification boundary, inspired by the concept of outlier exposure. The performance on AIOps datasets improved by 6% compared to when contamination was not considered (COCA). On two large and high-dimensional multivariate datasets, the performance increased by 5% to 10%. RoCA achieves the highest average performance on both univariate and multivariate datasets. The source code is available at https://github.com/ruiking04/RoCA.
中文摘要:时序异常检测作为自监督学习任务,传统基于正态性假设的方法存在三大局限导致鲁棒性下降;本文提出的RoCA方法通过融合单分类与对比学习假设并识别潜在异常,在各类数据集上实现性能提升5-10%。
English Summary: Time-series anomaly detection is a self-supervised learning task where traditional methods relying on normality assumptions face three key limitations, leading to reduced robustness; the proposed RoCA method addresses these by integrating one-class classification and contrastive learning while identifying latent anomalies to improve performance by 5-10% across datasets.
Authors:Hankyul Kang, Gregor Seifer, Donghyun Lee, Jongbin Ryu
Abstract:
According to the forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brain can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus' theory, we introduce the view-batch model that adjusts the learning schedules to optimize the recall interval between retraining the same samples. The proposed view-batch model allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a replay method that guarantees the optimal recall interval, and 2) a self-supervised learning that acquires extensive knowledge from a single training sample at a time. We empirically show that these approaches of our method are aligned with the forgetting curve theory, which can enhance long-term memory. In our experiments, we also demonstrate that our method significantly improves many state-of-the-art continual learning methods in various protocols and scenarios. We open-source this project at https://github.com/hankyul2/ViewBatchModel.
中文: 本研究基于遗忘曲线理论,提出视图批次模型,通过优化回忆间隔并结合回放方法与自监督学习,有效增强持续学习系统中的长期记忆保持能力。
English: Based on the forgetting curve theory, this study proposes a view-batch model that optimizes recall intervals and incorporates replay methods with self-supervised learning to enhance long-term memory retention in continual learning systems.
Authors:Xu Han, Yuan Tang, Jinfeng Xu, Xianzhi Li
Abstract:
We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method tailored for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with many 3D representation learning backbones. At its core, we present a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, e.g., 97.5% acc. on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification, while it can also combine with other matrix decompositions (e.g., Low-rank, Kronecker) to further reduce parameters.
中文: MoST是一种专为三维表示学习设计的参数高效微调方法,通过稀疏Point Monarch矩阵捕捉不规则点的局部几何特征,无需额外推理开销,在ScanObjectNN和ModelNet40等基准测试中达到了最先进性能。
English: MoST is a novel parameter-efficient fine-tuning method for 3D representation learning that uses sparse Point Monarch matrices to capture local geometric features without inference overhead, achieving state-of-the-art performance on benchmarks like ScanObjectNN and ModelNet40.
Authors:Christoforos N. Spartalis, Theodoros Semertzidis, Efstratios Gavves, Petros Daras
Abstract:
We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
中文: LoTUS是一种新颖的机器遗忘方法,无需重新训练即可高效消除预训练模型中训练数据的影响,在多个基准测试中效率和效果均优于现有方法。
English: LoTUS is a novel Machine Unlearning method that efficiently removes the influence of training data from pre-trained models without retraining, outperforming existing methods in both efficiency and effectiveness across multiple benchmarks.
Authors:Massimo Bini, Leander Girrbach, Zeynep Akata
Abstract:
Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at https://github.com/ExplainableML/DeLoRA.
中文: DeLoRA是一种新颖的参数高效微调方法,通过将角度学习与适应强度解耦来增强鲁棒性,在多项任务中实现优于或持平现有方法的性能,同时保持更高的稳定性。
English: DeLoRA is a novel parameter-efficient fine-tuning method that enhances robustness by decoupling angular learning from adaptation strength, achieving superior or comparable performance across multiple tasks while maintaining greater stability than existing approaches.
Authors:Suman Adhya, Avishek Lahiri, Debarshi Kumar Sanyal, Partha Pratim Das
Abstract:
Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The goal of this approach is to add robustness to deep learning models to learn better representation by comparing the positive samples against the negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, a comprehensive study of the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.
中文摘要:负采样通过对比正负样本提升神经主题模型的性能,实验证明其能显著增强主题连贯性、多样性及分类准确性,推动模型效果进步。
English Summary: Negative sampling enhances neural topic models by comparing positive and negative samples, significantly improving topic coherence, diversity, and classification accuracy as demonstrated in comprehensive experiments.
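As a rough illustration of how a negative sampling term can be attached to a VAE-based topic model decoder, the sketch below contrasts the reconstruction from the true topic proportions with one from permuted proportions. This triplet-style variant is only an assumed example; the paper compares several negative sampling strategies.

```python
# Assumed triplet-style formulation: negatives are reconstructions produced
# from permuted topic proportions; the paper evaluates several strategies.
import torch
import torch.nn.functional as F


def reconstruct(theta: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Word distribution from topic proportions theta (B, K) and topic-word matrix beta (K, V)."""
    return torch.softmax(theta @ beta, dim=-1)


def negative_sampling_loss(bow: torch.Tensor, theta: torch.Tensor, beta: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    pos = reconstruct(theta, beta)                                 # true reconstruction
    neg = reconstruct(theta[torch.randperm(theta.size(0))], beta)  # negative reconstruction
    pos_ll = (bow * torch.log(pos + 1e-10)).sum(-1)                # log-likelihood of the bag of words
    neg_ll = (bow * torch.log(neg + 1e-10)).sum(-1)
    # push the true reconstruction to explain the document better than the negative
    return F.relu(margin - (pos_ll - neg_ll)).mean()
```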
Authors:Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Abstract:
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) catastrophic forgetting of previously learned tasks, leading to error accumulation in the server model and making it difficult to sustain comprehensive knowledge across all tasks; (2) biased optimization due to asynchronous tasks handled across different clients, leading to collisions between the optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (\textbf{FedDAH}), which facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper), where a continually updated hypernetwork manages the mapping between task identities and their associated model parameters, enabling dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) that incorporates candidate changes of historical models into current server updates and assigns weights to identical tasks across different time steps based on their similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at: https://github.com/jinlab-imvr/FedDAH.
中文: 联邦持续学习在动态医疗场景下面临灾难性遗忘和优化偏差的挑战,FedDAH方法通过动态分配超网络和自适应模型重校准技术,有效提升了跨客户端的协同学习能力。
English: Federated continual learning (FCL) faces challenges of catastrophic forgetting and biased optimization in dynamic medical scenarios, which the proposed FedDAH method addresses through a dynamic allocation hypernetwork and adaptive model recalibration to enhance collaborative learning across clients.
Authors:Hongshu Guo, Sijie Ma, Zechuan Huang, Yuzhi Hu, Zeyuan Ma, Xinglin Zhang, Yue-Jiao Gong
Abstract:
Recently, Meta-Black-Box-Optimization (MetaBBO) methods significantly enhance the performance of traditional black-box optimizers through meta-learning flexible and generalizable meta-level policies that excel in dynamic algorithm configuration (DAC) tasks within the low-level optimization, reducing the expertise required to adapt optimizers for novel optimization tasks. Though promising, existing MetaBBO methods heavily rely on human-crafted feature extraction approach to secure learning effectiveness. To address this issue, this paper introduces a novel MetaBBO method that supports automated feature learning during the meta-learning process, termed as RLDE-AFL, which integrates a learnable feature extraction module into a reinforcement learning-based DE method to learn both the feature encoding and meta-level policy. Specifically, we design an attention-based neural network with mantissa-exponent based embedding to transform the solution populations and corresponding objective values during the low-level optimization into expressive landscape features. We further incorporate a comprehensive algorithm configuration space including diverse DE operators into a reinforcement learning-aided DAC paradigm to unleash the behavior diversity and performance of the proposed RLDE-AFL. Extensive benchmark results show that co-training the proposed feature learning module and DAC policy contributes to the superior optimization performance of RLDE-AFL to several advanced DE methods and recent MetaBBO baselines over both synthetic and realistic BBO scenarios. The source codes of RLDE-AFL are available at https://github.com/GMC-DRL/RLDE-AFL.
中文摘要:本文提出RLDE-AFL方法,通过基于注意力的神经网络和强化学习实现特征自动提取,在多种优化场景中显著优于现有元黑盒优化方法。
English Summary: This paper introduces RLDE-AFL, a Meta-Black-Box-Optimization method that automates feature learning through an attention-based neural network and reinforcement learning, outperforming existing methods in diverse optimization scenarios.
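The mantissa-exponent embedding mentioned above can be approximated with a frexp-style decomposition that keeps objective values of very different scales in a bounded range. The sketch below is an assumption about the general form; the attention-based encoder that RLDE-AFL builds on top of such features, and its exact normalization, are not shown.

```python
# Assumed frexp-style decomposition; RLDE-AFL's attention encoder on top of
# such features is not shown. The exponent normalizer (64) is arbitrary.
import numpy as np


def mantissa_exponent_features(fitness: np.ndarray) -> np.ndarray:
    """Map objective values of arbitrary scale to bounded (sign, mantissa, exponent) triples."""
    sign = np.sign(fitness)
    mantissa, exponent = np.frexp(np.abs(fitness))  # mantissa in [0.5, 1), value = mantissa * 2**exponent
    return np.stack([sign, mantissa, exponent / 64.0], axis=-1)


# Values spanning many orders of magnitude become comparable features:
print(mantissa_exponent_features(np.array([1e-8, 3.5, -4.2e12])))
```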
Authors:Zeyuan Ma, Zhiyang Huang, Jiacheng Chen, Zhiguang Cao, Yue-Jiao Gong
Abstract:
Recent Meta-Black-Box Optimization (MetaBBO) approaches have shown possibility of enhancing the optimization performance through learning meta-level policies to dynamically configure low-level optimizers. However, existing MetaBBO approaches potentially consume massive function evaluations to train their meta-level policies. Inspired by the recent trend of using surrogate models for cost-friendly evaluation of expensive optimization problems, in this paper, we propose a novel MetaBBO framework which combines surrogate learning process and reinforcement learning-aided Differential Evolution algorithm, namely Surr-RLDE, to address the intensive function evaluation in MetaBBO. Surr-RLDE comprises two learning stages: surrogate learning and policy learning. In surrogate learning, we train a Kolmogorov-Arnold Networks (KAN) with a novel relative-order-aware loss to accurately approximate the objective functions of the problem instances used for subsequent policy learning. In policy learning, we employ reinforcement learning (RL) to dynamically configure the mutation operator in DE. The learned surrogate model is integrated into the training of the RL-based policy to substitute for the original objective function, which effectively reduces consumed evaluations during policy learning. Extensive benchmark results demonstrate that Surr-RLDE not only shows competitive performance to recent baselines, but also shows compelling generalization for higher-dimensional problems. Further ablation studies underscore the effectiveness of each technical components in Surr-RLDE. We open-source Surr-RLDE at https://github.com/GMC-DRL/Surr-RLDE.
中文: 提出的Surr-RLDE框架结合了基于科尔莫戈罗夫-阿诺德网络的代理学习和强化学习来动态配置差分进化算子,在保持竞争优势和强大泛化能力的同时,显著减少了函数评估次数。
English: The proposed Surr-RLDE framework combines surrogate learning using Kolmogorov-Arnold Networks with reinforcement learning to dynamically configure Differential Evolution operators, significantly reducing function evaluations while maintaining competitive optimization performance and strong generalization capabilities.
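One plausible reading of a "relative-order-aware" surrogate loss is a regression term plus a pairwise penalty on rank inversions. The sketch below illustrates that combination only; the exact loss used in Surr-RLDE and the KAN surrogate itself are defined in the paper and repository.

```python
# One plausible form of a relative-order-aware surrogate loss: MSE plus a
# pairwise hinge on rank inversions. The exact Surr-RLDE loss may differ.
import torch
import torch.nn.functional as F


def relative_order_aware_loss(pred: torch.Tensor, target: torch.Tensor,
                              alpha: float = 1.0, margin: float = 0.0) -> torch.Tensor:
    """pred, target: shape (B,). Assumes the batch contains distinct target values."""
    mse = F.mse_loss(pred, target)
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)      # pairwise predicted differences
    diff_true = target.unsqueeze(0) - target.unsqueeze(1)  # pairwise true differences
    sign_true = torch.sign(diff_true)
    # penalize pairs whose predicted ordering disagrees with the true ordering
    rank_term = F.relu(margin - sign_true * diff_pred)[sign_true != 0].mean()
    return mse + alpha * rank_term
```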
Authors:Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar
Abstract:
Large Language Models (LLMs) have significantly advanced various fields, particularly coding, mathematical reasoning, and logical problem solving. However, a critical question remains: Do these mathematical reasoning abilities persist when LLMs are presented with culturally adapted math problems? Specifically, how do LLMs perform when faced with math problems embedded in cultural contexts that have no significant representation in mainstream web-scale AI training data? To explore this, we generated six synthetic cultural datasets from GSM8K, a widely used benchmark for assessing LLMs' mathematical reasoning skills. While preserving the mathematical logic and numerical values of the original GSM8K test set, we modify cultural elements such as personal names, food items, place names, etc. These culturally adapted datasets provide a more reliable framework for evaluating LLMs' mathematical reasoning under shifting cultural contexts. Our findings reveal that LLMs struggle with math problems when cultural references change, even though the underlying mathematical structure remains constant. Smaller models exhibit greater performance drops compared to larger models. Interestingly, our results also suggest that cultural familiarity can enhance mathematical reasoning. Even models with no explicit mathematical training but exposure to relevant cultural contexts sometimes outperform larger, mathematically proficient models on culturally embedded math problems. This study highlights the impact of cultural context on the mathematical reasoning abilities of LLMs, underscoring the need for more diverse and representative training data to improve robustness in real-world applications. The benchmark datasets and script for reproducing the results are available at https://github.com/akarim23131/Lost_in_Cultural_Translation
中文: 大型语言模型在文化适应数学问题上表现不佳,尽管数学逻辑未变,尤其小型模型性能下降更明显,表明文化熟悉度可提升推理能力。
English: Large Language Models struggle with culturally adapted math problems despite unchanged mathematical logic, revealing performance drops especially in smaller models and suggesting cultural familiarity can enhance reasoning.
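The adaptation procedure boils down to swapping surface-level cultural references while leaving every number, and hence the mathematical structure, untouched. The toy sketch below shows the kind of substitution involved; the example problem and the mapping table are illustrative and are not taken from GSM8K or the released datasets or scripts.

```python
# Toy example of the substitution step; the problem text and mapping table are
# made up here, not taken from GSM8K or the released cultural datasets.
SUBSTITUTIONS = {
    "Natalia": "Aisha",
    "muffins": "samosas",
    "the farmers' market": "the local bazaar",
}


def culturally_adapt(problem: str, table: dict) -> str:
    """Swap cultural references; numbers (and the math) are left untouched."""
    for original, replacement in table.items():
        problem = problem.replace(original, replacement)
    return problem


item = "Natalia sold 48 muffins at the farmers' market and twice as many in May."
print(culturally_adapt(item, SUBSTITUTIONS))
# -> "Aisha sold 48 samosas at the local bazaar and twice as many in May."
```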
Authors:Yali Fu, Jindong Li, Qi Wang, Qianli Xing
Abstract:
Unsupervised graph-level anomaly detection (UGLAD) is a critical and challenging task across various domains, such as social network analysis, anti-cancer drug discovery, and toxic molecule identification. However, existing methods often struggle to capture the long-range dependencies efficiently and neglect the spectral information. Recently, selective State Space Models (SSMs), particularly Mamba, have demonstrated remarkable advantages in capturing long-range dependencies with linear complexity and a selection mechanism. Motivated by their success across various domains, we propose GLADMamba, a novel framework that adapts the selective state space model into UGLAD field. We design View-Fused Mamba (VFM) with a Mamba-Transformer-style architecture to efficiently fuse information from different views with a selective state mechanism. We also design Spectrum-Guided Mamba (SGM) with a Mamba-Transformer-style architecture to leverage the Rayleigh quotient to guide the embedding refining process. GLADMamba can dynamically focus on anomaly-related information while discarding irrelevant information for anomaly detection. To the best of our knowledge, this is the first work to introduce Mamba and explicit spectral information to UGLAD. Extensive experiments on 12 real-world datasets demonstrate that GLADMamba outperforms existing state-of-the-art methods, achieving superior performance in UGLAD. The code is available at https://github.com/Yali-F/GLADMamba.
Chinese: GLADMamba提出了一种新颖框架,将选择性状态空间模型(特别是Mamba)引入无监督图级异常检测领域,通过有效捕获长程依赖并结合频谱信息,在多个数据集上实现了卓越性能。
English: GLADMamba introduces a novel framework that adapts selective state space models, specifically Mamba, into unsupervised graph-level anomaly detection by efficiently capturing long-range dependencies and incorporating spectral information, achieving superior performance across multiple datasets.
Authors:Peijin Guo, Minghui Li, Hewen Pan, Ruixiang Huang, Lulu Xue, Shengqing Hu, Zikang Guo, Wei Wan, Shengshan Hu
Abstract:
While deep learning models play a crucial role in predicting antibody-antigen interactions (AAI), the scarcity of publicly available sequence-structure pairings constrains their generalization. Current AAI methods often focus on residue-level static details, overlooking fine-grained structural representations of antibodies and their inter-antibody similarities. To tackle this challenge, we introduce a multi-modality representation approach that integrates 3D structural and 1D sequence data to unravel intricate intra-antibody hierarchical relationships. By harnessing these representations, we present MuLAAIP, an AAI prediction framework that utilizes graph attention networks to illuminate graph-level structural features and normalized adaptive graph convolution networks to capture inter-antibody sequence associations. Furthermore, we have curated an AAI benchmark dataset comprising both structural and sequence information along with interaction labels. Through extensive experiments on this benchmark, our results demonstrate that MuLAAIP outperforms current state-of-the-art methods in terms of predictive performance. The implementation code and dataset are publicly available at https://github.com/trashTian/MuLAAIP for reproducibility.
Chinese: 针对现有抗体-抗原相互作用预测方法的不足,我们开发了MuLAAIP多模态框架,通过图网络整合三维结构和一维序列数据,在我们新构建的基准数据集上展现出卓越的预测性能。
English: To address the limitations of existing antibody-antigen interaction prediction methods, we developed MuLAAIP, a multi-modality framework that integrates 3D structural and 1D sequence data using graph networks, which demonstrates superior predictive performance on our newly curated benchmark dataset.
Authors:Jaeyeon Lee, Guantong Qi, Matthew Brady Neeley, Zhandong Liu, Hyun-Hwan Jeong
Abstract:
Recent advancements in large language models (LLMs) integrating explicit reasoning, such as OpenAI's o3-mini, DeepSeek-R1, and QWQ-32B, enable smaller models to solve complex tasks by generating intermediate reasoning steps prior to providing answers. However, this approach significantly increases computational costs, both monetarily and environmentally. The widely-used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy, often requiring between 40 to 64 samples per task. Although aggregation effectively reduces variance and bias, additional sampling can lead to diminishing returns when early samples yield consistent results. To address inefficiencies, we propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved. We calibrate SPRT parameters specifically for LLM applications, accounting for sensitivity to detect the mode of the distribution. Our experiments demonstrate that incorporating SPRT significantly enhances token efficiency, achieving comparable accuracy to self-consistency methods but at a substantially reduced computational cost. To promote transparency and facilitate reproducibility, we have made the source code and datasets used in our experiments publicly available at our GitHub repository: https://github.com/LiuzLab/consol, or available as a PyPI package: pip install consol. We hope that this resource will support further research and encourage the development of new methods building upon our work.
中文摘要:本研究提出采用序贯概率比检验(SPRT)动态终止大语言模型采样,在保持与传统自洽方法相当准确度的同时,显著降低了计算成本。
English Summary: This study introduces Sequential Probability Ratio Testing (SPRT) to dynamically halt sampling in large language models once sufficient consistency is reached, significantly reducing computational costs while maintaining accuracy comparable to traditional self-consistency methods.
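The stopping rule can be illustrated with a plain Wald SPRT on the share of samples that agree with the current modal answer: keep drawing samples until the log-likelihood ratio crosses either boundary. This is a generic sketch under assumed agreement rates p0 and p1; the paper calibrates the SPRT parameters specifically for LLM mode detection, which is not reproduced here.

```python
# Generic Wald SPRT on the share of samples agreeing with the current mode;
# the paper calibrates p0/p1/alpha/beta specifically for LLM mode detection.
import math
from collections import Counter


def sprt_self_consistency(sample_fn, p0=0.5, p1=0.8, alpha=0.05, beta=0.05, max_samples=64):
    """sample_fn() returns one answer string per call (one LLM sample)."""
    accept_h1 = math.log((1 - beta) / alpha)   # consistent enough: stop early
    accept_h0 = math.log(beta / (1 - alpha))   # clearly inconsistent: also stop
    answers = []
    for _ in range(max_samples):
        answers.append(sample_fn())
        mode, count = Counter(answers).most_common(1)[0]
        n = len(answers)
        llr = count * math.log(p1 / p0) + (n - count) * math.log((1 - p1) / (1 - p0))
        if llr >= accept_h1 or llr <= accept_h0:
            break
    return Counter(answers).most_common(1)[0][0], len(answers)
```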
Authors:Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
Abstract:
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
中文: 层索引重缩放(LIR)和目标方差重缩放(TVR)技术通过优化权重初始化和方差控制,显著提升大语言模型预训练效果,改善下游任务表现并缓解激活异常问题。
English: The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve LLM pre-training by optimizing weight initialization and variance control, leading to enhanced downstream task performance and reduced activation issues.
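The abstract does not spell out the LIR and TVR formulas, so the sketch below only illustrates the two ideas as explicit assumptions: shrink the initialization scale as a function of the layer index, and periodically rescale weights toward a target standard deviation. The exact rules are defined in the paper and the linked repository.

```python
# Assumed forms only: (i) LIR as depth-dependent shrinking of the init std,
# (ii) TVR as periodic rescaling toward a target std. The actual formulas are
# defined in the paper and the weight_rescaling repository.
import math
import torch


def lir_init_(weight: torch.Tensor, layer_idx: int, base_std: float = 0.02) -> None:
    """Assumed layer-index rescaling: deeper layers start with a smaller std."""
    torch.nn.init.normal_(weight, mean=0.0, std=base_std / math.sqrt(layer_idx + 1))


@torch.no_grad()
def tvr_rescale_(weight: torch.Tensor, target_std: float = 0.02) -> None:
    """Assumed target variance rescaling, applied every few training steps."""
    current = weight.std()
    if current > 0:
        weight.mul_(target_std / current)
```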
Authors:Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, Lijun Wu
Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model's reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.
Chinese: LEMMA方法通过利用包含错误步骤及其反思后修正为正确答案的数据进行微调,提升了大型语言模型的数学推理能力,使其能在生成过程中自主纠错,无需依赖外部评判模型,并实现了显著的性能提升。
English: The LEMMA approach enhances large language models' mathematical reasoning by fine-tuning them on data that includes incorrect solutions with errors and reflections leading to correct answers, enabling self-correction without external models and achieving superior performance.
Authors:Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang
Abstract:
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: \href{https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling}{\color[RGB]{175,36,67}{LCLM-Horizon}}.
中文: 本文对长上下文语言模型的最新进展进行了全面综述,涵盖其开发、训练、部署和评估,并探讨了应用场景与未来发展方向。
English: This paper presents a comprehensive survey on recent advances in long-context language models, covering their development, training, deployment, and evaluation while exploring applications and future directions.
Authors:Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu, Xiang He, Judah Goldfeder, Nadav Timor, Allen G Roush, Ravid Shwartz-Ziv
Abstract:
Many high-impact machine learning tasks involve multi-dimensional data such as images, volumetric medical scans, and multivariate time-series. Yet, most neural architectures flatten these inputs, discarding critical cross-dimension information. We introduce $\textbf{NdLinear}$, a novel linear transformation that circumvents this destructive flattening by operating directly on tensors. NdLinear applies transformations separately along each data dimension, thereby preserving the native data structure. Extensive experiments demonstrate NdLinear's capacity to significantly enhance representational power, achieve dramatic parameter reductions (often by orders of magnitude), and maintain a favorable computational profile. For instance, when applied to Large Language Model finetuning, our $\textbf{NdLinear-LoRA}$ delivers comparable or improved accuracy on reasoning tasks using up to $9\times$ fewer trainable parameters than standard LoRA. These broad advantages of NdLinear are consistently validated across diverse neural architectures (CNNs, RNNs, Transformers, MLPs) and data domains, including vision, language, time-series, and tabular tasks. As a versatile, drop-in replacement for standard linear layers, NdLinear processes data in its original N-dimensional form, offering a foundational component for developing more efficient and powerful next-generation neural architectures.
中文: NdLinear是一种基于张量的线性层,无需对输入进行扁平化处理,能在保持数据结构的同时大幅减少参数并提升计算效率,适用于多种深度学习任务。
English: NdLinear is a tensor-based linear layer that eliminates the need for input flattening, achieving significant parameter reduction and computational efficiency while preserving data structure across various deep learning tasks.
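The central operation, one small linear map per tensor dimension applied along that dimension only, can be written in a few lines. The sketch below follows the description in the abstract; the official NdLinear layer may differ in initialization, bias handling, and other details.

```python
# Sketch of the factorized idea described above; the official NdLinear layer
# may differ in initialization, bias handling, and other details.
import torch
import torch.nn as nn


class NdLinearSketch(nn.Module):
    """One small weight matrix per tensor dimension, applied along that dimension only."""

    def __init__(self, in_dims, out_dims):
        super().__init__()
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(i, o) * i ** -0.5)
            for i, o in zip(in_dims, out_dims)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d1, ..., dn)
        for axis, w in enumerate(self.weights, start=1):
            x = torch.movedim(x, axis, -1)  # bring dimension `axis` to the end
            x = x @ w                       # transform only that dimension
            x = torch.movedim(x, -1, axis)
        return x


layer = NdLinearSketch(in_dims=(28, 28, 3), out_dims=(16, 16, 8))
print(layer(torch.randn(4, 28, 28, 3)).shape)  # torch.Size([4, 16, 16, 8])
```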
Authors:Xianghan Meng, Zhiyuan Huang, Wei He, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Abstract:
Subspace clustering is a classical unsupervised learning task, built on a basic assumption that high-dimensional data can be approximated by a union of subspaces (UoS). Nevertheless, real-world data often deviate from the UoS assumption. To address this challenge, state-of-the-art deep subspace clustering algorithms attempt to jointly learn UoS representations and self-expressive coefficients. However, the general framework of the existing algorithms suffers from catastrophic feature collapse and lacks a theoretical guarantee to learn the desired UoS representation. In this paper, we present a Principled fRamewOrk for Deep Subspace Clustering (PRO-DSC), which is designed to learn structured representations and self-expressive coefficients in a unified manner. Specifically, in PRO-DSC, we incorporate an effective regularization on the learned representations into the self-expressive model, prove that the regularized self-expressive model is able to prevent feature space collapse, and demonstrate that the learned optimal representations under certain conditions lie on a union of orthogonal subspaces. Moreover, we provide a scalable and efficient approach to implement our PRO-DSC and conduct extensive experiments to verify our theoretical findings and demonstrate the superior performance of our proposed deep subspace clustering approach. The code is available at https://github.com/mengxianghan123/PRO-DSC.
中文:本文提出了PRO-DSC原则性深度子空间聚类框架,通过有效正则化防止特征坍缩,理论上保证学习到的表示位于正交子空间上,实验证明其具有优越性能。
English: This paper introduces PRO-DSC, a principled deep subspace clustering framework that prevents feature collapse through effective regularization and theoretically guarantees learned representations lie on orthogonal subspaces, demonstrating superior performance in experiments.
Authors:Shuang Guo, Friedhelm Hamann, Guillermo Gallego
Abstract:
Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method's state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, while they output only one quantity. Project page: https://github.com/tub-rip/E2FAI
Chinese: 我们提出了一种无监督学习框架,通过单一网络联合估计光流和图像强度,在精度和效率上均实现了显著提升,达到了领先性能。
English: We introduce an unsupervised learning framework that jointly estimates optical flow and image intensity using a single network, achieving state-of-the-art performance with significant improvements in accuracy and efficiency.
Authors:Yongli Xiang, Ziming Hong, Lina Yao, Dadong Wang, Tongliang Liu
Abstract:
Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" to restrict generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights, and is thus invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels, including: (i) data-intrinsic disguising (DID) for eliminating domain discrepancy and preserving class-related content at the input level, and (ii) model-guided disguising (MGD) for mitigating output-level statistical differences of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 55.7% in the unauthorized domain by using only 1% authorized samples, largely exceeding existing SOTA white-box attacks.
中文: 本研究提出JailNTL攻击方法,通过数据伪装在测试阶段绕过黑盒非可转移学习(NTL)的安全屏障,无需修改模型权重即可将未授权域准确率最高提升55.7%,显著超越现有白盒攻击效果。
English: This study introduces JailNTL, a novel black-box attack that bypasses non-transferable learning (NTL) security by disguising unauthorized data through input-level and output-level transformations, achieving up to 55.7% accuracy improvement without modifying model weights.
Authors:Sheng Wang, Pengan Chen, Jingqi Zhou, Qintong Li, Jingwei Dong, Jiahui Gao, Boyang Xue, Jiyue Jiang, Lingpeng Kong, Chuan Wu
Abstract:
Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases, and low-variation prompts, resulting in limited diversity and biased distributions with the increase of data scales. To tackle this challenge, we introduce TREESYNTH, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness before synthesizing samples within each atomic subspace. This globally dividing-and-synthesizing method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the rebalancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently demonstrate the superior data diversity, model performance, and robust scalability of TREESYNTH compared to both human-crafted datasets and peer data synthesis methods, with an average performance gain reaching 10%. Besides, the consistent improvements of TREESYNTH-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement. The code is available at https://github.com/cpa2001/TreeSynth.
Chinese: TREESYNTH是一种基于树引导空间划分的数据合成方法,通过将数据空间递归分割为原子子空间来生成多样平衡的数据集,其性能平均提升10%,显著优于现有方法。
English: TREESYNTH is a novel tree-guided data synthesis method that partitions the data space into atomic subspaces to generate diverse and balanced datasets, significantly outperforming existing approaches with a 10% average performance gain.
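The dividing-and-synthesizing idea can be pictured as enumerating mutually exclusive, exhaustive leaf subspaces over a set of attributes and seeding one synthesis prompt per leaf. In the sketch below the attributes and prompt template are made up for illustration; TREESYNTH additionally builds the partition recursively and lets the model propose the splitting attributes.

```python
# Illustrative attributes and prompt template (not the ones used by TREESYNTH);
# the paper builds the partition recursively and lets the model propose splits.
from itertools import product

ATTRIBUTES = {
    "topic": ["geometry", "probability", "algebra"],
    "difficulty": ["easy", "medium", "hard"],
    "format": ["word problem", "proof", "multiple choice"],
}


def leaf_subspaces(attributes: dict):
    """Mutually exclusive, exhaustive leaves = Cartesian product of attribute values."""
    names = list(attributes)
    for combo in product(*(attributes[n] for n in names)):
        yield dict(zip(names, combo))


prompts = [
    "Write one {difficulty} {format} about {topic}.".format(**leaf)
    for leaf in leaf_subspaces(ATTRIBUTES)
]
print(len(prompts), "atomic subspaces; e.g.,", prompts[0])
# Each prompt would seed LLM synthesis for its own leaf, avoiding collapse onto a few regions.
```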
Authors:Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci, Nicola Strisciuglio
Abstract:
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
中文摘要:视觉语言模型通过提出的"测地可分解嵌入"框架,在视觉表征中自动形成了组合推理能力,并在组合分类和群体鲁棒性任务中展现出卓越性能。
English Summary: Vision-Language Models automatically develop compositional reasoning in visual representations through the proposed Geodesically Decomposable Embeddings framework, which demonstrates superior performance in compositional classification and group robustness tasks.
Authors:Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
Abstract:
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights such that (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
中文: 本研究通过分析326个模型的九个质量维度填补了深度神经网络综合评估的空白,发现视觉语言模型和自监督学习能提升鲁棒性与公平性,并推出QUBA评分体系实现定制化模型推荐。
English: This study addresses the gap in evaluating deep neural networks' overall quality by analyzing nine dimensions across 326 models, revealing that vision-language models and self-supervised learning enhance robustness and fairness, and introduces the QUBA score for tailored model recommendations.
Authors:Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee
Abstract:
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method `Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
Chinese: 知识蒸馏在预计算教师模型输出时可高效传递大语言模型知识,但缓存Top-K概率等简单方法会产生偏差,导致性能不佳;我们提出的随机采样知识蒸馏方法提供无偏估计且仅需稀疏对数,能在模型规模从3亿到30亿范围内实现更快训练并保持竞争力。
English: Knowledge distillation can efficiently transfer knowledge from large language models if teacher logits are pre-computed, but naive methods like caching Top-K probabilities introduce bias, leading to suboptimal results; our proposed Random Sampling Knowledge Distillation offers unbiased estimates with sparse logits, enabling faster training and competitive performance across model sizes.
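The core estimator can be sketched as: cache a handful of vocabulary ids sampled from the teacher distribution offline, then train the student on a Monte-Carlo estimate of the teacher-student cross-entropy, which is unbiased because the ids were drawn from the teacher. The snippet below shows only this core idea under those assumptions, not the full method or its gradient-preservation analysis.

```python
# Core estimator only: cache sampled teacher ids offline, train the student on
# an unbiased Monte-Carlo estimate of the teacher-student cross-entropy.
import torch
import torch.nn.functional as F


def cache_teacher_samples(teacher_logits: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Offline: draw k vocabulary ids per position from the teacher distribution."""
    probs = torch.softmax(teacher_logits, dim=-1)          # (T, V)
    return torch.multinomial(probs, k, replacement=True)   # (T, k) -- far sparser than full logits


def rskd_loss(student_logits: torch.Tensor, sampled_ids: torch.Tensor) -> torch.Tensor:
    """Online: E_{v ~ p_teacher}[-log q_student(v)], unbiased since ids were drawn from the teacher."""
    log_q = F.log_softmax(student_logits, dim=-1)   # (T, V)
    return -log_q.gather(-1, sampled_ids).mean()
```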
Authors:Massa Baali, Xiang Li, Hao Chen, Syed Abdul Hannan, Rita Singh, Bhiksha Raj
Abstract:
Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8\% over all baseline models. The code is available at: https://github.com/massabaali7/CAARMA/
中文摘要:CAARMA是一个通过嵌入空间混合生成合成类别并采用对抗性优化确保真实性的类别增强框架,在说话人验证任务中相比基线模型实现了8%的性能提升。
English Summary: CAARMA is a class augmentation framework that enhances speaker verification by generating synthetic classes through embedding space mixing and adversarial refinement, achieving an 8% performance improvement over baseline models.
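The class augmentation step amounts to mixing embeddings from two distinct real speakers into a new synthetic class. The sketch below shows an assumed mixup-style version of that step; the adversarial refinement that keeps synthetic classes indistinguishable from real ones is part of the full framework and omitted here.

```python
# Assumed mixup-style synthesis step; the adversarial refinement used by CAARMA
# to keep synthetic classes indistinguishable from real ones is omitted.
import torch
import torch.nn.functional as F


def synthesize_class(emb_a: torch.Tensor, emb_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Mix (N, D) embeddings of two distinct real speakers into one synthetic class."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return F.normalize(mixed, dim=-1)  # embeddings for a new class with a new label
```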
Authors:Alejandro Ariza-Casabona, Nikos Kanakaris, Daniele Malitesta
Abstract:
Relational deep learning (RDL) stands among the most exciting advances in machine learning for relational databases, leveraging the representational power of message passing graph neural networks (GNNs) to derive useful knowledge and run prediction tasks on tables connected through primary-to-foreign key links. The RDL paradigm has recently been applied successfully to recommendation through its most recent representative deep learning architecture, namely ContextGNN. While acknowledging ContextGNN's improved performance on real-world recommendation datasets and tasks, preliminary tests for the more traditional static link prediction task (aka personalized item recommendation) on the popular Amazon Book dataset have demonstrated that ContextGNN still has room for improvement compared to other state-of-the-art GNN-based recommender systems. To this end, with this paper, we integrate ContextGNN within Elliot, a popular framework for reproducibility and benchmarking analyses, counting around 50 state-of-the-art recommendation models from the literature to date. On this basis, we run preliminary experiments on three standard recommendation datasets and against six state-of-the-art GNN-based recommender systems, confirming similar trends to those observed by the authors in their original paper. The code is publicly available on GitHub: https://github.com/danielemalitesta/Rel-DeepLearning-RecSys.
Chinese: 关系深度学习,尤其是ContextGNN,在推荐系统中显示出潜力,但在静态链接预测任务上仍需改进,这一点通过将其整合到Elliot框架中并与基于GNN的先进模型在标准数据集上的比较得到了验证。
English: Relational deep learning, particularly ContextGNN, shows promise in recommendation systems but requires further enhancement for static link prediction tasks, as demonstrated through integration with the Elliot framework and comparisons with other GNN-based models on standard datasets.
Authors:Moshiur Rahman Tonmoy, Md. Mithun Hossain, Nilanjan Dey, M. F. Mridha
Abstract:
Plant diseases significantly threaten global food security by reducing crop yields and undermining agricultural sustainability. AI-driven automated classification has emerged as a promising solution, with deep learning models demonstrating impressive performance in plant disease identification. However, deploying these models on mobile and edge devices remains challenging due to high computational demands and resource constraints, highlighting the need for lightweight, accurate solutions for accessible smart agriculture systems. To address this, we propose MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture designed for generalized plant disease classification, which optimizes resource efficiency while maintaining high performance. Extensive experiments across diverse plant disease datasets of varying scales show our model's effectiveness and strong generalizability, achieving test accuracies ranging from 80% to over 99%. Notably, with only 0.69 million parameters, our architecture outperforms the smallest versions of MobileViTv1 and MobileViTv2, despite their higher parameter counts. These results underscore the potential of our approach for real-world, AI-powered automated plant disease classification in sustainable and resource-efficient smart agriculture systems. All codes will be available in the GitHub repository: https://github.com/moshiurtonmoy/MobilePlantViT
中文摘要:提出的MobilePlantViT模型通过轻量级视觉变换器架构解决了植物病害分类中的计算难题,仅用69万参数即可实现高达99%的准确率,为可持续智慧农业提供了高效解决方案。
English Summary: The proposed MobilePlantViT model addresses computational challenges in plant disease classification by combining a lightweight Vision Transformer architecture with high accuracy, achieving up to 99% performance using only 0.69 million parameters for sustainable smart agriculture.
Authors:Katja Schwarz, Denys Rozumnyi, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder
Abstract:
We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: https://katjaschwarz.github.io/worlds/
中文: 该方法通过将单张图像生成沉浸式3D场景构建为2D修复模型的情境学习任务,仅需少量训练即可超越现有视频合成方法,在多项图像质量指标上表现优异。
English: This method generates immersive 3D worlds from a single image using in-context learning with 2D inpainting models, requiring minimal training and outperforming existing video-based techniques in quality metrics.
Authors:Tidiane Camaret Ndir, Robin Tibor Schirrmeister, Tonio Ball
Abstract:
Deep networks for electroencephalogram (EEG) decoding are often only trained to solve one specific task, such as pathology or age decoding. A more general task-agnostic approach is to train deep networks to match a (clinical) EEG recording to its corresponding textual medical report and vice versa. This approach was pioneered in the computer vision domain matching images and their text captions and subsequently allowed to do successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework, EEG-CLIP, that aligns the EEG time series and the descriptions of the corresponding clinical text in a shared embedding space. We investigated its potential for versatile EEG decoding, evaluating performance in a range of few-shot and zero-shot settings. Overall, we show that EEG-CLIP manages to non-trivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip
中文: 本研究提出EEG-CLIP对比学习框架,通过将脑电图数据与临床文本映射到共享嵌入空间,实现了多功能的零样本和少样本解码,为更广泛的脑电分析应用开辟了新途径。
English: This study introduces EEG-CLIP, a contrastive learning framework that aligns EEG data with clinical text in a shared embedding space, enabling versatile zero-shot and few-shot decoding for broader EEG analysis applications.
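Since EEG-CLIP follows the CLIP recipe, its training objective can be sketched as a symmetric InfoNCE loss between paired EEG-window and report embeddings. The encoders, batch construction, and temperature value below are placeholders rather than the paper's settings.

```python
# Standard symmetric InfoNCE between paired EEG-window and report embeddings;
# encoders and the temperature value are placeholders, not the paper's settings.
import torch
import torch.nn.functional as F


def clip_style_loss(eeg_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    eeg = F.normalize(eeg_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = eeg @ txt.t() / temperature                    # (B, B) similarities
    targets = torch.arange(eeg.size(0), device=eeg.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```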
Authors:Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aizan Zafar, Aritra Dutta, Mubarak Shah
Abstract:
Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. In recent days, with the tremendous progress of large multimodal models (LMMs) -- proprietary and open-source -- researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, for more specialized downstream tasks, such as geolocalization, LMMs struggle. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. Our dataset, model and codes are available.
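中文: GAEA是一个对话式图像地理定位模型,基于利用OpenStreetMap属性构建的GAEA-1.4M数据集(超过80万张图像、约140万问答对)训练,并在新提出的GAEA-Bench基准上分别以18.2%和7.2%的优势超越最佳开源与闭源多模态大模型。
English: GAEA is a conversational image geolocalization model trained on the newly constructed GAEA-1.4M dataset (over 800k images and about 1.4M question-answer pairs built from OpenStreetMap attributes) and evaluated on the GAEA-Bench benchmark, outperforming the best open-source LMM by 18.2% and the best proprietary LMM by 7.2%.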
Authors:Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Abstract:
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site.
中文摘要:MagicMotion是一种新型图像转视频框架,通过三种轨迹格式实现精确的多目标运动控制并保持视觉质量,同时配备了专用数据集和评估基准。
English Summary: MagicMotion is a novel image-to-video framework that enables precise multi-object motion control through three trajectory formats while maintaining visual quality, supported by a dedicated dataset and benchmark.
Authors:Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Abstract:
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
中文摘要:InfiniteYou(InfU)框架通过引入InfuseNet进行身份特征注入和多阶段训练策略,解决了扩散变换器中身份保持与生成质量的难题,实现了顶尖性能并具备即插即用的兼容性。
English Summary: The InfiniteYou (InfU) framework addresses identity preservation and quality issues in Diffusion Transformers by introducing InfuseNet for feature injection and a multi-stage training strategy, achieving state-of-the-art performance with plug-and-play compatibility.
Authors:Ananta R. Bhattarai, Xingzhe He, Alla Sheffer, Helge Rhodin
Abstract:
DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues -- a novel monocular reconstruction paradigm that we call Analysis by Augmentation.
Authors:Yiran Qin, Li Kang, Xiufeng Song, Zhenfei Yin, Xiaohong Liu, Xihui Liu, Ruimao Zhang, Lei Bai
Abstract:
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
中文: 本文提出组合约束和专用接口,为具身多智能体系统实现自动化数据采集,创建了RoboFactory基准,并通过模仿学习方法评估以提升系统的安全性和效率。
English: This paper introduces compositional constraints and specialized interfaces to automate data collection for embodied multi-agent systems, establishing the RoboFactory benchmark and evaluating imitation learning methods to enhance system safety and efficiency.
Authors:Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang
Abstract:
Benchmark Data Contamination (BDC)-the inclusion of benchmark testing samples in the training set-has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics-fidelity and contamination resistance-to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.
中文: 基准数据污染导致大语言模型评估性能虚高,尽管已有多种缓解策略,但我们的研究通过新指标发现这些策略均无法有效平衡保真度与抗污染能力,亟需开发更优方案。
English: Benchmark Data Contamination in LLM evaluation leads to inflated performance estimates, and while mitigation strategies exist, our study using novel metrics reveals that none effectively balance fidelity and contamination resistance, highlighting the need for better solutions.
Authors:Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Abstract:
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference -- we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data at https://github.com/zjunlp/CaKE.
中文摘要:CaKE是一种新颖的知识编辑方法,通过基于推理电路的分析,有效提升大语言模型对新知识的整合能力,在多跳推理任务中实现更高的准确性和一致性。
English Summary: CaKE is a novel knowledge editing method that improves the integration of updated knowledge into large language models by leveraging circuit-based analysis, resulting in enhanced multi-hop reasoning accuracy and efficiency.
Authors:Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Weinan E, Linfeng Zhang, Guolin Ke
Abstract:
Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.
中文: Uni-3DAR提出了一种统一的自动回归框架,通过基于八叉树的粗细粒度标记器和压缩策略,实现了跨尺度的三维生成与理解,在多种任务中大幅超越现有方法并显著提升推理速度。
English: Uni-3DAR introduces a unified autoregressive framework that uses a coarse-to-fine octree tokenizer and a novel compression strategy to enable cross-scale 3D generation and understanding, achieving significant performance improvements and faster inference speeds across diverse tasks.
Authors:Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm
Abstract:
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
中文: 本文提出了OpenMIBOOD这一专门评估医学影像中分布外检测方法的框架,发现自然图像基准不适用于医疗领域,并通过标准化基准推动可靠人工智能系统的发展。
English: This paper introduces OpenMIBOOD, a specialized framework for evaluating out-of-distribution detection methods in medical imaging, revealing that natural image benchmarks are inadequate for healthcare applications and providing standardized benchmarks to advance reliable AI systems.
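The benchmark evaluates post-hoc detectors, which score inputs from a frozen classifier's outputs and are compared with threshold-free metrics such as AUROC. The sketch below illustrates that protocol with maximum softmax probability as one representative post-hoc score; the logits are synthetic stand-ins, not OpenMIBOOD data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def max_softmax_score(logits):
    """Higher score = more in-distribution (a classic post-hoc OOD score)."""
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

# Hypothetical logits from a frozen classifier on ID and OOD samples.
rng = np.random.default_rng(0)
id_logits = rng.normal(size=(100, 10)) + np.eye(10)[rng.integers(0, 10, 100)] * 5
ood_logits = rng.normal(size=(100, 10))

scores = np.concatenate([max_softmax_score(id_logits), max_softmax_score(ood_logits)])
labels = np.concatenate([np.ones(100), np.zeros(100)])  # 1 = in-distribution
print("AUROC:", roc_auc_score(labels, scores))
```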
Authors:Quy-Anh Dang, Chris Ngo
Abstract:
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
中文: 本研究证明,强化学习能够以极低成本有效提升小型语言模型的推理能力,相比传统方法仅用少量资源就实现了显著的准确率提升。
English: This study demonstrates that reinforcement learning can efficiently enhance reasoning in small language models using minimal resources, achieving significant accuracy improvements at a fraction of the cost compared to conventional methods.
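GRPO-style training dispenses with a learned value baseline: several completions are sampled per prompt, and each completion's advantage is its reward standardized within that group. A minimal sketch of this normalization step, under assumed tensor shapes, follows.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (num_prompts, group_size) scalar rewards for sampled completions.

    Each completion's advantage is its reward standardized within its prompt's group,
    which is the baseline-free trick at the heart of GRPO-style updates.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],   # group of 4 completions for prompt 1
                        [0.0, 0.0, 1.0, 0.0]])  # group of 4 completions for prompt 2
print(group_relative_advantages(rewards))
```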
Authors:Jiwoo Son, Zhikai Zhao, Federico Berto, Chuanbo Hua, Changhyun Kwon, Jinkyoo Park
Abstract:
Vehicle Routing Problems (VRPs) are a class of NP-hard problems ubiquitous in several real-world logistics scenarios that pose significant challenges for optimization. Neural Combinatorial Optimization (NCO) has emerged as a promising alternative to classical approaches, as it can learn fast heuristics to solve VRPs. However, most research works in NCO for VRPs focus on simplified settings, which do not account for asymmetric distances and travel durations that cannot be derived by simple Euclidean distances and unrealistic data distributions, hindering real-world deployment. This work introduces RRNCO (Real Routing NCO) to bridge the gap of NCO between synthetic and real-world VRPs in the critical aspects of both data and modeling. First, we introduce a new, openly available dataset with real-world data containing a diverse dataset of locations, distances, and duration matrices from 100 cities, considering realistic settings with actual routing distances and durations obtained from Open Source Routing Machine (OSRM). Second, we propose a novel approach that efficiently processes both node and edge features through contextual gating, enabling the construction of more informed node embedding, and we finally incorporate an Adaptation Attention Free Module (AAFM) with neural adaptive bias mechanisms that effectively integrates not only distance matrices but also angular relationships between nodes, allowing our model to capture rich structural information. RRNCO achieves state-of-the-art results in real-world VRPs among NCO methods. We make our dataset and code publicly available at https://github.com/ai4co/real-routing-nco.
中文摘要:本文提出的RRNCO方法通过采用真实世界数据集和新型建模技术,填补了神经组合优化在合成与实际车辆路径问题之间的差距,并实现了最先进的性能表现。
English Summary: This paper introduces RRNCO, a Neural Combinatorial Optimization approach that bridges the gap between synthetic and real-world Vehicle Routing Problems by using realistic datasets and novel modeling techniques to achieve state-of-the-art performance.
Authors:Abdullah Mamun, Diane J. Cook, Hassan Ghasemzadeh
Abstract:
Adherence to prescribed treatments is crucial for individuals with chronic conditions to avoid costly or adverse health outcomes. For certain patient groups, intensive lifestyle interventions are vital for enhancing medication adherence. Accurate forecasting of treatment adherence can open pathways to developing an on-demand intervention tool, enabling timely and personalized support. With the increasing popularity of smartphones and wearables, it is now easier than ever to develop and deploy smart activity monitoring systems. However, effective forecasting systems for treatment adherence based on wearable sensors are still not widely available. We close this gap by proposing Adherence Forecasting and Intervention with Machine Intelligence (AIMI). AIMI is a knowledge-guided adherence forecasting system that leverages smartphone sensors and previous medication history to estimate the likelihood of forgetting to take a prescribed medication. A user study was conducted with 27 participants who took daily medications to manage their cardiovascular diseases. We designed and developed CNN and LSTM-based forecasting models with various combinations of input features and found that LSTM models can forecast medication adherence with an accuracy of 0.932 and an F-1 score of 0.936. Moreover, through a series of ablation studies involving convolutional and recurrent neural network architectures, we demonstrate that leveraging known information about the future and personalized training enhances the accuracy of medication adherence forecasting. Code available: https://github.com/ab9mamun/AIMI.
中文: 本研究提出AIMI系统,通过智能手机传感器和用药记录,基于LSTM模型实现对心血管患者服药依从性的精准预测,在实验中取得了优异性能。
English: The study introduces AIMI, a machine intelligence system that uses smartphone sensors and medication history to accurately forecast medication adherence, achieving high precision with LSTM models in a cardiovascular patient trial.
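The best-performing forecaster is reported as an LSTM over sensor windows producing a binary adherence prediction. A minimal PyTorch sketch with assumed feature and window sizes (not the AIMI feature set or architecture details) is shown below.

```python
import torch
import torch.nn as nn

class AdherenceLSTM(nn.Module):
    """Binary adherence forecaster: sensor sequence -> probability of taking the medication."""

    def __init__(self, n_features=16, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time_steps, n_features)
        _, (h_n, _) = self.lstm(x)        # final hidden state summarizes the window
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)

model = AdherenceLSTM()
window = torch.randn(4, 48, 16)           # e.g. 48 time steps of assumed sensor features
print(model(window))                      # adherence probabilities per sample
```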
Authors:Junho Kim, Gwangtak Bae, Eun Sun Lee, Young Min Kim
Abstract:
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As data-driven learning is intractable to comprehensively encapsulate diverse ranges of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications. Project page including the code is available through this link: https://82magnolia.github.io/3d_scene_analogies/.
Authors:Zhiyu An, Zhibo Hou, Wan Du
Abstract:
We study aleatoric and epistemic uncertainty estimation in a learned regressive system dynamics model. Disentangling aleatoric uncertainty (the inherent randomness of the system) from epistemic uncertainty (the lack of data) is crucial for downstream tasks such as risk-aware control and reinforcement learning, efficient exploration, and robust policy transfer. While existing approaches like Gaussian Processes, Bayesian networks, and model ensembles are widely adopted, they suffer from either high computational complexity or inaccurate uncertainty estimation. To address these limitations, we propose the Compressed Data Representation Model (CDRM), a framework that learns a neural network encoding of the data distribution and enables direct sampling from the output distribution. Our approach incorporates a novel inference procedure based on Langevin dynamics sampling, allowing CDRM to predict arbitrary output distributions rather than being constrained to a Gaussian prior. Theoretical analysis provides the conditions where CDRM achieves better memory and computational complexity compared to bin-based compression methods. Empirical evaluations show that CDRM demonstrates a superior capability to identify aleatoric and epistemic uncertainties separately, achieving AUROCs of 0.8876 and 0.9981 on a single test set containing a mixture of both uncertainties. Qualitative results further show that CDRM's capability extends to datasets with multimodal output distributions, a challenging scenario where existing methods consistently fail. Code and supplementary materials are available at https://github.com/ryeii/CDRM.
Chinese: 本文提出压缩数据表示模型(CDRM),该框架通过神经网络编码和朗之万动力学采样,能精确区分系统动力学模型中的偶然不确定性和认知不确定性,在计算效率与不确定性估计方面均优于现有方法。
English: This paper introduces the Compressed Data Representation Model (CDRM), a framework that accurately distinguishes aleatoric and epistemic uncertainties in system dynamics models through neural network encoding and Langevin dynamics sampling, outperforming existing methods in both computational efficiency and uncertainty estimation.
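The inference procedure is based on Langevin dynamics, i.e., noisy gradient steps on a learned log-density, which is what allows sampling from non-Gaussian and multimodal output distributions. A generic sketch follows; the toy log-density stands in for CDRM's learned encoding and is not the paper's model.

```python
import torch

def langevin_sample(log_prob_fn, y_init, n_steps=200, step_size=1e-2):
    """Unadjusted Langevin dynamics: y <- y + (step/2) * grad log p(y) + sqrt(step) * noise."""
    y = y_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        logp = log_prob_fn(y).sum()
        grad, = torch.autograd.grad(logp, y)
        with torch.no_grad():
            y = y + 0.5 * step_size * grad + torch.randn_like(y) * step_size ** 0.5
        y.requires_grad_(True)
    return y.detach()

# Toy stand-in density: a 1-D mixture of two Gaussians, mimicking a multimodal output.
def toy_log_prob(y):
    return torch.logsumexp(torch.stack([-(y - 2) ** 2 / 0.5, -(y + 2) ** 2 / 0.5]), dim=0)

samples = langevin_sample(toy_log_prob, torch.randn(1000, 1))
print(samples.mean().item(), samples.std().item())   # roughly symmetric, spread around +/-2
```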
Authors:Joanikij Chulev, Angela Mladenovska
Abstract:
Clustering high-dimensional data is a critical challenge in machine learning due to the curse of dimensionality and the presence of noise. Traditional clustering algorithms often fail to capture the intrinsic structures in such data. This paper explores a combination of clustering methods, which we call Line Space Clustering (LSC), a representation that transforms data points into lines in a newly defined feature space, enabling clustering based on the similarity of feature value patterns, essentially treating features as sequences. LSC employs a combined distance metric that uses Euclidean and Dynamic Time Warping (DTW) distances, weighted by a parameter α, allowing flexibility in emphasizing shape or magnitude similarities. We delve deeply into the mechanics of DTW and the Savitzky-Golay filter, explaining their roles in the algorithm. Extensive experiments demonstrate the efficacy of LSC on synthetic and real-world datasets, showing that methods optimized for time series can, perhaps surprisingly, work well on complex datasets, particularly in noisy environments.
Source code and experiments are available at: https://github.com/JoanikijChulev/LSC.
中文: 本文提出线空间聚类(LSC)新方法,通过将高维数据转换为线条并采用欧几里得-DTW混合距离度量,在噪声环境中有效捕捉特征模式的内在结构,实现精准聚类。
English: This paper introduces Line Space Clustering (LSC), a novel method that transforms high-dimensional data into lines and uses a combined Euclidean-DTW distance metric to effectively cluster noisy datasets by capturing intrinsic feature patterns.
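The combined metric blends a magnitude term and a shape term, d(x, y) = α·d_Euclid(x, y) + (1 − α)·d_DTW(x, y). The NumPy sketch below implements this blend with a textbook dynamic-programming DTW; the value of α and which term α weights are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def lsc_distance(x, y, alpha=0.5):
    """Weighted blend of magnitude (Euclidean) and shape (DTW) dissimilarity."""
    euclid = np.linalg.norm(x - y)
    return alpha * euclid + (1.0 - alpha) * dtw_distance(x, y)

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # data points treated as feature sequences
y = np.array([1.0, 1.5, 3.2, 2.1, 0.9])
print(lsc_distance(x, y, alpha=0.7))
```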
Authors:Luc McCutcheon, Bahman Gharesifard, Saber Fallah
Abstract:
Control Lyapunov functions are traditionally used to design a controller which ensures convergence to a desired state, yet deriving these functions for nonlinear systems remains a complex challenge. This paper presents a novel, sample-efficient method for neural approximation of nonlinear Lyapunov functions, leveraging self-supervised Reinforcement Learning (RL) to enhance training data generation, particularly for inaccurately represented regions of the state space. The proposed approach employs a data-driven World Model to train Lyapunov functions from off-policy trajectories. The method is validated on both standard and goal-conditioned robotic tasks, demonstrating faster convergence and higher approximation accuracy compared to the state-of-the-art neural Lyapunov approximation baseline. The code is available at: https://github.com/CAV-Research-Lab/SACLA.git
中文: 本文提出了一种样本高效的神经网络方法,通过自监督强化学习和数据驱动世界模型来近似非线性李雅普诺夫函数,在机器人任务中相比现有基准实现了更快的收敛速度和更高的近似精度。
English: This paper introduces a sample-efficient neural method for approximating nonlinear Lyapunov functions using self-supervised reinforcement learning and a data-driven World Model, achieving superior convergence speed and accuracy in robotic tasks compared to existing baselines.
Authors:Martin Ritzert, Polina Turishcheva, Laura Hansel, Paul Wollenhaupt, Marissa A. Weis, Alexander S. Ecker
Abstract:
Hierarchical clustering is an effective, interpretable method for analyzing structure in data. It reveals insights at multiple scales without requiring a predefined number of clusters and captures nested patterns and subtle relationships, which are often missed by flat clustering approaches. However, existing hierarchical clustering methods struggle with high-dimensional data, especially when there are no clear density gaps between modes. In this work, we introduce t-NEB, a probabilistically grounded hierarchical clustering method, which yields state-of-the-art clustering performance on naturalistic high-dimensional data. t-NEB consists of three steps: (1) density estimation via overclustering; (2) finding maximum density paths between clusters; (3) creating a hierarchical structure via bottom-up cluster merging. t-NEB uses a probabilistic parametric density model for both overclustering and cluster merging, which yields both high clustering performance and a meaningful hierarchy, making it a valuable tool for exploratory data analysis. Code is available at https://github.com/ecker-lab/tneb clustering.
Chinese: t-NEB方法是一种基于概率的分层聚类技术,通过密度估计和自底向上合并克服了现有方法处理高维数据的局限,实现了最优聚类性能并构建出有意义的层次结构。
English: The t-NEB method is a probabilistically grounded hierarchical clustering technique that overcomes the limitations of existing methods with high-dimensional data by using density estimation and bottom-up merging to achieve state-of-the-art performance and meaningful hierarchies.
Authors:Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza
Abstract:
Passive Acoustic Monitoring is a key tool for biodiversity conservation, but the large volumes of unsupervised audio it generates present major challenges for extracting meaningful information. Deep Learning offers promising solutions. BirdNET, a widely used bird identification model, has shown success in many study systems but is limited at local scale due to biases in its training data, which focus on specific locations and target sounds rather than entire soundscapes. A key challenge in bird species identification is that many recordings either lack target species or contain overlapping vocalizations, complicating automatic identification. To address these problems, we developed a multi-stage pipeline for automatic bird vocalization identification in Doñana National Park (SW Spain), a wetland of high conservation concern. We deployed AudioMoth recorders in three main habitats across nine locations and manually annotated 461 minutes of audio, resulting in 3749 labeled segments spanning 34 classes. We first applied a Bird Song Detector to isolate bird vocalizations using spectrogram-based image processing. Then, species were classified using custom models trained at the local scale. Applying the Bird Song Detector before classification improved species identification, as all models performed better when analyzing only the segments where birds were detected. Specifically, the combination of detector and fine-tuned BirdNET outperformed the baseline without detection. This approach demonstrates the effectiveness of integrating a Bird Song Detector with local classification models. These findings highlight the need to adapt general-purpose tools to specific ecological challenges. Automatically detecting bird species helps track the health of this threatened ecosystem, given birds' sensitivity to environmental change, and supports conservation planning to reduce biodiversity loss.
中文: 针对多尼亚纳国家公园开发的多阶段识别流程,通过结合鸟类鸣叫检测器与本地优化的分类模型,显著提升了鸟类自动识别的准确性,凸显了针对具体生态挑战定制化通用工具对于有效保护生物多样性的重要性。
English: A multi-stage pipeline integrating a Bird Song Detector with locally fine-tuned classification models was developed to enhance automatic bird species identification in Doñana National Park, demonstrating improved performance over general tools by adapting to specific ecological challenges for effective biodiversity conservation.
Authors:NVIDIA, :, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Maosheng Liao, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Xiangyu Lu, Alice Luo, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Dinghao Yang, Xiaodong Yang, Zhuolin Yang, Jingxu Zhang, Xiaohui Zeng, Zhe Zhang
Abstract:
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
中文:Cosmos-Reason1模型通过长链思维推理过程理解物理世界并生成具身决策,利用分层和二维本体表示物理常识和具身推理,经过监督微调和强化学习训练后展现出显著性能提升。
English: The Cosmos-Reason1 models are designed to understand the physical world and generate embodied decisions through long reasoning processes, utilizing hierarchical and two-dimensional ontologies for physical common sense and embodied reasoning, with significant improvements shown through supervised fine-tuning and reinforcement learning.
Authors:Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
Abstract:
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
Chinese: 基于人类反馈的强化学习(RLHF)的成功不仅取决于奖励模型的准确性,还要求其能产生足够的奖励方差,因为低方差会导致优化景观平坦和学习缓慢,即使模型准确性很高。
English: The effectiveness of Reinforcement Learning from Human Feedback (RLHF) relies not only on the accuracy of the reward model but also on its ability to induce sufficient reward variance, as low variance can lead to a flat optimization landscape and slow learning, even with high accuracy.
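The key quantity in this analysis is the variance of the reward over completions the policy itself samples for a prompt; when that variance is low, the RLHF objective flattens even if the reward model ranks pairs accurately. One simple way to estimate it empirically, with a hypothetical reward function interface, is sketched below.

```python
import torch

def mean_reward_variance(reward_fn, completions_per_prompt):
    """Estimate per-prompt reward variance under the current policy's samples.

    completions_per_prompt: list of lists of completions sampled from the policy.
    reward_fn: maps a completion to a scalar reward (hypothetical interface).
    """
    variances = []
    for completions in completions_per_prompt:
        rewards = torch.tensor([reward_fn(c) for c in completions], dtype=torch.float32)
        variances.append(rewards.var(unbiased=False).item())
    return sum(variances) / len(variances)

# Toy reward model: accurate ranking but almost no spread -> flat RLHF landscape.
flat_reward = lambda c: 1.0 if len(c) > 3 else 0.99
samples = [["aaaa", "bbbbb", "cc", "dddddd"], ["eee", "ffffff", "gggg", "hh"]]
print("mean reward variance:", mean_reward_variance(flat_reward, samples))
```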
Authors:Ananya Garg, Mohmmad Ayaan, Swara Parekh, Vikranth Udandarao
Abstract:
Accurate prediction of food delivery times significantly impacts customer satisfaction, operational efficiency, and profitability in food delivery services. However, existing studies primarily utilize static historical data and often overlook dynamic, real-time contextual factors crucial for precise prediction, particularly in densely populated Indian cities. This research addresses these gaps by integrating real-time contextual variables such as traffic density, weather conditions, local events, and geospatial data (restaurant and delivery location coordinates) into predictive models. We systematically compare various machine learning algorithms, including Linear Regression, Decision Trees, Bagging, Random Forest, XGBoost, and LightGBM, on a comprehensive food delivery dataset specific to Indian urban contexts. Rigorous data preprocessing and feature selection significantly enhanced model performance. Experimental results demonstrate that the LightGBM model achieves superior predictive accuracy, with an R2 score of 0.76 and Mean Squared Error (MSE) of 20.59, outperforming traditional baseline approaches. Our study thus provides actionable insights for improving logistics strategies in complex urban environments. The complete methodology and code are publicly available for reproducibility and further research.
Chinese: 本研究通过整合交通、天气等实时因素与机器学习算法,提升了印度城市外卖送达时间的预测准确性,其中LightGBM模型以R²得分0.76表现最优,为复杂城市环境下的物流优化提供了可行方案。
English: This study enhances food delivery time prediction in Indian cities by integrating real-time factors like traffic and weather with machine learning, where LightGBM achieved the highest accuracy with an R² score of 0.76, offering practical logistics improvements.
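A minimal sketch of the reported best setup, a LightGBM regressor evaluated with R² and MSE, is given below on synthetic data; the four features stand in for the real-time contextual variables and are not the paper's feature set or tuned hyperparameters.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical stand-in features: traffic density, temperature, distance (km), hour of day.
rng = np.random.default_rng(0)
X = rng.random((2000, 4))
y = 10 + 25 * X[:, 0] + 8 * X[:, 2] + rng.normal(0, 2, 2000)   # synthetic delivery minutes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("R2:", r2_score(y_test, pred), "MSE:", mean_squared_error(y_test, pred))
```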
Authors:Àlex Pujol Vidal, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund
Abstract:
Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC
中文: 本文提出一种针对双曲对比学习模型的机器遗忘方法,通过双曲空间特有的技术重组语义层次结构,在保持保留概念性能的同时实现了更优越的概念消除效果。
English: This paper introduces a machine unlearning method for hyperbolic contrastive learning models, demonstrating superior concept removal through hyperbolic-specific techniques that reorganize semantic hierarchies while maintaining performance on retained concepts.
Authors:Alejandro Almodóvar, Adrián Javaloy, Juan Parras, Santiago Zazo, Isabel Valera
Abstract:
We introduce DeCaFlow, a deconfounding causal generative model. Training once per dataset using just observational data and the underlying causal graph, DeCaFlow enables accurate causal inference on continuous variables under the presence of hidden confounders. Specifically, we extend previous results on causal estimation under hidden confounding to show that a single instance of DeCaFlow provides correct estimates for all causal queries identifiable with do-calculus, leveraging proxy variables to adjust for the causal effects when do-calculus alone is insufficient. Moreover, we show that counterfactual queries are identifiable as long as their interventional counterparts are identifiable, and thus are also correctly estimated by DeCaFlow. Our empirical results on diverse settings (including the Ecoli70 dataset, with 3 independent hidden confounders, tens of observed variables and hundreds of causal queries) show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box applicability to any given causal graph. An implementation can be found in https://github.com/aalmodovares/DeCaFlow
中文: DeCaFlow是一种解混因果生成模型,通过单次训练即可基于观测数据准确估计所有可识别的因果与反事实查询,在多种场景下均优于现有方法。
English: DeCaFlow is a deconfounding causal generative model that accurately estimates all identifiable causal and counterfactual queries from observational data using a single training instance, outperforming existing methods across diverse settings.
Authors:Matthew Low, Arian Prabowo, Hao Xue, Flora Salim
Abstract:
Urban traffic forecasting is a commonly encountered problem, with wide-ranging applications in fields such as urban planning, civil engineering and transport. In this paper, we study the enhancement of traffic forecasting with pre-training, focusing on spatio-temporal graph methods. While various machine learning methods to solve traffic forecasting problems have been explored and extensively studied, there remains a gap for a more contextual approach: studying how relevant non-traffic data can improve prediction performance on traffic forecasting problems. We call this data spatial context. We introduce a novel method of combining road and traffic information through the notion of a traffic quotient graph, a quotient graph formed from road geometry and traffic sensors. We also define a way to encode this relationship in the form of a geometric encoder, pre-trained using contrastive learning methods and enhanced with OpenStreetMap data. We introduce and discuss ways to integrate this geometric encoder with existing graph neural network (GNN)-based traffic forecasting models, using a contrastive pre-training paradigm. We demonstrate the potential for this hybrid model to improve generalisation and performance with zero additional traffic data. Code for this paper is available at https://github.com/mattchrlw/forecasting-on-new-roads.
中文: 本文提出了一种新颖的交通预测方法,通过结合道路几何和开放街道地图数据预训练几何编码器来增强时空图模型,无需额外交通数据即可提升预测性能。
English: This paper introduces a novel traffic forecasting method that enhances spatio-temporal graph models by pre-training a geometric encoder with road geometry and OpenStreetMap data, improving performance without requiring additional traffic data.
Authors:Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
Abstract:
Multimodal Contrastive Learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space. By leveraging contrastive learning across diverse modalities, large-scale multimodal data enhances representational quality. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. Instead, emergent multimodal data can be used to optimize existing models gradually, i.e., models are trained on a sequence of modality pair data. We define this problem as Continual Multimodal Contrastive Learning (CMCL), an underexplored yet crucial research direction at the intersection of multimodal and continual learning. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge. Two upper bounds provide theoretical insights on both stability and plasticity in our solution. Beyond our theoretical contributions, we conduct experiments on multiple datasets by comparing our method against advanced continual learning baselines. The empirical results further support our claims and demonstrate the efficacy of our method. Our codes are available at https://github.com/Xiaohao-Liu/CMCL.
中文: 本文提出持续多模态对比学习(CMCL),通过一种新颖的优化方法解决序列多模态数据的高效模型更新难题,该方法在保持稳定性和可塑性的同时,得到了理论分析和实验验证的双重支持。
English: This paper introduces Continual Multimodal Contrastive Learning (CMCL), addressing the challenge of efficiently updating models with sequential multimodal data through a novel optimization method that ensures stability and plasticity, supported by both theoretical analysis and experimental validation.
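The optimization idea is to project each new gradient onto a subspace orthogonal to directions associated with previously learned modality pairs, so updates cannot interfere with earlier knowledge. The sketch below shows a generic orthogonal projection of this kind; how the protected directions are constructed in the paper is more involved than the stored matrix assumed here.

```python
import torch

def project_out(grad, old_directions):
    """Remove from `grad` its components along previously stored directions.

    grad: (d,) flattened gradient of the current step.
    old_directions: (k, d) matrix whose rows span the 'protected' subspace
                    (orthonormalized here via QR for numerical stability).
    """
    if old_directions.numel() == 0:
        return grad
    Q, _ = torch.linalg.qr(old_directions.t())     # (d, k) orthonormal basis
    return grad - Q @ (Q.t() @ grad)

d = 10
old = torch.randn(3, d)                            # directions from earlier modality pairs
g = torch.randn(d)
g_proj = project_out(g, old)
print((old @ g_proj).abs().max().item())           # ~0: no interference with stored directions
```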
Authors:Kevin Wang, Ishaan Javali, MichaÅ Bortkiewicz, Tomasz TrzciÅski, Benjamin Eysenbach
Abstract:
Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 - 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance by $2\times$ - $50\times$. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned.
Authors:Yuhang Liu, Wenjie Zhao, Yunhui Guo
Abstract:
Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at https://github.com/YuhangLiuu/H2ST.
中文: 本文提出分层双样本检验(H2ST)这一新型持续离群检测方法,通过假设检验避免阈值选择,利用特征映射实现任务级识别,在开放世界任务增量学习场景中展现出优于现有方法的性能与更低开销。
English: This paper introduces Hierarchical Two-sample Tests (H2ST), a novel continual out-of-distribution detection method that addresses challenges in open-world Task Incremental Learning by eliminating threshold selection through hypothesis testing and enabling task-level identification with superior performance and lower overhead.
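The detector replaces thresholded scores with a two-sample test on feature maps. As a stand-in for the paper's hierarchical classifier tests, the sketch below runs a permutation test on the maximum mean discrepancy (MMD) between stored in-distribution features and an incoming batch; the kernel, sample sizes, and features are assumptions for illustration.

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased squared MMD with an RBF kernel between two feature sets."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def permutation_two_sample_test(X, Y, n_perm=100, seed=0):
    """p-value for H0: X and Y come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = rbf_mmd2(X, Y)
    pooled = np.vstack([X, Y])
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        Xp, Yp = pooled[idx[: len(X)]], pooled[idx[len(X):]]
        count += rbf_mmd2(Xp, Yp) >= observed
    return (count + 1) / (n_perm + 1)

id_features = np.random.randn(60, 8)               # stored in-distribution feature maps
incoming = np.random.randn(60, 8) + 1.5            # shifted batch -> should reject H0
print("p-value:", permutation_two_sample_test(id_features, incoming))
```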
Authors:Jeff Jewett, Sandhya Saisubramanian
Abstract:
Decision-making in complex, continuous multi-task environments is often hindered by the difficulty of obtaining accurate models for planning and the inefficiency of learning purely from trial and error. While precise environment dynamics may be hard to specify, human experts can often provide high-fidelity abstractions that capture the essential high-level structure of a task and user preferences in the target environment. Existing hierarchical approaches often target discrete settings and do not generalize across tasks. We propose a hierarchical reinforcement learning approach that addresses these limitations by dynamically planning over the expert-specified abstraction to generate subgoals to learn a goal-conditioned policy. To overcome the challenges of learning under sparse rewards, we shape the reward based on the optimal state value in the abstract model. This structured decision-making process enhances sample efficiency and facilitates zero-shot generalization. Our empirical evaluation on a suite of procedurally generated continuous control environments demonstrates that our approach outperforms existing hierarchical reinforcement learning methods in terms of sample efficiency, task completion rate, scalability to complex tasks, and generalization to novel scenarios.
中文摘要:本文提出一种分层强化学习方法,利用专家提供的抽象模型规划子目标并重塑奖励机制,在连续控制任务中显著提升样本效率并实现零样本泛化能力。
English Summary: This paper introduces a hierarchical reinforcement learning method that leverages expert-provided abstractions to plan subgoals and shape rewards, improving sample efficiency and enabling zero-shot generalization in continuous control tasks.
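The reward shaping follows the standard potential-based form, with the potential taken to be the optimal state value in the expert-specified abstract model. A minimal sketch is below; abstract_value and to_abstract are assumed interfaces, not the paper's implementation.

```python
def shaped_reward(reward, state, next_state, abstract_value, to_abstract, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * V(phi(s')) - V(phi(s)).

    abstract_value: mapping from abstract state to its optimal value (assumed interface).
    to_abstract: maps a concrete continuous state to its abstract state (assumed interface).
    Potential-based shaping preserves the optimal policy while densifying sparse rewards.
    """
    phi_s = abstract_value[to_abstract(state)]
    phi_next = abstract_value[to_abstract(next_state)]
    return reward + gamma * phi_next - phi_s

# Toy abstract model: three rooms, value grows toward the goal room.
abstract_value = {"room_A": 0.0, "room_B": 0.5, "goal": 1.0}
to_abstract = lambda s: s["room"]
print(shaped_reward(0.0, {"room": "room_A"}, {"room": "room_B"}, abstract_value, to_abstract))
```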
Authors:Fatemeh Dehrouyeh, Ibrahim Shaer, Soodeh Nikan, Firouz Badrkhani Ajaei, Abdallah Shami
Abstract:
With the growing need for real-time processing on IoT devices, optimizing machine learning (ML) models' size, latency, and computational efficiency is essential. This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting Electric Vehicle Charging Infrastructure (EVCI). Using the CICEVSE2024 dataset, we trained and optimized three models-Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), and XGBoost-through hyperparameter tuning with Optuna, further refining them using SHapley Additive exPlanations (SHAP)-based feature selection (FS) and unstructured pruning techniques. The optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
中文: 本研究通过采用基于SHAP的特征选择和剪枝技术,开发了针对电动汽车充电基础设施异常检测的优化机器学习模型,在性能损失极小的前提下显著减小了模型规模并缩短了推理时间。
English: This study develops optimized machine learning models for anomaly detection in Electric Vehicle Charging Infrastructure by employing SHAP-based feature selection and pruning techniques, achieving substantial reductions in model size and inference time with minimal performance loss.
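Unstructured pruning of this kind can be applied post-training with PyTorch's pruning utilities. The sketch below prunes the linear layers of a stand-in MLP by L1 magnitude; the 50% amount and layer sizes are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in MLP for the anomaly-detection model (layer sizes are illustrative).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # zero out 50% smallest weights
        prune.remove(module, "weight")                            # make the sparsity permanent

zeros = sum((m.weight == 0).sum().item() for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"global weight sparsity: {zeros / total:.2%}")
```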
Authors:Jake Fawkes, Michael O'Riordan, Athanasios Vlontzos, Oriol Corcoll, Ciarán Mark Gilligan-Lee
Abstract:
Observational data is often readily available in large quantities, but can lead to biased causal effect estimates due to the presence of unobserved confounding. Recent works attempt to remove this bias by supplementing observational data with experimental data, which, when available, is typically on a smaller scale due to the time and cost involved in running a randomised controlled trial. In this work, we prove a theorem that places fundamental limits on this "best of both worlds" approach. Using the framework of impossible inference, we show that although it is possible to use experimental data to falsify causal effect estimates from observational data, in general it is not possible to validate such estimates. Our theorem proves that while experimental data can be used to detect bias in observational studies, without additional assumptions on the smoothness of the correction function, it can not be used to remove it. We provide a practical example of such an assumption, developing a novel Gaussian Process based approach to construct intervals which contain the true treatment effect with high probability, both inside and outside of the support of the experimental data. We demonstrate our methodology on both simulated and semi-synthetic datasets and make the code available at https://github.com/Jakefawkes/Obs_and_exp_data.
中文: 本研究证明,尽管实验数据能够识别观察性因果估计中的偏差,但若无额外假设通常无法验证或修正这些估计,并提出了一种高斯过程方法来高概率地估计处理效应。
English: This study demonstrates that while experimental data can identify bias in observational causal estimates, it generally cannot validate or correct them without additional assumptions, proposing a Gaussian Process method to estimate treatment effects with high probability.
Authors:Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis
Abstract:
How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein's redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at https://github.com/amathislab/Kinesis.
中文: 本研究提出的KINESIS无模型框架通过精确的肌肉骨骼系统成功模拟人体运动,不仅能通过自然语言控制生成与肌电信号高度吻合的肌肉活动模式,还为探索运动控制理论提供了新途径。
English: This study introduces KINESIS, a model-free framework that effectively imitates human motion using a detailed musculoskeletal model, demonstrates controllability via natural language, and generates physiologically plausible muscle activity patterns for advancing motor control research.
Authors:Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers
Abstract:
The capacity of a foundation model allows for adaptation to new downstream tasks. Weight imprinting is a universal and efficient method to fulfill this purpose. It has been reinvented several times, but it has not been systematically studied. In this paper, we propose a framework for imprinting, identifying three main components: generation, normalization, and aggregation. This allows us to conduct an in-depth analysis of imprinting and a comparison of the existing work. We reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. We determine proxies through clustering and propose a novel variant of imprinting that outperforms previous work. We motivate this by the neural collapse phenomenon -- an important connection that we can draw for the first time. Our results show an increase of up to 4% in challenging scenarios with complex data distributions for new classes. Finally, we publicly release our code at https://github.com/DATEXIS/multi-imprinting/.
中文: 基础模型能够通过权重印记高效适应新任务,本文系统分析了该方法并提出了一个框架,揭示了使用多个代理和适当归一化的优势,在复杂场景下性能提升高达4%。
English: Foundation models can efficiently adapt to new tasks through weight imprinting, a method systematically analyzed in this paper, which introduces a framework revealing the benefits of multiple proxies and proper normalization, achieving up to 4% improvement in complex scenarios.
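In the generation/normalization/aggregation view, imprinting builds classifier weights directly from embeddings of the novel class, here with several proxies per class obtained by clustering, which mirrors the multi-proxy idea only at a high level. The backbone, dimensions, and the max-over-proxies aggregation in the sketch below are assumptions for illustration, not the paper's variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def imprint_class_proxies(embeddings, n_proxies=3, seed=0):
    """Generation + normalization: cluster a novel class's embeddings into proxies
    and L2-normalize each proxy so it can act as a classifier weight row."""
    km = KMeans(n_clusters=n_proxies, n_init=10, random_state=seed).fit(embeddings)
    proxies = km.cluster_centers_
    return proxies / np.linalg.norm(proxies, axis=1, keepdims=True)

def classify(query_emb, class_proxies):
    """Aggregation step (assumed here as max): cosine similarity to each class's nearest proxy."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = {c: (P @ q).max() for c, P in class_proxies.items()}
    return max(scores, key=scores.get)

# Hypothetical frozen-backbone embeddings for two novel classes.
rng = np.random.default_rng(0)
class_proxies = {
    "class_a": imprint_class_proxies(rng.normal(0.0, 1.0, (50, 16))),
    "class_b": imprint_class_proxies(rng.normal(3.0, 1.0, (50, 16))),
}
print(classify(rng.normal(3.0, 1.0, 16), class_proxies))
```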
Authors:Guowei Wang, Changxing Ding
Abstract:
Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.
中文: 本文提出了一种轻松主动标注策略,通过特征扰动识别每个批次中最有价值的样本进行标注,并利用动态权重平衡标注与未标注样本对模型优化的影响,在多个数据集上以极低标注成本实现了优于现有方法的性能。
English: This paper introduces an effortless active labeling strategy for long-term test-time adaptation that selects at most one sample per batch for annotation based on feature perturbation and balances optimization impact with dynamic weights, achieving superior performance with minimal annotation costs across multiple datasets.
Authors:Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour
Abstract:
Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
中文摘要:本文挑战了联邦学习中标签分布偏斜在捕捉计算机视觉任务中真实数据异质性方面的不足,提出了基于嵌入的异质性概念和新基准,揭示了现有文献对联邦学习性能的高估。
English Summary: This paper challenges the adequacy of label distribution skew for capturing real-world data heterogeneity in federated learning for computer vision tasks, proposing embedding-based heterogeneity and new benchmarks that reveal performance overestimations in existing literature.
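The embedding-based partitioning described above can be sketched as follows, assuming a frozen encoder has already produced the embeddings; the cluster count and Dirichlet concentration values are illustrative.
```python
# Minimal sketch of embedding-based heterogeneity (assumptions: KMeans clustering,
# per-cluster Dirichlet split; not the authors' exact partitioning code).
import numpy as np
from sklearn.cluster import KMeans

def dirichlet_partition(embeddings, num_clients=10, num_clusters=20, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    cluster_ids = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_clusters):
        idx = np.where(cluster_ids == c)[0]
        rng.shuffle(idx)
        # Dirichlet proportions decide how much of this cluster each client receives;
        # smaller alpha -> more skewed (heterogeneous) assignment.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return client_indices

# Example: partition 1,000 synthetic 64-d embeddings across 5 clients.
parts = dirichlet_partition(np.random.randn(1000, 64), num_clients=5)
print([len(p) for p in parts])
```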
Authors:Yanjia Huang, Renjie Li, Zhengzhong Tu
Abstract:
We present PANDORA, a novel diffusion-based policy learning framework designed specifically for dexterous robotic piano performance. Our approach employs a conditional U-Net architecture enhanced with FiLM-based global conditioning, which iteratively denoises noisy action sequences into smooth, high-dimensional trajectories. To achieve precise key execution coupled with expressive musical performance, we design a composite reward function that integrates task-specific accuracy, audio fidelity, and high-level semantic feedback from a large language model (LLM) oracle. The LLM oracle assesses musical expressiveness and stylistic nuances, enabling dynamic, hand-specific reward adjustments. Further augmented by a residual inverse-kinematics refinement policy, PANDORA achieves state-of-the-art performance in the ROBOPIANIST environment, significantly outperforming baselines in both precision and expressiveness. Ablation studies validate the critical contributions of diffusion-based denoising and LLM-driven semantic feedback in enhancing robotic musicianship. Videos available at: https://taco-group.github.io/PANDORA
Authors:Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz
Abstract:
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.
Authors:Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang
Abstract:
Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to Deepseek-R1. Our codes are available at https://github.com/jcguo123/Temporal-Consistency
中文: 本文提出了一种时序一致性方法,通过迭代性自我反思提升数学推理验证效果,在多个基准测试中表现优异,使小型蒸馏模型性能超越包括GPT-4o在内的大型模型。
English: This paper introduces a temporal consistency method that enhances mathematical reasoning verification through iterative self-reflection, achieving superior performance on multiple benchmarks and enabling smaller distilled models to outperform larger ones, including GPT-4o.
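A minimal sketch of the iterative verification loop follows, assuming a hypothetical call_verifier LLM wrapper and a simple stabilization rule; the paper's exact stopping criterion may differ.
```python
# Sketch of temporal-consistency verification: the verifier re-judges a solution step
# conditioned on its previous assessment, and we stop once the judgment stabilizes.
# `call_verifier` is a hypothetical LLM wrapper; the stability rule is an assumption.
def call_verifier(problem: str, step: str, previous_judgment: str | None) -> str:
    raise NotImplementedError("plug in your LLM client here")

def temporally_consistent_verdict(problem: str, step: str,
                                  max_rounds: int = 8, patience: int = 3) -> str:
    history: list[str] = []
    for _ in range(max_rounds):
        judgment = call_verifier(problem, step, history[-1] if history else None)
        history.append(judgment)
        # Accept once the last `patience` judgments agree (temporal consistency).
        if len(history) >= patience and len(set(history[-patience:])) == 1:
            return judgment
    # Fall back to a majority vote over the whole trajectory if no convergence.
    return max(set(history), key=history.count)
```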
Authors:NVIDIA, :, Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Xinglong Sun, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng
Abstract:
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
中文:Cosmos-Transfer 是一种条件性世界生成模型,通过可自定义的空间输入(如分割和深度)实现高度可控的模拟,应用于机器人和自动驾驶领域,并已开源供研究使用。
English: Cosmos-Transfer is a conditional world generation model that uses customizable spatial inputs like segmentation and depth to enable highly controllable simulations, with applications in robotics and autonomous vehicles, and it is open-sourced for research.
Authors:Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
Abstract:
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset.
To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
中文: RWKV-7 "Goose" 是一种新型序列建模架构,尽管训练数据量较少,却能在多语言任务中达到顶尖性能,同时保持恒定内存和推理时间,并能执行状态跟踪和识别所有正则语言,超越了Transformer的固有局限。
English: RWKV-7 "Goose" is a novel sequence modeling architecture that achieves state-of-the-art performance in multilingual tasks with constant memory and inference time, despite training on fewer tokens, and demonstrates capabilities beyond Transformers by performing state tracking and recognizing all regular languages.
Authors:Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, Yaroslav Zharov
Abstract:
Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce a comprehensive environment setup benchmark EnvBench. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, which we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach manages to successfully configure 6.69% of the repositories for Python and 29.47% for JVM, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench. The dataset and experiment trajectories are available at https://jb.gg/envbench.
中文: 本文提出了EnvBench这一全面评估软件仓库环境配置任务的基准,涵盖329个Python和665个基于JVM的具有挑战性配置的项目,实验表明现有方法仅能成功配置6.69%的Python项目和29.47%的JVM项目,凸显了该基准的难度。
English: This paper introduces EnvBench, a comprehensive benchmark for evaluating environment setup tasks in software repositories, covering 329 Python and 665 JVM-based projects with challenging configurations, and demonstrates its difficulty as current methods achieve only 6.69% success for Python and 29.47% for JVM repositories.
Authors:Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Abstract:
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
Chinese: 随着深度伪造视频的真实感日益增强,人类和自动化检测系统都面临挑战,为此我们推出了首个可解释深度伪造检测数据集ExDDV,通过文本和点击标注提升模型对伪造痕迹的定位与描述能力,确保检测结果既可靠又可解释。
English: The increasing realism of deepfake videos challenges both human detection and automated systems, which often lack explainability, prompting the introduction of ExDDV—the first dataset and benchmark for explainable deepfake detection, using text and click annotations to enhance model robustness and localization of artifacts.
Authors:Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter
Abstract:
Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels (Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables arbitrary large chunk sizes and high arithmetic intensity by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.
Chinese: 瓦片式闪存线性注意力(TFLA)是一种新颖的线性RNN核算法,通过支持大块大小和高算术强度,实现了高效的长上下文建模,在速度基准测试中超越了现有注意力机制。
English: Tiled Flash Linear Attention (TFLA) is a novel kernel algorithm for linear RNNs that enables efficient long-context modeling by allowing large chunk sizes and high arithmetic intensity, outperforming existing attention mechanisms in speed benchmarks.
Authors:Sai Coumar, Zachary Kingston
Abstract:
Generating structured ASCII art using computational techniques demands a careful interplay between aesthetic representation and computational precision, requiring models that can effectively translate visual information into symbolic text characters. Although Convolutional Neural Networks (CNNs) have shown promise in this domain, the comparative performance of deep learning architectures and classical machine learning methods remains unexplored. This paper explores the application of contemporary ML and DL methods to generate structured ASCII art, focusing on three key criteria: fidelity, character classification accuracy, and output quality. We investigate deep learning architectures, including Multilayer Perceptrons (MLPs), ResNet, and MobileNetV2, alongside classical approaches such as Random Forests, Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN), trained on an augmented synthetic dataset of ASCII characters. Our results show that complex neural network architectures often fall short in producing high-quality ASCII art, whereas classical machine learning classifiers, despite their simplicity, achieve performance similar to CNNs. Our findings highlight the strength of classical methods in bridging model simplicity with output quality, offering new insights into ASCII art synthesis and machine learning on image data with low dimensionality.
中文摘要:本研究表明,在生成结构化ASCII艺术时,随机森林和支持向量机等经典机器学习方法能够取得与复杂深度学习架构相当的性能,突显了它们在模型简洁性与输出质量之间实现平衡的有效性。
English Summary: This study demonstrates that classical machine learning methods like Random Forests and SVMs can achieve comparable performance to complex deep learning architectures in generating structured ASCII art, emphasizing their effectiveness in balancing model simplicity with output quality.
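For illustration, the classical patch-to-character pipeline studied here can be sketched with a k-NN classifier; the glyph size and the way glyph training images are rendered are assumptions, not details from the paper.
```python
# Sketch of classical patch-to-character ASCII synthesis (assumptions: 8x16 glyph
# patches and a k-NN classifier trained on rendered ASCII glyphs; not the paper's code).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

GLYPH_H, GLYPH_W = 16, 8

def train_char_classifier(glyph_images: np.ndarray, chars: list[str]) -> KNeighborsClassifier:
    """glyph_images: (n, GLYPH_H, GLYPH_W) grayscale renderings of ASCII characters."""
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(glyph_images.reshape(len(glyph_images), -1), chars)
    return knn

def image_to_ascii(image: np.ndarray, knn: KNeighborsClassifier) -> str:
    """Tile the image into glyph-sized patches and classify each patch independently."""
    rows = []
    for y in range(0, image.shape[0] - GLYPH_H + 1, GLYPH_H):
        row_chars = []
        for x in range(0, image.shape[1] - GLYPH_W + 1, GLYPH_W):
            patch = image[y:y + GLYPH_H, x:x + GLYPH_W].reshape(1, -1)
            row_chars.append(knn.predict(patch)[0])
        rows.append("".join(row_chars))
    return "\n".join(rows)
```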
Authors:Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Abstract:
While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems at test-time is fundamentally limited by suboptimal memory usage -- intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively cleans up intermediate thoughts based on patterns learned from training. By iteratively generating and erasing thoughts, PENCIL can think deeper to solve harder problems using shorter context and less compute. Empirically, we observe PENCIL is significantly more effective and efficient than CoT. For example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein's puzzle -- a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machines with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.
中文: PENCIL通过引入一种新颖的消减机制,在推理过程中递归清理中间思考,从而能以更短的上下文和更少的计算资源解决比传统思维链方法更复杂的问题。
English: PENCIL introduces a novel reduction mechanism that recursively cleans up intermediate thoughts during reasoning, enabling deeper problem-solving with shorter context and less computation than traditional Chain-of-Thought methods.
Authors:Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
Abstract:
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited, particularly in detecting data contamination and hallucinations. While recently proposed probing techniques provide insights through activation analysis, they require ``white-box'' access to model internals, often unavailable. Current ``gray-box'' approaches typically analyze only the probability of the actual tokens in the sequence with simple task-specific heuristics. Importantly, these methods overlook the rich information contained in the full token distribution at each processing step. To address these limitations, we propose that gray-box analysis should leverage the complete observable output of LLMs, consisting of both the previously used token probabilities as well as the complete token distribution sequences - a unified data type we term LOS (LLM Output Signature). To this end, we develop a transformer-based approach to process LOS that theoretically guarantees approximation of existing techniques while enabling more nuanced analysis. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings, significantly outperforming existing baselines. Furthermore, it demonstrates strong transfer capabilities across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. Our code is available at: https://github.com/BarSGuy/LLM-Output-Signatures-Network.
中文: 本文提出LOS-Net,一种基于轻量级注意力机制的模型,通过利用完整的下一个词符概率分布序列(称为LLM输出签名),在不访问模型内部参数的情况下,有效检测大语言模型中的幻觉和训练数据污染问题。
English: This paper introduces LOS-Net, a lightweight attention-based model that utilizes the full sequence of next-token probability distributions, termed LLM Output Signature, to effectively detect hallucinations and training data contamination in Large Language Models without accessing internal model parameters.
Authors:Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao
Abstract:
Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, such as MMLongBench and LongDocURL, demonstrate the effectiveness of our MDocAgent, achieving an average improvement of 12.1% compared to the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.
中文: MDocAgent提出了一种多模态多智能体框架,通过五个专业智能体协同整合文本和图像分析,在五个基准测试中平均性能提升12.1%,显著提升了文档问答中的多模态推理能力。
English: MDocAgent introduces a multi-modal multi-agent framework that integrates text and image analysis through five specialized agents, achieving a 12.1% average improvement on benchmarks by enhancing multi-modal reasoning for document question answering.
Authors:Sunbowen Lee, Yicheng Gong, Chao Deng
Abstract:
Reinforcement learning control algorithms face significant challenges due to out-of-distribution and inefficient exploration problems. While model-based reinforcement learning enhances the agent's reasoning and planning capabilities by constructing virtual environments, training such virtual environments can be very complex. In order to build an efficient inference model and enhance the representativeness of learning data, we propose the Counterfactual Experience Augmentation (CEA) algorithm. CEA leverages variational autoencoders to model the dynamic patterns of state transitions and introduces randomness to model non-stationarity. This approach focuses on expanding the learning data in the experience pool through counterfactual inference and performs exceptionally well in environments that follow the bisimulation assumption. Since environments with bisimulation properties are usually represented by discrete observation and action spaces, we propose a sampling method based on maximum kernel density estimation entropy to extend CEA to various environments. By providing reward signals for counterfactual state transitions based on real information, CEA constructs a complete counterfactual experience to alleviate the out-of-distribution problem of the learning data, and outperforms general SOTA algorithms in environments with different properties. Finally, we discuss the similarities, differences and properties of generated counterfactual experiences and real experiences. The code is available at https://github.com/Aegis1863/CEA.
Chinese: 反事实经验增强(CEA)算法通过变分自编码器建模状态转移并生成反事实经验,有效缓解强化学习中的分布外和探索效率问题,在多种环境下均优于现有先进方法。
English: The Counterfactual Experience Augmentation (CEA) algorithm addresses out-of-distribution and exploration challenges in reinforcement learning by using variational autoencoders to model state transitions and generate counterfactual experiences, outperforming state-of-the-art methods across diverse environments.
Authors:Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Junyong Noh
Abstract:
Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at the project page.
Authors:Yushan Jiang, Kanghui Ning, Zijie Pan, Xuyang Shen, Jingchao Ni, Wenchao Yu, Anderson Schneider, Haifeng Chen, Yuriy Nevmyvaka, Dongjin Song
Abstract:
Multi-modal time series analysis has recently emerged as a prominent research area in data mining, driven by the increasing availability of diverse data modalities, such as text, images, and structured tabular data from real-world sources. However, effective analysis of multi-modal time series is hindered by data heterogeneity, modality gap, misalignment, and inherent noise. Recent advancements in multi-modal time series methods have exploited the multi-modal context via cross-modal interactions based on deep learning methods, significantly enhancing various downstream tasks. In this tutorial and survey, we present a systematic and up-to-date overview of multi-modal time series datasets and methods. We first state the existing challenges of multi-modal time series analysis and our motivations, with a brief introduction of preliminaries. Then, we summarize the general pipeline and categorize existing methods through a unified cross-modal interaction framework encompassing fusion, alignment, and transference at different levels (\textit{i.e.}, input, intermediate, output), where key concepts and ideas are highlighted. We also discuss the real-world applications of multi-modal analysis for both standard and spatial time series, tailored to general and specific domains. Finally, we discuss future research directions to help practitioners explore and exploit multi-modal time series. The up-to-date resources are provided in the GitHub repository: https://github.com/UConn-DSIS/Multi-modal-Time-Series-Analysis
中文:本教程系统概述了多模态时间序列分析,通过跨模态交互方法解决数据异构性和模态差异等挑战,同时探讨了实际应用和未来研究方向。
English: This tutorial provides a systematic overview of multi-modal time series analysis, addressing challenges like data heterogeneity and modality gaps through cross-modal interaction methods, while also exploring applications and future research directions.
Authors:Seokhyeon Hong, Soojin Choi, Chaelin Kim, Sihun Cha, Junyong Noh
Abstract:
Despite the growing accessibility of skeletal motion data, integrating it for animating character meshes remains challenging due to diverse configurations of both skeletons and meshes. Specifically, the body scale and bone lengths of the skeleton should be adjusted in accordance with the size and proportions of the mesh, ensuring that all joints are accurately positioned within the character mesh. Furthermore, defining skinning weights is complicated by variations in skeletal configurations, such as the number of joints and their hierarchy, as well as differences in mesh configurations, including their connectivity and shapes. While existing approaches have made efforts to automate this process, they hardly address the variations in both skeletal and mesh configurations. In this paper, we present a novel method for the automatic rigging and skinning of character meshes using skeletal motion data, accommodating arbitrary configurations of both meshes and skeletons. The proposed method predicts the optimal skeleton aligned with the size and proportion of the mesh as well as defines skinning weights for various mesh-skeleton configurations, without requiring explicit supervision tailored to each of them. By incorporating Diffusion 3D Features (Diff3F) as semantic descriptors of character meshes, our method achieves robust generalization across different configurations. To assess the performance of our method in comparison to existing approaches, we conducted comprehensive evaluations encompassing both quantitative and qualitative analyses, specifically examining the predicted skeletons, skinning weights, and deformation quality.
Authors:Lin-Han Jia, Lan-Zhe Guo, Zhi Zhou, Si-Ye Han, Zi-Wen Li, Yu-Feng Li
Abstract:
In real-world applications, it is often challenging to detect anomalous samples when the anomalous information they contain is extremely limited. In such cases, both macro-level and micro-level detection using multi-instance learning (MIL) encounter significant difficulties. The former struggles because normal and anomalous samples are highly similar and hard to distinguish at the macro level, while the latter is limited by the lack of labels at the micro level. In MIL, micro-level labels are inferred from macro-level labels, which can lead to severe bias. Moreover, the more imbalanced the distribution between normal and anomalous samples, the more pronounced these limitations become. In this study, we observe that the MIL problem can be elegantly transformed into a fine-grained Positive-Unlabeled (PU) learning problem. This transformation allows us to address the imbalance issue in an unbiased manner using a micro-level balancing mechanism. To this end, we propose a novel framework, Balanced Fine-Grained Positive-Unlabeled (BFGPU), based on rigorous theoretical foundations to address the challenges above. Extensive experiments on both public and real-world datasets demonstrate the effectiveness of BFGPU, which outperforms existing methods, even in extreme scenarios where both macro and micro-level distributions are highly imbalanced. The code is open-sourced at https://github.com/BFGPU/BFGPU.
中文: 本研究通过将多示例学习重新定义为细粒度PU学习问题,提出了基于严格理论基础的BFGPU框架,利用微观平衡机制有效解决了异常样本稀缺的双重不平衡问题,并在合成和真实数据集上验证了其有效性。
English: The study tackles the dual imbalance in Multi-Instance Learning by reframing it as a fine-grained PU learning problem and introduces the BFGPU framework, which effectively addresses the scarcity of anomalies through micro-level balancing mechanisms, as validated by experiments on synthetic and real-world datasets.
Authors:Jingyuan Xue, Longfei Wei, Dongjing Jiang, Fang Sheng, Russell Greiner, Jianfei Zhang
Abstract:
Battery degradation significantly impacts the reliability and efficiency of energy storage systems, particularly in electric vehicles and industrial applications. Predicting the remaining useful life (RUL) of lithium-ion batteries is crucial for optimizing maintenance schedules, reducing costs, and improving safety. Traditional RUL prediction methods often struggle with nonlinear degradation patterns and uncertainty quantification. To address these challenges, we propose a hybrid survival analysis framework integrating survival data reconstruction, survival model learning, and survival probability estimation. Our approach transforms battery voltage time series into time-to-failure data using path signatures. The multiple Cox-based survival models and machine-learning-based methods, such as DeepHit and MTLR, are learned to predict battery failure-free probabilities over time. Experiments conducted on the Toyota battery and NASA battery datasets demonstrate the effectiveness of our approach, achieving high time-dependent AUC and concordance index (C-Index) while maintaining a low integrated Brier score. The data and source codes for this work are available to the public at https://github.com/thinkxca/rul.
Chinese: 本研究提出的混合生存分析框架通过将电压数据转化为失效时间指标并采用多种生存模型,有效预测锂离子电池的剩余使用寿命,在基准数据集上展现出高精度指标的优越性能。
English: The proposed hybrid survival analysis framework effectively predicts lithium-ion battery remaining useful life by transforming voltage data into time-to-failure metrics and employing multiple survival models, demonstrating superior performance on benchmark datasets with high accuracy scores.
Authors:Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi, Yongjae Lee
Abstract:
We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, DSL shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.
中文: 决策监督学习(DSL)框架将投资组合优化转化为监督学习问题,采用交叉熵损失和深度集成方法,通过广泛回测显示出优于传统策略和主流机器学习方法的性能。
English: The Decision by Supervised Learning (DSL) framework transforms portfolio optimization into a supervised learning task using cross-entropy loss and Deep Ensemble methods, demonstrating superior performance over traditional and machine learning approaches through extensive backtesting.
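A minimal sketch of the DSL idea, assuming Sharpe-optimal targets are approximated by sampling candidate long-only portfolios; the sampling-based target construction and the small network head are illustrative stand-ins rather than the authors' implementation.
```python
# Sketch of Decision by Supervised Learning (assumptions: Dirichlet-sampled candidate
# portfolios to approximate the Sharpe-maximizing target, a softmax MLP head).
import numpy as np
import torch
import torch.nn as nn

def sharpe_target_weights(returns: np.ndarray, n_candidates: int = 2000, seed: int = 0) -> np.ndarray:
    """returns: (T, n_assets) historical window; keep the best candidate portfolio."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(returns.shape[1]), size=n_candidates)  # long-only weights
    port = returns @ candidates.T                               # (T, n_candidates) portfolio returns
    sharpe = port.mean(axis=0) / (port.std(axis=0) + 1e-8)
    return candidates[np.argmax(sharpe)]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features: torch.Tensor, target_weights: torch.Tensor) -> float:
    """Cross-entropy between the predicted weight distribution and the Sharpe-optimal target."""
    log_weights = torch.log_softmax(model(features), dim=-1)
    loss = -(target_weights * log_weights).sum(dim=-1).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Example: 60 days of returns over 5 assets, 10 stand-in market features.
window_returns = np.random.randn(60, 5) * 0.01
target = torch.tensor(sharpe_target_weights(window_returns), dtype=torch.float32).unsqueeze(0)
print(train_step(torch.randn(1, 10), target))
```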
Authors:Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, Li Shen
Abstract:
We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being. The dataset is available at https://huggingface.co/datasets/ShenLab/MentalChat16K and the code and documentation are hosted on GitHub at https://github.com/ChiaPatricia/MentalChat16K.
中文: MentalChat16K是一个结合合成与匿名心理健康咨询记录的英文数据集,旨在推动共情式对话助手的AI技术发展,同时严格保障数据隐私与伦理规范。
English: MentalChat16K is a specialized English dataset combining synthetic and anonymized mental health counseling transcripts, designed to advance AI development for empathetic conversational assistance while ensuring privacy and ethical data use.
Authors:Mohammod N. I. Suvon, Shuo Zhou, Prasun C. Tripathi, Wenrui Fan, Samer Alabed, Bishesh Khanal, Venet Osmani, Andrew J. Swift, Chen, Chen, Haiping Lu
Abstract:
Recent advancements in early assessment of pulmonary hypertension (PH) primarily focus on applying machine learning methods to centralized diagnostic modalities, such as 12-lead electrocardiogram (12L-ECG). Despite their potential, these approaches fall short in decentralized clinical settings, e.g., point-of-care and general practice, where handheld 6-lead ECG (6L-ECG) can offer an alternative but is limited by the scarcity of labeled data for developing reliable models. To address this, we propose a lead-specific electrocardiogram multimodal variational autoencoder (\textsc{LS-EMVAE}), which incorporates a hierarchical modality expert (HiME) fusion mechanism and a latent representation alignment loss. HiME combines mixture-of-experts and product-of-experts to enable flexible, adaptive latent fusion, while the alignment loss improves coherence among lead-specific and shared representations. To alleviate data scarcity and enhance representation learning, we adopt a transfer learning strategy: the model is first pre-trained on a large unlabeled 12L-ECG dataset and then fine-tuned on smaller task-specific labeled 6L-ECG datasets. We validate \textsc{LS-EMVAE} across two retrospective cohorts in a 6L-ECG setting: 892 subjects from the ASPIRE registry for (1) PH detection and (2) phenotyping pre-/post-capillary PH, and 16,416 subjects from UK Biobank for (3) predicting elevated pulmonary atrial wedge pressure, where it consistently outperforms unimodal and multimodal baseline methods and demonstrates strong generalizability and interpretability. The code is available at https://github.com/Shef-AIRE/LS-EMVAE.
中文: 针对分散式临床环境中6导联心电图标记数据稀缺的问题,本研究提出LS-EMVAE模型,通过分层专家融合机制和迁移学习策略,在肺动脉高压检测与分型任务中显著优于现有方法,并展现出优异的泛化能力。
English: Recent machine learning methods for pulmonary hypertension assessment using 12-lead ECG face limitations in decentralized settings with 6-lead ECG due to scarce labeled data, prompting the development of LS-EMVAE, a multimodal variational autoencoder that employs transfer learning and hierarchical fusion to enhance performance and generalizability across clinical tasks.
Authors:Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
Abstract:
Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
中文: 基于xLSTM架构的最新突破催生了xLSTM 7B模型,这个70亿参数的模型在保持优异任务性能的同时,实现了比同类模型更快的推理速度和更高的效率。
English: Recent advances in xLSTM architecture enable the development of xLSTM 7B, a 7-billion-parameter model that outperforms similar-sized LLMs in inference speed and efficiency while maintaining competitive task performance.
Authors:James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Abstract:
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53\%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and the project page is at https://jmhb0.github.io/microvqa.
中文: MicroVQA是一个专为生物学研究设计的视觉问答基准,用于评估多模态推理能力,填补了现有基准的不足,并通过测试先进模型揭示了感知错误是科学推理中的主要挑战。
English: MicroVQA is a research-level visual question answering benchmark developed to evaluate critical multimodal reasoning skills in biology, addressing gaps in existing benchmarks and revealing key challenges through testing state-of-the-art models.
Authors:Qing Zhou, Junyu Gao, Qi Wang
Abstract:
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50\% while maintaining or improving performance, with minimal degradation even at 70\% cost reduction. Furthermore, experiments on real datasets of various scales across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
中文: 提出的规模高效训练(SeTa)方法通过动态剪除大规模数据集中的低价值样本,可在保持性能的同时将训练时间减少高达50%,并在多种数据集和架构中验证了其有效性。
English: The proposed Scale Efficient Training (SeTa) method dynamically prunes low-value samples from large datasets to reduce training time by up to 50% without compromising performance, demonstrating effectiveness across multiple datasets and architectures.
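A rough sketch of the SeTa selection step follows, assuming quantile-based loss clusters and a linear easy-to-hard window schedule; both are assumptions, and the released code may differ.
```python
# Sketch of SeTa-style dynamic pruning: random pruning, loss-based difficulty clusters,
# and a sliding window that moves from easy to hard clusters over training.
import numpy as np

def seta_select(per_sample_loss: np.ndarray, epoch: int, total_epochs: int,
                keep_ratio: float = 0.7, n_clusters: int = 10, window: int = 4, seed: int = 0):
    rng = np.random.default_rng(seed + epoch)
    n = len(per_sample_loss)
    # 1) Random pruning removes a fixed share of (potentially redundant) samples.
    kept = rng.choice(n, size=int(keep_ratio * n), replace=False)
    # 2) Cluster kept samples by difficulty: quantile bins over their current loss.
    order = kept[np.argsort(per_sample_loss[kept])]
    clusters = np.array_split(order, n_clusters)        # clusters[0] = easiest
    # 3) The sliding window skips both inefficient easy and overly hard clusters,
    #    shifting toward harder clusters as training progresses.
    start = int((n_clusters - window) * epoch / max(total_epochs - 1, 1))
    return np.concatenate(clusters[start:start + window])  # indices to train on this epoch

# Example: 10,000 samples, epoch 3 of 20.
print(len(seta_select(np.random.rand(10000), epoch=3, total_epochs=20)))
```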
Authors:Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
Abstract:
Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information; in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, causing outputs that over-rely on text. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing that the model's textual outputs dominate the following reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4 points vs. the previous SOTA), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.
Authors:Witold Wydmański, Marek Śmieja
Abstract:
Feature selection in deep learning remains a critical challenge, particularly for high-dimensional tabular data where interpretability and computational efficiency are paramount. We present GFSNetwork, a novel neural architecture that performs differentiable feature selection through temperature-controlled Gumbel-Sigmoid sampling. Unlike traditional methods, where the user has to define the requested number of features, GFSNetwork selects it automatically during an end-to-end process. Moreover, GFSNetwork maintains constant computational overhead regardless of the number of input features. We evaluate GFSNetwork on a series of classification and regression benchmarks, where it consistently outperforms recent methods including DeepLasso, attention maps, as well as traditional feature selectors, while using significantly fewer features. Furthermore, we validate our approach on real-world metagenomic datasets, demonstrating its effectiveness in high-dimensional biological data. Concluding, our method provides a scalable solution that bridges the gap between neural network flexibility and traditional feature selection interpretability. We share our python implementation of GFSNetwork at https://github.com/wwydmanski/GFSNetwork, as well as a PyPi package (gfs_network).
中文: GFSNetwork提出了一种新颖的神经网络架构,能够通过可微分特征选择自动处理高维数据,在基准测试和实际应用中优于现有方法,同时保持计算效率和可解释性。
English: GFSNetwork introduces a novel neural architecture for automatic and differentiable feature selection in high-dimensional data, outperforming existing methods in benchmarks and real-world applications while maintaining computational efficiency and interpretability.
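The temperature-controlled Gumbel-Sigmoid gate at the core of the method can be sketched as below; the sparsity penalty and the hard test-time thresholding are assumptions rather than the exact GFSNetwork design.
```python
# Sketch of differentiable feature gating via the Gumbel-Sigmoid (binary-concrete)
# relaxation; the penalty weight and downstream head are illustrative, not GFSNetwork itself.
import torch
import torch.nn as nn

class GumbelFeatureGate(nn.Module):
    def __init__(self, n_features: int, temperature: float = 0.5):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_features))  # one learnable gate per feature
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)            # Logistic(0, 1) noise
            gates = torch.sigmoid((self.logits + noise) / self.temperature)
        else:
            gates = (self.logits > 0).float()                 # hard selection at test time
        return x * gates

    def sparsity_penalty(self) -> torch.Tensor:
        # Encourages most gates to close, so the number of selected features emerges automatically.
        return torch.sigmoid(self.logits).sum()

# Usage: total loss = task_loss + lambda * gate.sparsity_penalty()
gate = GumbelFeatureGate(n_features=100)
head = nn.Linear(100, 2)
logits = head(gate(torch.randn(32, 100)))
```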
Authors:Mikkel Jordahn, Jonas Vestergaard Jensen, Mikkel N. Schmidt, Michael Riis Andersen
Abstract:
Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Similarly, deep ensembles (DEs) are also known to improve calibration, and therefore, it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shine light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.
中文: 贝叶斯神经网络的深度集成虽然在理论上具有优势,但在分布内数据上表现反而不如标准深度集成,尽管其在分布外检测方面表现优异,但这是以牺牲分布内性能为代价的。
English: Deep ensembles of Bayesian Neural Networks surprisingly underperform standard deep ensembles on in-distribution data despite theoretical advantages, though they excel in out-of-distribution detection at the cost of in-distribution performance.
Authors:Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Jun Liu, Qika Lin, Zhiyong Wu
Abstract:
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named $\phi$-Decoding. To provide a precise and expressive estimation of step value, $\phi$-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show $\phi$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
中文:提出的$\phi$-解码策略通过前瞻采样和聚类优化推理步骤,在多个基准测试中实现了卓越的性能与效率,并通过剪枝技术支持自适应计算分配。
English: The proposed $\phi$-Decoding strategy uses foresight sampling and clustering to optimize reasoning steps, achieving superior performance and efficiency across benchmarks while supporting adaptive computation through pruning techniques.
Authors:Ihab Asaad, Maha Shadaydeh, Joachim Denzler
Abstract:
Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations and defines the target gradient as a linear extrapolation of the gradients computed from each batch's loss. Our analysis shows that when the extrapolated gradient points toward the batch gradient with fewer spurious correlations, it effectively guides training toward learning a debiased model. GERNE serves as a general framework for debiasing, encompassing ERM and Resampling methods as special cases. We derive the theoretical upper and lower bounds of the extrapolation factor employed by GERNE. By tuning this factor, GERNE can adapt to maximize either Group-Balanced Accuracy (GBA) or Worst-Group Accuracy (WGA). We validate GERNE on five vision and one NLP benchmarks, demonstrating competitive and often superior performance compared to state-of-the-art baselines. The project page is available at: https://gerne-debias.github.io/.
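The core update can be sketched as a two-batch gradient extrapolation; the factor c and the batch construction below are assumptions, and the paper derives principled bounds for the extrapolation factor rather than fixing it by hand.
```python
# Minimal sketch of gradient extrapolation for debiasing: the applied gradient is a
# linear extrapolation from the biased-batch gradient toward (and past) the gradient of
# the batch with fewer spurious correlations.
import torch

def extrapolated_step(model, loss_fn, batch_biased, batch_balanced, optimizer, c=2.0):
    params = [p for p in model.parameters() if p.requires_grad]
    g_biased = torch.autograd.grad(loss_fn(model, batch_biased), params)
    g_balanced = torch.autograd.grad(loss_fn(model, batch_balanced), params)
    optimizer.zero_grad()
    for p, gb, gbal in zip(params, g_biased, g_balanced):
        # c = 0 recovers training on the biased batch; c = 1 uses the balanced batch;
        # c > 1 extrapolates beyond it, pushing harder against the spurious signal.
        p.grad = gb + c * (gbal - gb)
    optimizer.step()
```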
Authors:Yijie Liu, Xinyi Shang, Yiqun Zhang, Yang Lu, Chen Gong, Jing-Hao Xue, Hanzi Wang
Abstract:
Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-label is largely deteriorated by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), that can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at https://github.com/Jay-Codeman/SAGE
中文: 联邦半监督学习(FSSL)面临数据异构性导致伪标签质量下降的问题,而提出的SAGE方法通过置信度差异灵活修正伪标签,有效提升了模型性能和收敛速度。
English: Federated Semi-Supervised Learning (FSSL) faces challenges from data heterogeneity that degrade pseudo-label quality, but the proposed SAGE method effectively corrects pseudo-labels using confidence discrepancies to enhance model performance and convergence.
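A heavily hedged sketch of confidence-discrepancy pseudo-label correction follows; SAGE's exact correction rule may differ from the margin-and-ensemble heuristic assumed here.
```python
# Sketch: trust the hard local pseudo-label only when local and global models roughly
# agree in confidence; otherwise fall back to a confidence-weighted ensemble.
# The margin and the weighting scheme are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def corrected_pseudo_labels(logits_local, logits_global, margin: float = 0.2):
    p_local, p_global = F.softmax(logits_local, dim=-1), F.softmax(logits_global, dim=-1)
    conf_local, label_local = p_local.max(dim=-1)
    conf_global, _ = p_global.max(dim=-1)
    disagree = (conf_local - conf_global).abs() > margin
    # Ensemble distribution weighted by each model's confidence, used only where they disagree.
    w = conf_local / (conf_local + conf_global + 1e-8)
    ensembled = w.unsqueeze(-1) * p_local + (1 - w).unsqueeze(-1) * p_global
    return torch.where(disagree, ensembled.argmax(dim=-1), label_local)
```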
Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Abstract:
Conformal prediction is a powerful framework for distribution-free uncertainty quantification. The standard approach to conformal prediction relies on comparing the ranks of prediction scores: under exchangeability, the rank of a future test point cannot be too extreme relative to a calibration set. This rank-based method can be reformulated in terms of p-values. In this paper, we explore an alternative approach based on e-values, known as conformal e-prediction. E-values offer key advantages that cannot be achieved with p-values, enabling new theoretical and practical capabilities. In particular, we present three applications that leverage the unique strengths of e-values: batch anytime-valid conformal prediction, fixed-size conformal sets with data-dependent coverage, and conformal prediction under ambiguous ground truth. Overall, these examples demonstrate that e-value-based constructions provide a flexible expansion of the toolbox of conformal prediction.
Chinese: 保形e值预测提出了一种基于e值的替代方法,相较于传统保形预测,它在批量随时有效推断、数据依赖性覆盖和处理模糊真实情况方面具有独特优势,从而扩展了该框架的灵活性和应用范围。
English: Conformal e-prediction introduces an e-value-based alternative to traditional conformal prediction, offering unique advantages such as batch anytime-valid inference, data-dependent coverage, and handling of ambiguous ground truth, thereby expanding the framework's flexibility and applicability.
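For reference, the rank-based (p-value) conformal baseline that the paper builds on looks like the sketch below; the e-value constructions that are the paper's actual contribution are not reproduced here.
```python
# Sketch of the rank-based conformal baseline described in the abstract: a candidate
# label is kept if its nonconformity score is not too extreme relative to the
# calibration scores (equivalently, if its conformal p-value exceeds alpha).
import numpy as np

def conformal_prediction_set(cal_scores: np.ndarray, test_scores_per_label: np.ndarray,
                             alpha: float = 0.1) -> list[int]:
    """cal_scores: (n,) calibration nonconformity scores.
    test_scores_per_label: (K,) score of the test point under each candidate label."""
    n = len(cal_scores)
    prediction_set = []
    for y, s in enumerate(test_scores_per_label):
        p_value = (1 + np.sum(cal_scores >= s)) / (n + 1)   # rank of the test score
        if p_value > alpha:                                  # keep labels that conform
            prediction_set.append(y)
    return prediction_set

# Example with scores such as 1 - predicted probability of the candidate label.
rng = np.random.default_rng(0)
print(conformal_prediction_set(rng.uniform(size=200), np.array([0.05, 0.6, 0.9])))
```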
Authors:Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Abstract:
Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
中文: 本文提出了一种基于层间相似度变化的多模态大语言模型持续指令调优框架,通过任务特定扩展与任务通用融合提升模型适应性,并构建了更严谨的基准测试以解决现有评估中的信息泄露问题。
English: This paper introduces a framework for continual instruction tuning of Multimodal Large Language Models that enhances adaptability by balancing task-specific expansion and task-general fusion, while also proposing a more challenging benchmark to address information leakage in existing evaluations.
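The layer-similarity measure the abstract builds on is Centered Kernel Alignment; the sketch below shows the standard linear CKA computation between two sets of layer activations. How the resulting scores drive task-specific expansion versus task-general fusion is the paper's contribution and is not shown here.

```python
# A minimal sketch of linear Centered Kernel Alignment (CKA) between two
# layers' activations for the same batch of inputs.
import torch

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activations for the same n inputs."""
    X = X - X.mean(dim=0, keepdim=True)   # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.norm(Y.T @ X, p="fro") ** 2
    norm_x = torch.norm(X.T @ X, p="fro")
    norm_y = torch.norm(Y.T @ Y, p="fro")
    return cross / (norm_x * norm_y)

# Usage: compare a layer's activations before and after training on a new task.
acts_old = torch.randn(256, 768)
acts_new = torch.randn(256, 768)
print(float(linear_cka(acts_old, acts_new)))
```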
Authors:Haiyang Guo, Fanhu Zeng, Fei Zhu, Wenzhuo Liu, Da-Han Wang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu
Abstract:
A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.
中文:联邦持续指令调优(FCIT)通过动态整合分布式客户端的知识,并采用子空间选择性激活技术来缓解灾难性遗忘,有效解决了大规模多模态模型训练中的计算成本高和数据异构性问题。
English: Federated Continual Instruction Tuning (FCIT) is proposed to address the challenges of computational costs and data heterogeneity in training Large Multimodal Models by dynamically integrating knowledge from distributed clients while mitigating catastrophic forgetting through subspace activation techniques.
Authors:Duke Nguyen, Aditya Joshi, Flora Salim
Abstract:
Test-time adaptation (TTA) is an effective method for generalizing models across domains, tasks, and distributions without the use of labeled datasets. TTA is therefore well suited to natural language processing (NLP) in the dialectal setting, where models are often trained on Standard American English (SAE) but evaluated on Indian English or Nigerian English, whose distributions differ significantly from SAE. This is especially valuable because dialectal datasets are scarce. In this paper, we explore one of the most widely used TTA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also introduce the concept of the dialectal gap and show that it is positively correlated with the effectiveness of SHOT. We further find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data. Our code is available at https://github.com/dukenguyenxyz/dialect-adaptation
中文摘要:测试时适配(TTA)可在无标注数据情况下有效泛化模型至不同方言分布,其中SHOT方法被证明可行且其效果与方言差异正相关,而基于标准美式英语的微调常优于方言数据。
English Summary: Test-time adaptation (TTA) effectively generalizes models across dialectal distributions without labeled data, with SHOT proving viable and its effectiveness positively correlated with dialectal gaps, while SAE-based finetuning often outperforms dialectal data.
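For context, SHOT-style adaptation keeps the source classifier head frozen and updates only the encoder with an information-maximization objective on unlabeled target data. The sketch below shows that objective (entropy minimization plus a diversity term); the dialectal-GLUE specifics of the paper are not reproduced.

```python
# A minimal sketch of the SHOT-style information-maximization loss used in
# test-time adaptation: minimize per-sample entropy while keeping the marginal
# prediction distribution diverse. The pseudo-labeling component of SHOT is omitted.
import torch
import torch.nn.functional as F

def shot_im_loss(logits, eps=1e-8):
    """logits: (batch, num_classes) from frozen_head(encoder(x)) on unlabeled target data."""
    p = F.softmax(logits, dim=-1)
    ent = -(p * torch.log(p + eps)).sum(dim=-1).mean()   # per-sample entropy (minimize)
    p_mean = p.mean(dim=0)
    div = (p_mean * torch.log(p_mean + eps)).sum()       # negative marginal entropy (minimize)
    return ent + div

# Usage inside an adaptation step (only encoder parameters are updated):
# loss = shot_im_loss(frozen_head(encoder(batch)))
# loss.backward(); optimizer.step()
```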
Authors:Xian-Rong Zhang, Yue-Jiao Gong, Zhiguang Cao, Jun Zhang
Abstract:
In recent years, there has been a growing interest in data-driven evolutionary algorithms (DDEAs) employing surrogate models to approximate the objective functions with limited data. However, current DDEAs are primarily designed for lower-dimensional problems and their performance drops significantly when applied to large-scale optimization problems (LSOPs). To address the challenge, this paper proposes an offline DDEA named DSKT-DDEA. DSKT-DDEA leverages multiple islands that utilize different data to establish diverse surrogate models, fostering diverse subpopulations and mitigating the risk of premature convergence. In the intra-island optimization phase, a semi-supervised learning method is devised to fine-tune the surrogates. It not only facilitates data augmentation, but also incorporates the distribution information gathered during the search process to align the surrogates with the evolving local landscapes. Then, in the inter-island knowledge transfer phase, the algorithm incorporates an adaptive strategy that periodically transfers individual information and evaluates the transfer effectiveness in the new environment, facilitating global optimization efficacy. Experimental results demonstrate that our algorithm is competitive with state-of-the-art DDEAs on problems with up to 1000 dimensions, while also exhibiting decent parallelism and scalability. Our DSKT-DDEA is open-source and accessible at: https://github.com/LabGong/DSKT-DDEA.
中文: 本文提出DSKT-DDEA离线数据驱动进化算法,通过多岛多样化代理模型和自适应知识转移策略,有效应对高达1000维的大规模优化问题,性能与现有先进方法相当,并具备良好的并行性和可扩展性。
English: This paper introduces DSKT-DDEA, an offline data-driven evolutionary algorithm that uses multiple islands with diverse surrogate models and adaptive knowledge transfer to tackle large-scale optimization problems with up to 1000 dimensions, remaining competitive with state-of-the-art methods while exhibiting good parallelism and scalability.
Authors:Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, Chuan Guo
Abstract:
Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.
Chinese Summary: 本综述首次系统梳理了人体交互动作生成领域,涵盖基础概念、解决方案与数据集,重点分析人-人、人-物、人-场景三类交互任务及评估标准,并展望了未来研究方向。
English Summary: This survey provides a comprehensive overview of human interaction motion generation, covering foundational concepts, existing solutions, datasets, and evaluation metrics across human-human, human-object, and human-scene interactions, while identifying future research directions.
Authors:Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang
Abstract:
Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.
Chinese: ZO2框架通过动态在CPU和GPU间迁移模型参数,实现了大语言模型的高效零阶微调,仅需18GB显存即可训练OPT-175B等超大规模模型,且不增加时间开销或损失精度。
English: The ZO2 framework enables efficient zeroth-order fine-tuning of large language models by dynamically offloading parameters between CPU and GPU, allowing models like OPT-175B to be trained on just 18GB of GPU memory without time or accuracy penalties.
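The double-forward zeroth-order estimate that ZO2 builds on can be written compactly; the sketch below shows a MeZO/SPSA-style step that regenerates the perturbation noise from a seed instead of storing it. The CPU-GPU offloading and low-bit AMP data path that enable 175B-scale fine-tuning are not shown, and the `loss_fn(model, batch)` interface is an assumption.

```python
# A minimal sketch of a two-forward-pass zeroth-order (SPSA-style) update.
import torch

def zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        # Re-generating the same noise from the seed avoids storing it.
        gen = torch.Generator(device=params[0].device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0); loss_plus = loss_fn(model, batch)
        perturb(-2.0); loss_minus = loss_fn(model, batch)   # now at theta - eps*z
        perturb(+1.0)                                       # restore theta
        grad_scale = (loss_plus - loss_minus) / (2 * eps)

        # Update theta <- theta - lr * grad_scale * z, regenerating z from the seed.
        gen = torch.Generator(device=params[0].device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
            p.data.add_(-lr * grad_scale * z)

# Usage (assumed HuggingFace-style interface): zo_step(model, lambda m, b: m(**b).loss, batch, seed=step)
```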
Authors:Imran Kabir, Md Alimoor Reza, Syed Billah
Abstract:
Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
Chinese: Logic-RAG是一种新颖的检索增强生成框架,通过一阶逻辑构建动态知识库,显著提升多模态大模型在自动驾驶中的空间推理能力,有效解决了现有模型在细粒度空间理解上的不足。
English: Logic-RAG is a novel Retrieval-Augmented Generation framework that enhances large multimodal models' spatial reasoning in autonomous driving by constructing a dynamic knowledge base with first-order logic, significantly improving accuracy on visual-spatial queries.
Authors:Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, Nikhil Pakhale, Lownish Rai Sookha, M. A. Ganaie, Abhinav Dhall
Abstract:
Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal and often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, predicting emotions in polar coordinates following Russell's circumplex model. On the Aff-Wild2 dataset, MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline with a CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW
Chinese: 提出的MAVEN模型通过跨模态注意力机制整合多模态线索,在Aff-Wild2数据集上实现了优于传统方法的动态情绪识别性能。
English: The proposed MAVEN model enhances dynamic emotion recognition in the wild by integrating multi-modal cues through a cross-modal attention mechanism, achieving superior performance on the Aff-Wild2 dataset compared to traditional methods.
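Two ingredients in the abstract are standard enough to show directly: the concordance correlation coefficient used for evaluation, and the mapping of valence-arousal pairs to polar coordinates on Russell's circumplex. MAVEN's attention architecture itself is not sketched.

```python
# A minimal sketch of the CCC metric and the polar-coordinate view of
# valence-arousal; both are generic formulas, not MAVEN-specific code.
import numpy as np

def ccc(pred, target):
    """Concordance correlation coefficient between two 1-D arrays."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mp, mt = pred.mean(), target.mean()
    vp, vt = pred.var(), target.var()
    cov = ((pred - mp) * (target - mt)).mean()
    return 2 * cov / (vp + vt + (mp - mt) ** 2)

def to_polar(valence, arousal):
    """Map (valence, arousal) in [-1, 1]^2 to (intensity, angle) on the circumplex."""
    intensity = np.hypot(valence, arousal)
    angle = np.arctan2(arousal, valence)
    return intensity, angle

print(ccc([0.1, 0.4, 0.3], [0.2, 0.5, 0.2]))
```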
Authors:Yancheng Wang, Changyu Liu, Yingzhen Yang
Abstract:
Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at https://github.com/Statistical-Deep-Learning/DoG.
中文: 提出的图扩散方法(DoG)通过生成合成图结构来增强原始图以支持节点级学习任务,采用双级邻居映射解码器提高效率,并引入低秩正则化减少噪声影响,实验在半监督节点分类和图对比学习中验证了其有效性。
English: The proposed Diffusion on Graph (DoG) method generates synthetic graph structures to augment original graphs for node-level learning tasks, incorporating a Bi-Level Neighbor Map Decoder for efficiency and low-rank regularization to reduce noise impact, with experiments validating its effectiveness in semi-supervised node classification and graph contrastive learning.
Authors:Wei Zhu, Abirath Raju, Abdulaziz Shamsah, Anqi Wu, Seth Hutchinson, Ye Zhao
Abstract:
This study presents an emotion-aware navigation framework -- EmoBipedNav -- using deep reinforcement learning (DRL) for bipedal robots walking in socially interactive environments. The inherent locomotion constraints of bipedal robots challenge their safe maneuvering capabilities in dynamic environments. When combined with the intricacies of social environments, including pedestrian interactions and social cues, such as emotions, these challenges become even more pronounced. To address these coupled problems, we propose a two-stage pipeline that considers both bipedal locomotion constraints and complex social environments. Specifically, social navigation scenarios are represented using sequential LiDAR grid maps (LGMs), from which we extract latent features, including collision regions, emotion-related discomfort zones, social interactions, and the spatio-temporal dynamics of evolving environments. The extracted features are directly mapped to the actions of reduced-order models (ROMs) through a DRL architecture. Furthermore, the proposed framework incorporates full-order dynamics and locomotion constraints during training, effectively accounting for tracking errors and restrictions of the locomotion controller while planning the trajectory with ROMs. Comprehensive experiments demonstrate that our approach exceeds both model-based planners and DRL-based baselines. The hardware videos and open-source code are available at https://gatech-lidar.github.io/emobipednav.github.io/.
Authors:Haoqi Yuan, Yu Bai, Yuhui Fu, Bohan Zhou, Yicheng Feng, Xinrun Xu, Yi Zhan, Börje F. Karlsson, Zongqing Lu
Abstract:
Building autonomous robotic agents capable of achieving human-level performance in real-world embodied tasks is an ultimate goal in humanoid robot research. Recent advances have made significant progress in high-level cognition with Foundation Models (FMs) and low-level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency due to compounding errors in long-horizon tasks and the varied latency of different modules. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM's embodied capabilities by translating language-based plans into actionable skill commands and dynamically coordinating locomotion and manipulation to improve task success. With all components, except the FM, deployable on low-cost onboard computation devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0's effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit https://beingbeyond.github.io/Being-0.
Authors:Patryk Marszałek, Ulvi Movsum-zada, Oleksii Furman, Kamil Książek, Przemysław Spurek, Marek Śmieja
Abstract:
In recent years, there has been a growing interest in explainable AI methods. We want not only to make accurate predictions using sophisticated neural networks but also to understand what the model's decision is based on. One of the fundamental levels of interpretability is to provide counterfactual examples explaining the rationale behind the decision and identifying which features, and to what extent, must be modified to alter the model's outcome. To address these requirements, we introduce HyConEx, a classification model based on deep hypernetworks specifically designed for tabular data. Owing to its unique architecture, HyConEx not only provides class predictions but also delivers local interpretations for individual data samples in the form of counterfactual examples that steer a given sample toward an alternative class. While many explainable methods generate counterfactuals for external models, to date there have been no interpretable classifiers that simultaneously produce counterfactual samples. HyConEx achieves competitive performance on several metrics assessing classification accuracy while fulfilling the criteria of a proper counterfactual attack. This makes HyConEx a distinctive deep learning model, which combines predictions and explainers as an all-in-one neural network. The code is available at https://github.com/gmum/HyConEx.
Chinese: HyConEx 是一种基于深度超网络的创新分类模型,专为表格数据设计,不仅能提供精确预测,还能生成反事实样本实现局部可解释性,在分类准确性与可解释性指标上均表现出竞争力。
English: HyConEx is a novel deep hypernetwork-based classifier for tabular data that simultaneously delivers accurate predictions and generates counterfactual examples for local interpretability, achieving competitive performance in both classification and explainability metrics.
Authors:Alessio Xompero, Andrea Cavallaro
Abstract:
Subjective interpretation and content diversity make predicting whether an image is private or public a challenging task. Graph neural networks combined with convolutional neural networks (CNNs), which consist of 14,000 to 500 million parameters, generate features for visual entities (e.g., scene and object types) and identify the entities that contribute to the decision. In this paper, we show that using a simpler combination of transfer learning and a CNN to relate privacy with scene types optimises only 732 parameters while achieving comparable performance to that of graph-based methods. In contrast, end-to-end training of graph-based methods can mask the contribution of individual components to the classification performance. Furthermore, we show that a high-dimensional feature vector, extracted with CNNs for each visual entity, is unnecessary and adds needless complexity to the model. The graph component also has a negligible impact on performance, which is instead driven by fine-tuning the CNN to optimise image features for privacy nodes.
中文: 研究表明,仅用732个参数的迁移学习与CNN结合方法即可实现与复杂图神经网络相当的隐私分类性能,同时揭示高维特征和图结构组件对优化结果并非必要。
English: This study demonstrates that a simpler approach using transfer learning and CNNs with only 732 parameters achieves privacy classification performance comparable to complex graph-based methods, while revealing that high-dimensional features and graph components are unnecessary for optimal results.
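To illustrate how small the trainable part can be, here is a sketch of a frozen scene classifier with a two-class linear head on top. Using a 365-way scene classifier (e.g., Places365) is an assumption made for this example; it happens to make the linear head exactly 365 * 2 + 2 = 732 parameters, matching the count in the abstract, but the paper's exact parameterization may differ.

```python
# A minimal sketch of a privacy probe: a frozen scene-classification backbone
# and a small trainable linear head mapping scene probabilities to private/public.
import torch
import torch.nn as nn

class PrivacyProbe(nn.Module):
    def __init__(self, scene_cnn, n_scenes=365):
        super().__init__()
        self.scene_cnn = scene_cnn.eval()
        for p in self.scene_cnn.parameters():
            p.requires_grad_(False)                  # backbone stays frozen
        self.head = nn.Linear(n_scenes, 2)           # 365*2 + 2 = 732 trainable parameters

    def forward(self, images):
        with torch.no_grad():
            scenes = self.scene_cnn(images).softmax(dim=-1)
        return self.head(scenes)                     # (B, 2) private/public logits

# Stand-in backbone for illustration; in practice a scene-pretrained CNN would be used.
dummy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 365))
probe = PrivacyProbe(dummy_backbone)
print(sum(p.numel() for p in probe.head.parameters()))   # 732
print(probe(torch.randn(2, 3, 224, 224)).shape)          # torch.Size([2, 2])
```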
Authors:Heng Zhang, Guoxiang Zhao, Xiaoqiang Ren
Abstract:
The pursuit-evasion (PE) problem is a critical challenge in multi-robot systems (MRS). While reinforcement learning (RL) has shown its promise in addressing PE tasks, research has primarily focused on single-target pursuit, with limited exploration of multi-target encirclement, particularly in large-scale settings. This paper proposes a Transformer-Enhanced Reinforcement Learning (TERL) framework for large-scale multi-target encirclement. By integrating a transformer-based policy network with target selection, TERL enables robots to adaptively prioritize targets and coordinate with one another safely. Results show that TERL outperforms existing RL-based methods in terms of encirclement success rate and task completion time, while maintaining good performance in large-scale scenarios. Notably, TERL, trained on small-scale scenarios (15 pursuers, 4 targets), generalizes effectively to large-scale settings (80 pursuers, 20 targets) without retraining, achieving a 100% success rate. The code and demonstration video are available at https://github.com/ApricityZ/TERL.
中文: 本文提出了一种基于Transformer增强的强化学习框架(TERL),使多机器人系统能够在大规模场景中有效围捕多个目标,展现出更高的成功率和无需重新训练的良好泛化能力。
English: This paper introduces a Transformer-Enhanced Reinforcement Learning (TERL) framework that enables multi-robot systems to efficiently encircle multiple targets in large-scale scenarios, demonstrating superior success rates and generalization without retraining.
Authors:Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
Abstract:
Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas, (i) multi-scale representations (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
中文: 摘要介绍了多尺度注意力机制(MSA)及基于它的Atlas架构,该架构显著改善了高分辨率图像建模中计算与性能的权衡,在达到与ConvNext-B相当精度的同时速度快4.3倍,并优于FasterViT和LongViT等模型。
English: The abstract introduces Multi-Scale Attention (MSA) and the Atlas architecture, which markedly improves the compute-performance trade-off in high-resolution image modeling, matching ConvNext-B accuracy at 4.3x the speed and outperforming models such as FasterViT and LongViT.
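The sketch below illustrates the two ideas named in the abstract: a log-depth pyramid of progressively coarser token maps, and cross-attention that lets fine tokens read from coarser scales. Layer sizes, normalization, and the full bi-directional scheme of Atlas are simplified assumptions here.

```python
# A minimal sketch of multi-scale token pooling with cross-scale attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                 # tokens: (B, N, dim) at the finest scale
        scales = [tokens]
        while scales[-1].shape[1] > 1:         # O(log N) coarser scales via pooling
            coarse = F.avg_pool1d(scales[-1].transpose(1, 2), kernel_size=2).transpose(1, 2)
            scales.append(coarse)
        out = scales[0]
        for coarse in scales[1:]:              # fine tokens query every coarser scale
            upd, _ = self.attn(out, coarse, coarse)
            out = self.norm(out + upd)
        return out

x = torch.randn(2, 1024, 128)
print(CrossScaleBlock()(x).shape)              # torch.Size([2, 1024, 128])
```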
Authors:Zhe Wang, Yanjun Qi
Abstract:
Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA refines the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair, and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes that jailbreak LLMs and to extract hidden system prompts. Empirically, we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA-learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs. We release our code at https://github.com/QData/ALTA_Augmented_Adversarial_Trigger_Learning
中文: ATLA提出了一种增强的对抗性触发器学习方法,通过优化响应格式令牌并抑制回避性回答,以80%更少的查询实现近100%的攻击成功率,同时在跨模型场景中展现出强大的泛化能力。
English: ATLA introduces an augmented adversarial trigger learning method that enhances attack efficiency by optimizing response format tokens and suppressing evasive responses, achieving nearly 100% success with 80% fewer queries while demonstrating strong generalization across models.
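A token-weighted negative log-likelihood of the kind the abstract describes is easy to state; the sketch below upweights positions flagged as "response format" tokens. Which positions count as format tokens and the weight value are illustrative assumptions, not ATLA's exact choices.

```python
# A minimal sketch of a format-token-weighted NLL for adversarial trigger learning.
import torch
import torch.nn.functional as F

def weighted_trigger_loss(logits, target_ids, format_mask, w_format=3.0):
    """logits: (T, vocab) model logits over the target response.
    target_ids: (T,) target token ids; format_mask: (T,) bool, True at format tokens."""
    nll = F.cross_entropy(logits, target_ids, reduction="none")   # per-token NLL
    weights = torch.where(format_mask,
                          torch.full_like(nll, w_format),
                          torch.ones_like(nll))
    return (weights * nll).sum() / weights.sum()

# Toy example: 6 target tokens, the first two treated as format tokens. In practice
# this loss would be minimized over the trigger tokens with a discrete search.
logits = torch.randn(6, 32000)
targets = torch.randint(0, 32000, (6,))
fmt = torch.tensor([True, True, False, False, False, False])
print(weighted_trigger_loss(logits, targets, fmt))
```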
Authors:Hang Ni, Jindong Han, Nengjun Zhu, Hao Liu
Abstract:
Graph Anomaly Detection (GAD) plays a vital role in various data mining applications such as e-commerce fraud prevention and malicious user detection. Recently, Graph Neural Network (GNN) based approach has demonstrated great effectiveness in GAD by first encoding graph data into low-dimensional representations and then identifying anomalies under the guidance of supervised or unsupervised signals. However, existing GNN-based approaches implicitly follow the homophily principle (i.e., the "like attracts like" phenomenon) and fail to learn discriminative embedding for anomalies that connect vast normal nodes. Moreover, such approaches identify anomalies in a unified global perspective but overlook diversified abnormal patterns conditioned on local graph context, leading to suboptimal performance. To overcome the aforementioned limitations, in this paper, we propose a Multi-hypersphere Heterophilic Graph Learning (MHetGL) framework for unsupervised GAD. Specifically, we first devise a Heterophilic Graph Encoding (HGE) module to learn distinguishable representations for potential anomalies by purifying and augmenting their neighborhood in a fully unsupervised manner. Then, we propose a Multi-Hypersphere Learning (MHL) module to enhance the detection capability for context-dependent anomalies by jointly incorporating critical patterns from both global and local perspectives. Extensive experiments on ten real-world datasets show that MHetGL outperforms 14 baselines. Our code is publicly available at https://github.com/KennyNH/MHetGL.
中文:提出的MHetGL框架通过异质图编码学习可区分的异常表示,并采用多超球面学习从全局和局部视角捕捉上下文相关异常,从而克服了现有基于图神经网络的异常检测方法的局限性。
English: The proposed MHetGL framework overcomes limitations of existing GNN-based anomaly detection methods by introducing heterophilic graph encoding to learn distinguishable anomaly representations and multi-hypersphere learning to capture context-dependent anomalies from both global and local perspectives.
Authors:Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, Kahou Tam, Zhanting Zhou, Haicheng Liao, Zhijiang Guo, Li Li, Chengzhong Xu
Abstract:
Large Language Models (LLMs) have demonstrated impressive success across various tasks. Integrating LLMs with Federated Learning (FL), a paradigm known as FedLLM, offers a promising avenue for collaborative model adaptation while preserving data privacy. This survey provides a systematic and comprehensive review of FedLLM. We begin by tracing the historical development of both LLMs and FL, summarizing relevant prior research to set the context. Subsequently, we delve into an in-depth analysis of the fundamental challenges inherent in deploying FedLLM. Addressing these challenges often requires efficient adaptation strategies; therefore, we conduct an extensive examination of existing Parameter-Efficient Fine-tuning (PEFT) methods and explore their applicability within the FL framework. To rigorously evaluate the performance of FedLLM, we undertake a thorough review of existing fine-tuning datasets and evaluation benchmarks. Furthermore, we discuss FedLLM's diverse real-world applications across multiple domains. Finally, we identify critical open challenges and outline promising research directions to foster future advancements in FedLLM. This survey aims to serve as a foundational resource for researchers and practitioners, offering valuable insights into the rapidly evolving landscape of federated fine-tuning for LLMs. It also establishes a roadmap for future innovations in privacy-preserving AI. We actively maintain a GitHub repo at https://github.com/Clin0212/Awesome-Federated-LLM-Learning to track cutting-edge advancements in this field.
中文: 本综述系统性地探讨了联邦大语言模型(FedLLM),该模型将大语言模型与联邦学习相结合,在保护数据隐私的同时实现协作模型适配,涵盖了挑战、适配策略、应用及未来研究方向。
English: This survey systematically reviews FedLLM, which integrates Large Language Models with Federated Learning to enable collaborative model adaptation while preserving data privacy, covering challenges, adaptation strategies, applications, and future research directions.
Authors:Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Wu
Abstract:
Tabular data synthesis using diffusion models has gained significant attention for its potential to balance data utility and privacy. However, existing privacy evaluations often rely on heuristic metrics or weak membership inference attacks (MIA), leaving privacy risks inadequately assessed. In this work, we conduct a rigorous MIA study on diffusion-based tabular synthesis, revealing that state-of-the-art attacks designed for image models fail in this setting. We identify noise initialization as a key factor influencing attack efficacy and propose a machine-learning-driven approach that leverages loss features across different noises and time steps. Our method, implemented with a lightweight MLP, effectively learns membership signals, eliminating the need for manual optimization. Experimental results from the MIDST Challenge @ SaTML 2025 demonstrate the effectiveness of our approach, securing first place across all tracks. Code is available at https://github.com/Nicholas0228/Tartan_Federer_MIDST.
Chinese: 本研究通过开发一种基于机器学习的成员推断攻击方法,严格评估了基于扩散模型的表格数据合成中的隐私风险,该方法优于现有技术,并在MIDST Challenge @ SaTML 2025中荣获第一名。
English: This study rigorously evaluates privacy risks in diffusion-based tabular data synthesis by developing a machine learning-driven membership inference attack that outperforms existing methods, winning first place in the MIDST Challenge @ SaTML 2025.
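The attack recipe in the abstract, collecting a sample's diffusion losses across several noise draws and timesteps and letting a small MLP separate members from non-members, can be sketched as follows. The denoiser interface, noising schedule, and feature layout are assumptions for illustration, not the winning solution's exact design.

```python
# A minimal sketch of loss-feature extraction plus an MLP membership classifier.
import torch
import torch.nn as nn

def loss_features(denoiser, x0, timesteps, noise_seeds):
    """Return a flat feature vector of per-(timestep, seed) MSE losses for one row x0."""
    feats = []
    for t in timesteps:
        for seed in noise_seeds:
            g = torch.Generator().manual_seed(seed)
            noise = torch.randn(x0.shape, generator=g)
            x_t = x0 + t * noise                      # assumed simple noising rule
            pred = denoiser(x_t, torch.tensor([t]))   # assumed denoiser(x_t, t) -> noise estimate
            feats.append(((pred - noise) ** 2).mean())
    return torch.stack(feats)

# Membership classifier over the loss features, trained on shadow members/non-members.
# With 3 timesteps and 4 seeds the feature vector has 12 entries.
mlp = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 1))
# score = torch.sigmoid(mlp(loss_features(denoiser, x0, [0.1, 0.5, 0.9], [0, 1, 2, 3])))
```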
Authors:Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov
Abstract:
Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. Using minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with $O(n^2)$ time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at https://github.com/ArGintum/RTD-Lite
中文摘要:RTD-Lite是一种高效算法,能以O(n²)复杂度比较加权图的拓扑特征,在显著减少计算时间的同时,可应用于降维和神经网络训练等任务。
English summary: RTD-Lite is an efficient algorithm that compares topological features of weighted graphs with O(n²) complexity, enabling applications in dimensionality reduction and neural network training while significantly reducing computation time.
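The minimum-spanning-tree machinery RTD-Lite is built on can be sketched in a few lines: compute MSTs of two weighted graphs on the same vertex set, and of an auxiliary edge-wise minimum graph, then compare the merge scales. The scalar discrepancy below is an illustrative stand-in, not the exact RTD-Lite score.

```python
# A minimal sketch of MST-based comparison of two weighted graphs.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edge_weights(W):
    """W: (n, n) symmetric distance/weight matrix. Returns sorted MST edge weights."""
    mst = minimum_spanning_tree(W).toarray()
    return np.sort(mst[mst > 0])

def topo_discrepancy(W_a, W_b):
    w_a = mst_edge_weights(W_a)
    w_b = mst_edge_weights(W_b)
    w_min = mst_edge_weights(np.minimum(W_a, W_b))   # auxiliary "joint" graph
    # Compare how cluster merge scales differ between each graph and the joint one.
    return np.abs(w_a - w_min).sum() + np.abs(w_b - w_min).sum()

n = 50
A = np.random.rand(n, n); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
B = np.random.rand(n, n); B = (B + B.T) / 2; np.fill_diagonal(B, 0)
print(topo_discrepancy(A, B))
```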
Authors:Alexander Weers, Alexander H. Berger, Laurin Lux, Peter Schüffler, Daniel Rueckert, Johannes C. Paetzold
Abstract:
The histopathological classification of whole-slide images (WSIs) is a fundamental task in digital pathology; yet it requires extensive time and expertise from specialists. While deep learning methods show promising results, they typically process WSIs by dividing them into artificial patches, which inherently prevents a network from learning from the entire image context, disregards natural tissue structures and compromises interpretability. Our method overcomes this limitation through a novel graph-based framework that constructs WSI graph representations. The WSI-graph efficiently captures essential histopathological information in a compact form. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches all while providing interpretable features for explainability. Through adaptive graph coarsening guided by learned embeddings, we progressively merge regions while maintaining discriminative local features and enabling efficient global information exchange. In our method's final step, we solve the diagnostic task through a graph attention network. We empirically demonstrate strong performance on multiple challenging tasks such as cancer stage classification and survival prediction, while also identifying predictive factors using Integrated Gradients. Our implementation is publicly available at https://github.com/HistoGraph31/pix2pathology
中文摘要:本研究提出了一种基于生物信息的图框架,将全切片图像转化为可解释的表示形式用于癌症诊断,在显著减少资源使用的同时保持竞争力,并具备完全可解释性。
English Summary: This study introduces a biologically-informed graph framework that transforms whole-slide images into interpretable representations for cancer diagnosis, achieving competitive performance with significantly fewer resources while maintaining full interpretability.
Authors:Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
Abstract:
Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.
中文摘要:本调查首次系统综述了多模态时间序列分析这一新兴领域,详细阐述了利用其他模态提升时间序列分析的三大优势,并指出了未来研究方向。
English Summary: This survey introduces MM4TSA as an emerging field exploring how time series analysis can benefit from integrating multiple modalities, systematically reviewing three key benefits and outlining future research opportunities.
Authors:Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
Abstract:
Recent vision-language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage" where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address this issue, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under one-word attacks, MU-based alignment reduces the attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20%. Codes are available at https://github.com/OPTML-Group/VLM-Safety-MU. WARNING: There exist AI generations that may be offensive in nature.
中文: 近期视觉语言模型因监督安全微调产生的伪相关性易生成有害内容,而机器遗忘技术通过直接消除有害知识并保留模型能力,有效解决了这一问题。
English: Recent vision language models show vulnerability to generating harmful content due to spurious correlations from supervised safety fine-tuning, but machine unlearning effectively mitigates these risks by removing harmful knowledge while preserving model capabilities.
Authors:Hyunwoo Park, Baekryun Seong, Sang-Ki Ko
Abstract:
In cooperative multi-agent reinforcement learning (MARL), the permutation problem where the state space grows exponentially with the number of agents reduces sample efficiency. Additionally, many existing architectures struggle with scalability, relying on a fixed structure tied to a specific number of agents, limiting their applicability to environments with a variable number of entities. While approaches such as graph neural networks (GNNs) and self-attention mechanisms have progressed in addressing these challenges, they have significant limitations as dense GNNs and self-attention mechanisms incur high computational costs. To overcome these limitations, we propose a novel agent network and a non-linear mixing network that ensure permutation-equivariance and scalability, allowing them to generalize to environments with various numbers of agents. Our agent network significantly reduces computational complexity, and our scalable hypernetwork enables efficient weight generation for non-linear mixing. Additionally, we introduce curriculum learning to improve training efficiency. Experiments on SMACv2 and Google Research Football (GRF) demonstrate that our approach achieves superior learning performance compared to existing methods. By addressing both permutation-invariance and scalability in MARL, our work provides a more efficient and adaptable framework for cooperative MARL. Our code is available at https://github.com/funny-rl/SPECTra.
中文: 合作多智能体强化学习面临状态空间指数增长和可扩展性受限的挑战,我们提出的置换等变且可扩展的网络通过降低计算成本和提高训练效率解决了这些问题,在基准测试中实现了优越性能。
English: Cooperative multi-agent reinforcement learning faces challenges with exponential state space growth and limited scalability, which our proposed permutation-equivariant and scalable networks overcome by reducing computational costs and improving training efficiency, achieving superior performance in benchmarks.
Authors:Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
Abstract:
We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images, from which we extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models. Our code is available at https://github.com/Diffusion-RLHF/RPO.
中文摘要:Rich Preference Optimization (RPO)提出了一种创新流程,通过详细图像评析生成可操作的编辑指令,创建出增强型偏好对,相比传统基于奖励的方法能更有效地微调文生图扩散模型。
English Summary: Rich Preference Optimization (RPO) introduces a novel pipeline that uses detailed image critiques to generate actionable editing instructions, creating enhanced preference pairs for fine-tuning text-to-image diffusion models more effectively than traditional reward-based methods.
Authors:Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao
Abstract:
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at https://github.com/liushunyu/awesome-direct-preference-optimization.
中文摘要:本综述系统梳理了直接偏好优化(DPO)方法,提出了新颖的分类框架并对其变体进行实证分析,同时探讨了这种高效语言模型对齐技术的实际应用与发展前景。
English Summary: This survey provides a comprehensive overview of Direct Preference Optimization (DPO), presenting a novel taxonomy and empirical analysis of its variants while discussing applications and future directions for this streamlined LLM alignment method.
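For reference, the standard DPO objective that the surveyed variants build on is a logistic loss on the difference of policy-versus-reference log-ratios for chosen and rejected responses; a minimal sketch follows.

```python
# A minimal sketch of the standard DPO loss over paired preference data.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs: (batch,) summed log-probabilities of full responses."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Usage with toy log-probabilities.
b = torch.randn(4)
print(dpo_loss(b + 1.0, b - 1.0, b, b))
```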
Authors:Piotr Bialas, Piotr Korcyl, Tomasz Stebel, Dawid Zapolski
Abstract:
We present the NeuMC software package, based on PyTorch, aimed at facilitating research on neural samplers in lattice field theories. Neural samplers based on normalizing flows are becoming increasingly popular in the context of Monte-Carlo simulations as they can effectively approximate target probability distributions, possibly alleviating some shortcomings of the Markov chain Monte-Carlo methods. Our package provides tools to create such samplers for two-dimensional field theories.
Chinese: NeuMC 软件包基于 PyTorch,为二维场论中的神经采样器研究提供工具,利用归一化流有效逼近目标概率分布,以改进传统蒙特卡洛方法的不足。
English: The NeuMC software package, built on PyTorch, supports neural sampler research for lattice field theories by offering tools to develop normalizing flow-based samplers that approximate target distributions and address limitations of traditional Monte-Carlo methods.
Authors:Tobias Morocutti, Florian Schmid, Jonathan Greif, Francesco Foscarin, Gerhard Widmer
Abstract:
We target the problem of developing new low-complexity networks for the sound event detection task. Our goal is to meticulously analyze the performance-complexity trade-off, aiming to be competitive with the large state-of-the-art models, at a fraction of the computational requirements. We find that low-complexity convolutional models previously proposed for audio tagging can be effectively adapted for event detection (which requires frame-wise prediction) by adjusting convolutional strides, removing the global pooling, and, importantly, adding a sequence model before the (now frame-wise) classification heads. Systematic experiments reveal that the best choice for the sequence model type depends on which complexity metric is most important for the given application. We also investigate the impact of enhanced training strategies such as knowledge distillation. In the end, we show that combined with an optimized training strategy, we can reach event detection performance comparable to state-of-the-art transformers while requiring only around 5% of the parameters. We release all our pre-trained models and the code for reproducing this work to support future research in low-complexity sound event detection at https://github.com/theMoro/EfficientSED.
中文摘要:本研究通过调整卷积步幅、移除全局池化及添加序列模型等结构改进,将音频标记模型有效适配于声音事件检测任务,结合优化训练策略后仅用约5%参数量即可达到与顶尖Transformer模型相当的性能。
English Summary: This study develops low-complexity networks for sound event detection by adapting audio tagging models through architectural modifications and training optimizations, achieving performance comparable to state-of-the-art transformers with only 5% of parameters.
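The adaptation recipe in the abstract, keeping a small convolutional front end, dropping global pooling so the time axis survives, and adding a sequence model before a frame-wise head, is sketched below. The channel sizes, strides, and the choice of a GRU are illustrative, not the released models' configuration.

```python
# A minimal sketch of a frame-wise sound event detection model: CNN + GRU + linear head.
import torch
import torch.nn as nn

class FrameWiseSED(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # downsample frequency, keep time fine-grained
            nn.Conv2d(1, 32, 3, stride=(2, 1), padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=(2, 1), padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.rnn = nn.GRU(64 * (n_mels // 4), hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel):                             # mel: (B, 1, n_mels, T)
        f = self.cnn(mel)                               # (B, 64, n_mels/4, T)
        B, C, Freq, T = f.shape
        f = f.permute(0, 3, 1, 2).reshape(B, T, C * Freq)
        f, _ = self.rnn(f)
        return self.head(f)                             # (B, T, n_classes) frame-wise logits

print(FrameWiseSED()(torch.randn(2, 1, 64, 250)).shape)  # torch.Size([2, 250, 10])
```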
Authors:Suchanun Piriyasatit, Ercan Engin Kuruoglu, Mehmet Sinan Ozeren
Abstract:
Earthquake detection is essential for earthquake early warning (EEW) systems. Traditional methods struggle with low signal-to-noise ratios and single-station reliance, limiting their effectiveness. We propose a Spatio-Temporal Graph Convolutional Network (GCN) using Spectral Structure Learning Convolution (Spectral SLC) to model static and dynamic relationships across seismic stations. Our approach processes multi-station waveform data and generates station-specific detection probabilities. Experiments show superior performance over a conventional GCN baseline in terms of true positive rate (TPR) and false positive rate (FPR), highlighting its potential for robust multi-station earthquake detection. The code repository for this study is available at https://github.com/SuchanunP/eq_detector.
Chinese: 本研究提出了一种采用谱结构学习卷积的时空图卷积网络,通过建模多台站关系来改进地震检测,实验证明其在真阳性率和假阳性率上优于传统方法。
English: The study introduces a Spatio-Temporal Graph Convolutional Network with Spectral Structure Learning Convolution to enhance earthquake detection by modeling multi-station relationships, demonstrating superior performance in true and false positive rates compared to traditional methods.
Authors:Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Abstract:
This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and ranges of the attribute values. To assess the influence of visual uncertainties on these symbolic analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smoothen the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite spending 3.4x more reasoning tokens. A similar trend is also observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, a neuro-symbolic probabilistic abductive model, ARLC, that achieves state-of-the-art performances on I-RAVEN, can robustly reason under all these out-of-distribution tests, maintaining strong accuracy with only a modest accuracy reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
中文摘要:本研究基于瑞文推理测验评估OpenAI的o3-mini和DeepSeek R1的类比推理能力,发现在视觉不确定性条件下两者准确率急剧下降,而神经符号模型ARLC仍保持稳定表现。
English summary: This study evaluates OpenAI's o3-mini and DeepSeek R1 on analogical reasoning using Raven's matrices, finding their accuracy drops sharply under visual uncertainty conditions while the neuro-symbolic ARLC model maintains robust performance.
Authors:Rachel S. Y. Teo, Tan M. Nguyen
Abstract:
Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). The prevalence of data coupled with computational resources has led to large models with a considerable number of parameters. While the massive size of these models has led to remarkable success in many NLP tasks, a detriment is the expense required to retrain all the base model's parameters for the adaptation to each task or domain. Parameter Efficient Fine-Tuning (PEFT) provides an effective solution for this challenge by minimizing the number of parameters required to be fine-tuned while maintaining the quality of the model. While existing methods have achieved impressive results, they mainly focus on adapting a subset of parameters, weight reparameterization, and prompt engineering. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark as well as the End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex.
中文: 大规模预训练模型在任务适应时面临高昂的重训练成本,而提出的层专家混合方法通过将模型层作为专家组合,以最小计算开销增强结构知识交换,实现了高效微调。
English: Large-scale pre-trained models face high retraining costs for task adaptation, but the proposed Mixture of Layer Experts (MoLEx) method enables efficient fine-tuning by combining layers as experts to enhance structural knowledge exchange with minimal computational overhead.
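A sparse "mixture of layers" in the spirit of the abstract can be sketched as a router that mixes the current layer's output with the output of one other frozen layer. The top-1 routing and the fixed mixing coefficient are illustrative assumptions, not MoLEx's exact design.

```python
# A minimal sketch of routing among pre-trained layers treated as experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerMixture(nn.Module):
    def __init__(self, layers, dim, alpha=0.5):
        super().__init__()
        self.layers = nn.ModuleList(layers)      # frozen pre-trained layers as "experts"
        self.router = nn.Linear(dim, len(layers))
        self.alpha = alpha

    def forward(self, h, current_idx):
        """h: (B, T, dim) hidden states entering layer `current_idx`."""
        base = self.layers[current_idx](h)
        scores = F.softmax(self.router(h.mean(dim=1)), dim=-1)   # (B, n_layers)
        expert_idx = scores.argmax(dim=-1)                       # top-1 expert per sequence
        mixed = torch.stack([self.layers[i](h[b:b + 1]).squeeze(0)
                             for b, i in enumerate(expert_idx.tolist())])
        return self.alpha * base + (1 - self.alpha) * mixed

layers = [nn.Sequential(nn.Linear(32, 32), nn.GELU()) for _ in range(4)]
out = LayerMixture(layers, dim=32)(torch.randn(2, 5, 32), current_idx=1)
print(out.shape)                                                  # torch.Size([2, 5, 32])
```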
Authors:Haihong Zhao, Chenyi Zi, Aochuan Chen, Jia Li
Abstract:
Graph learning plays a vital role in mining and analyzing complex relationships involved in graph data, which is widely used in many real-world applications like transaction networks and communication networks. Foundation models in CV and NLP have shown powerful cross-domain capabilities that are also significant in graph domains. However, existing graph learning approaches struggle with cross-domain tasks. Inspired by successes in CV and NLP, cross-domain graph learning has once again become a focal point of attention to realizing true graph foundation models. In this survey, we present a comprehensive review and analysis of existing works on cross-domain graph learning. Concretely, we first propose a new taxonomy, categorizing existing approaches based on the learned cross-domain information: structure, feature, and structure-feature mixture. Next, we systematically survey representative methods in these categories. Finally, we discuss the remaining limitations of existing studies and highlight promising avenues for future research. Relevant papers are summarized and will be consistently updated at: https://github.com/cshhzhao/Awesome-Cross-Domain-Graph-Learning.
中文: 本文系统综述了跨领域图学习方法,按学习信息类型进行分类,并探讨了推动图基础模型发展的潜力与未来方向。
English: This survey comprehensively reviews cross-domain graph learning, categorizing methods by learned information types and analyzing their potential to advance graph foundation models.
Authors:Worameth Chinchuthakun, Tossaporn Saengja, Nontawat Tritrong, Pitchaporn Rewatbowornwong, Pramook Khungurn, Supasorn Suwajanakorn
Abstract:
While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering-normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, improving successful edits while preserving the background. Users also preferred our method over state-of-the-art techniques across three metrics, and by 58-64% overall.
中文摘要:本研究通过引入基于注意力的空间正则化和梯度过滤归一化两项改进,有效解决了分数蒸馏技术中的梯度变化问题,在图像编辑任务中实现了比现有最优方法更优的提示匹配度和背景保持效果。
English Summary: The proposed method introduces attention-based spatial regularization and gradient filtering-normalization to address gradient variations in score distillation techniques, achieving superior prompt fidelity and background preservation in image editing compared to state-of-the-art approaches.
Authors:Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy T. Feng, Caifeng Zou, Yu Sun, Nikola Kovachki, Zachary E. Ross, Katherine L. Bouman, Yisong Yue
Abstract:
Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce InverseBench, a framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics. With InverseBench, we benchmark 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. To facilitate further research and development, we open-source the codebase, along with datasets and pre-trained models, at https://devzhk.github.io/InverseBench/.
Authors:Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun
Abstract:
Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multimodal guidance with CFM, our model robustly preserves speaker-specific characteristics and enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements in LSE and FID score. Our code is available at https://github.com/Peter-SungwooCho/MAVFlow.
中文摘要:本研究提出一种条件流匹配零样本视听渲染器,通过双模态引导保持说话人特征一致性,有效提升跨语言视听翻译性能。
English Summary: This study introduces a conditional flow matching zero-shot audio-visual renderer that uses dual audio-visual guidance to maintain speaker consistency and enhance translation capabilities across languages.
Authors:Gaotang Li, Yuzhong Chen, Hanghang Tong
Abstract:
Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the superposition of contextual information and parametric memory, where highly influential attention heads simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JuICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JuICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JuICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JuICE in these settings. Our code is available at https://github.com/GaotangLi/JUICE.
中文: 本研究揭示了语言模型中注意力头能同时处理上下文与参数记忆,产生叠加效应,并提出了JuICE这一无需微调的测试时干预方法,有效解决知识冲突,在多种数据集和模型上实现了最优性能。
English: This study reveals that attention heads in language models can simultaneously handle both contextual and parametric knowledge, leading to a superposition effect, and introduces JuICE, a test-time intervention method that effectively resolves knowledge conflicts without fine-tuning, achieving state-of-the-art performance across diverse datasets and models.
Authors:Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, Marinka Zitnik
Abstract:
Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular, pharmacokinetic, and clinical levels, identifies contraindications based on patient comorbidities and concurrent medications, and tailors treatment strategies to individual patient characteristics. It retrieves and synthesizes evidence from multiple biomedical sources, assesses interactions between drugs and patient conditions, and refines treatment recommendations through iterative reasoning. It selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation. The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939 and validated clinical insights from Open Targets. TxAgent outperforms leading LLMs, tool-use models, and reasoning agents across five new benchmarks: DrugPC, BrandPC, GenericPC, TreatmentPC, and DescriptionPC, covering 3,168 drug reasoning tasks and 456 personalized treatment scenarios. It achieves 92.1% accuracy in open-ended drug reasoning tasks, surpassing GPT-4o and outperforming DeepSeek-R1 (671B) in structured multi-step reasoning. TxAgent generalizes across drug name variants and descriptions. By integrating multi-step inference, real-time knowledge grounding, and tool-assisted decision-making, TxAgent ensures that treatment recommendations align with established clinical guidelines and real-world evidence, reducing the risk of adverse events and improving therapeutic decision-making.
中文: TxAgent是一种人工智能系统,通过多步骤推理和实时检索211种工具的医学知识,提供个性化药物相互作用分析和治疗建议,在药物推理任务中达到92.1%的准确率,性能超过GPT-4o等领先模型。
English: TxAgent is an AI system that utilizes multi-step reasoning and real-time biomedical data from 211 tools to provide personalized drug interaction analysis and treatment recommendations, achieving 92.1% accuracy in drug reasoning tasks while outperforming leading models like GPT-4o.
Authors:Zhuoyan Xu, Khoi Duc Nguyen, Preeti Mukherjee, Saurabh Bagchi, Somali Chaterji, Yingyu Liang, Yin Li
Abstract:
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent effort on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.
Authors:Pedro Pessoa, Paul Campitelli, Douglas P. Shepherd, S. Banu Ozkan, Steve Pressé
Abstract:
State space models, such as Mamba, have recently garnered attention in time series forecasting due to their ability to capture sequence patterns. However, in electricity consumption benchmarks, Mamba forecasts exhibit a mean error of approximately 8%. Similarly, in traffic occupancy benchmarks, the mean error reaches 18%. This discrepancy leaves us wondering whether the prediction is simply inaccurate or falls within the expected error given the spread in historical data. To address this limitation, we propose a method to quantify the predictive uncertainty of Mamba forecasts. Here, we propose a dual-network framework based on the Mamba architecture for probabilistic forecasting, where one network generates point forecasts while the other estimates predictive uncertainty by modeling variance. We abbreviate our tool, Mamba with probabilistic time series forecasting, as Mamba-ProbTSF, and the code for its implementation is available on GitHub (https://github.com/PessoaP/Mamba-ProbTSF). Evaluating this approach on synthetic and real-world benchmark datasets, we find the Kullback-Leibler divergence between the learned distributions and the data--which, in the limit of infinite data, should converge to zero if the model correctly captures the underlying probability distribution--reduced to the order of $10^{-3}$ for synthetic data and $10^{-1}$ for the real-world benchmarks, demonstrating its effectiveness. We find that in both the electricity consumption and traffic occupancy benchmarks, the true trajectory stays within the predicted uncertainty interval at the two-sigma level about 95% of the time. We end with a consideration of potential limitations, adjustments to improve performance, and considerations for applying this framework to processes with purely or largely stochastic dynamics where the stochastic changes accumulate, as observed for example in pure Brownian motion or molecular dynamics trajectories.
Chinese: 作者提出了Mamba-ProbTSF,一种基于Mamba架构的双网络框架,既能生成点预测又能量化预测不确定性,在合成和真实世界基准测试中表现出色,KL散度显著降低且不确定性区间准确覆盖真实轨迹。
English: The authors introduce Mamba-ProbTSF, a dual-network framework based on the Mamba architecture that generates point forecasts and quantifies predictive uncertainty, demonstrating effectiveness in synthetic and real-world benchmarks with reduced Kullback-Leibler divergence and accurate uncertainty intervals.
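A minimal sketch of the dual-output idea, with the Mamba backbone replaced by a GRU purely for illustration: one head predicts the point forecast, the other a log-variance, trained with a Gaussian negative log-likelihood; the two-sigma coverage check mirrors the evaluation described above. Layer sizes and the horizon are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class DualHeadForecaster(nn.Module):
    """Point forecast plus predictive variance from a shared encoder.
    The Mamba backbone is replaced by a GRU here for illustration only."""
    def __init__(self, d_in=1, d_hidden=64, horizon=24):
        super().__init__()
        self.encoder = nn.GRU(d_in, d_hidden, batch_first=True)
        self.mean_head = nn.Linear(d_hidden, horizon)
        self.logvar_head = nn.Linear(d_hidden, horizon)

    def forward(self, x):                       # x: (B, T, d_in)
        _, h = self.encoder(x)                  # h: (1, B, d_hidden)
        h = h.squeeze(0)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    """Negative log-likelihood of the target under N(mean, exp(logvar))."""
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

def two_sigma_coverage(mean, logvar, target):
    """Fraction of true values inside mean +/- 2*std."""
    std = (0.5 * logvar).exp()
    inside = (target > mean - 2 * std) & (target < mean + 2 * std)
    return inside.float().mean()
```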
Authors:Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam J. Stewart, Zhitong Xiong, Xiao Xiang Zhu, Stefan Bauer, John Chuang
Abstract:
Earth observation (EO) data features diverse sensing platforms with varying spectral bands, spatial resolutions, and sensing modalities. While most prior work has constrained inputs to fixed sensors, a new class of any-sensor foundation models able to process arbitrary sensors has recently emerged. Contributing to this line of work, we propose Panopticon, an any-sensor foundation model built on the DINOv2 framework. We extend DINOv2 by (1) treating images of the same geolocation across sensors as natural augmentations, (2) subsampling channels to diversify spectral input, and (3) adding a cross attention over channels as a flexible patch embedding mechanism. By encoding the wavelength and modes of optical and synthetic aperture radar sensors, respectively, Panopticon can effectively process any combination of arbitrary channels. In extensive evaluations, we achieve state-of-the-art performance on GEO-Bench, especially on the widely-used Sentinel-1 and Sentinel-2 sensors, while out-competing other any-sensor models, as well as domain adapted fixed-sensor models on unique sensor configurations. Panopticon enables immediate generalization to both existing and future satellite platforms, advancing sensor-agnostic EO.
中文摘要:Panopticon是基于DINOv2框架构建的多传感器基础模型,通过光谱通道子采样和跨注意力机制处理任意组合的遥感数据,在多种传感器上实现最优性能,并能泛化至未来卫星平台。
English Summary: Panopticon is an any-sensor foundation model based on DINOv2 that processes diverse Earth observation data through spectral channel subsampling and cross-attention mechanisms, achieving state-of-the-art performance across multiple sensors while enabling generalization to future satellite platforms.
Authors:Yafei Zhang, Murray Wang, Yu Wang, Xiaohui Wang
Abstract:
Matching job descriptions (JDs) with suitable talent requires models capable of understanding not only textual similarities between JDs and candidate resumes but also contextual factors such as geographical location and academic seniority. To address this challenge, we propose a two-stage training framework for large language models (LLMs). In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules, such as geographical alignment and research area overlap. While effective, this model primarily learns patterns defined by the matching rules. In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO), termed Rank Preference Optimization (RankPO), to align the model with AI-curated pairwise preferences emphasizing textual understanding. Our experiments show that while the first-stage model achieves strong performance on rule-based data (nDCG@20 = 0.706), it lacks robust textual understanding (alignment with AI annotations = 0.46). By fine-tuning with RankPO, we achieve a balanced model that retains relatively good performance in the original tasks while significantly improving the alignment with AI preferences. The code and data are available at https://github.com/yflyzhang/RankPO.
中文: 本文提出一个两阶段训练框架,先通过基于现实匹配规则的对比学习训练大语言模型,再采用新型排序偏好优化方法强化文本理解能力,最终获得既能保持规则匹配性能、又显著提升与AI偏好对齐度的平衡模型。
English: This paper introduces a two-stage training framework for large language models that first uses contrastive learning based on real-world matching rules and then applies a novel Rank Preference Optimization method to enhance textual understanding, resulting in a balanced model that maintains rule-based performance while significantly improving alignment with AI preferences.
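The abstract does not spell out the RankPO objective, so the sketch below shows the standard DPO-style pairwise loss it is described as being inspired by: the policy is pushed to prefer the AI-chosen candidate over the rejected one relative to a frozen reference model. The beta temperature and the summed log-probability inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss over (chosen, rejected) candidate pairs.
    Inputs are summed log-probabilities under the policy and a frozen
    reference model; beta is an assumed temperature."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy example with a batch of two preference pairs:
loss = pairwise_preference_loss(
    torch.tensor([-12.3, -9.8]), torch.tensor([-14.1, -9.5]),
    torch.tensor([-12.0, -10.0]), torch.tensor([-13.5, -10.2]))
```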
Authors:Fan Lyu, Tianle Liu, Zhang Zhang, Fuyuan Hu, Liang Wang
Abstract:
We introduce Test-Time Discovery (TTD) as a novel task that addresses class shifts during testing, requiring models to simultaneously identify emerging categories while preserving previously learned ones. A key challenge in TTD is distinguishing newly discovered classes from those already identified. To address this, we propose a training-free, hash-based memory mechanism that enhances class discovery through fine-grained comparisons with past test samples. Leveraging the characteristics of unknown classes, our approach introduces hash representation based on feature scale and directions, utilizing Locality-Sensitive Hashing (LSH) for efficient grouping of similar samples. This enables test samples to be easily and quickly compared with relevant past instances. Furthermore, we design a collaborative classification strategy, combining a prototype classifier for known classes with an LSH-based classifier for novel ones. To enhance reliability, we incorporate a self-correction mechanism that refines memory labels through hash-based neighbor retrieval, ensuring more stable and accurate class assignments. Experimental results demonstrate that our method achieves good discovery of novel categories while maintaining performance on known classes, establishing a new paradigm in model testing. Our code is available at https://github.com/fanlyu/ttd.
中文摘要:本文提出测试时发现(TTD)方法,通过基于哈希的内存机制和协同分类策略,在测试阶段有效识别新类别同时保持已知类别性能,无需重新训练模型。
English Summary: The paper introduces Test-Time Discovery (TTD), a training-free method using hash-based memory and collaborative classification to identify new classes during testing while maintaining known class performance.
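As a rough illustration of the hash-based memory, the sketch below uses random-hyperplane Locality-Sensitive Hashing over L2-normalized features to bucket past test samples for fast neighbor lookup. The specific scale-and-direction hash, the self-correction step, and the prototype classifier from the paper are not reproduced, and the number of hash bits is an arbitrary assumption.

```python
import numpy as np
from collections import defaultdict

class LSHMemory:
    """Random-hyperplane LSH memory for grouping similar test features."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)             # hash key -> [(feature, label), ...]

    def _hash(self, feat):
        feat = feat / (np.linalg.norm(feat) + 1e-8)  # direction-only signature
        bits = (self.planes @ feat > 0).astype(np.uint8)
        return bits.tobytes()

    def add(self, feat, label):
        self.buckets[self._hash(feat)].append((feat, label))

    def neighbors(self, feat):
        """Past samples sharing the query's bucket (candidates for comparison)."""
        return self.buckets.get(self._hash(feat), [])
```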
Authors:Ho Kei Cheng, Alexander Schwing
Abstract:
Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to a subpar performance. To bridge this gap, we propose conditional optimal transport C^2OT that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at https://hkchengrex.github.io/C2OT
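A minimal sketch of conditioning the minibatch pairing cost, assuming each noise sample carries the condition label of the data point it was independently drawn with and that cross-condition pairings are penalized by an additive weight `w`; this is an illustrative reading of the conditional weighting term, not the paper's exact cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def conditional_ot_pairing(x0, x1, c0, c1, w=10.0):
    """Pair noise samples x0 with data samples x1 for flow-matching training.
    Cost = squared distance + w * [conditions differ]; w is an assumed weight.
    x0, x1: (N, D) arrays; c0, c1: (N,) discrete condition labels."""
    dist = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    mismatch = (c0[:, None] != c1[None, :]).astype(float)
    rows, cols = linear_sum_assignment(dist + w * mismatch)
    return x0[rows], x1[cols]

# Toy usage on random 2-D data with binary conditions:
rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal((8, 2)), rng.standard_normal((8, 2))
c0, c1 = rng.integers(0, 2, 8), rng.integers(0, 2, 8)
paired_x0, paired_x1 = conditional_ot_pairing(x0, x1, c0, c1)
```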
Authors:Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
Abstract:
Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at https://github.com/VILA-Lab/M-Attack.
中文: 现有针对黑盒商业大视觉语言模型的定向攻击常因扰动缺乏语义细节而失败,而我们的方法通过局部裁剪和对齐将对抗性修改集中于语义丰富区域,显著提升了迁移性,在GPT-4.5和Claude-3.7等模型上成功率超90%。
English: Current targeted attacks on black-box commercial large vision-language models often fail due to their uniform perturbations lacking semantic clarity, but our method enhances transferability by focusing adversarial modifications on semantically rich regions through local cropping and alignment, achieving over 90% success on models like GPT-4.5 and Claude-3.7.
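A minimal sketch of one optimization step of the crop-resize-align idea: a random resized crop of the adversarial image is encoded and pulled toward the target embedding via cosine similarity. The `encoder`, `target_emb`, crop ranges, and the simple clamp to [0, 1] (instead of a proper L-infinity projection around the source image) are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import RandomResizedCrop

def attack_step(adv, target_emb, encoder, optimizer,
                scale=(0.5, 1.0), ratio=(0.8, 1.25), out_size=224):
    """One local-crop alignment step on the adversarial image `adv` (C, H, W),
    a leaf tensor with requires_grad=True registered in `optimizer`."""
    view = RandomResizedCrop(out_size, scale=scale, ratio=ratio)(adv)
    emb = encoder(view.unsqueeze(0))                       # (1, D) embedding of the crop
    loss = -F.cosine_similarity(emb, target_emb).mean()    # pull the crop toward the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        adv.clamp_(0.0, 1.0)                               # keep a valid image (simplified)
    return loss.item()
```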
Authors:Tianjiao Yu, Vedant Shah, Muntasir Wahed, Kiet A. Nguyen, Adheesh Juvekar, Tal August, Ismini Lourentzou
Abstract:
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents in calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thoughts, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
Authors:Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Abstract:
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
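The DyT operation above is simple enough to state in a few lines; the sketch below adds the per-channel affine weight and bias that the replaced LayerNorm would normally carry, which goes slightly beyond the formula quoted in the abstract, and the initial value of alpha is an arbitrary choice.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) with a learnable scalar alpha,
    plus the usual affine weight/bias, as a drop-in replacement for LayerNorm."""
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
```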
Authors:Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert
Abstract:
Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) leads to highly increased instruction-data separation without a loss in model utility and (2) makes the models more robust to prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at https://github.com/egozverev/aside.
中文摘要:本研究提出ASIDE架构改进方案,通过对数据令牌嵌入进行正交旋转实现指令与数据的清晰分离,在不影响模型性能的前提下显著增强语言模型对提示注入攻击的防御能力。
English Summary: The study introduces ASIDE, an architectural enhancement that improves language model security by orthogonally rotating data token embeddings to clearly separate instructions from data, thereby increasing robustness against prompt injection attacks without compromising performance.
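Since ASIDE is described as an orthogonal rotation applied only to data-token embeddings, the sketch below applies a random orthogonal matrix (built via QR decomposition, an illustrative choice rather than the paper's construction) to the embeddings selected by a data-token mask.

```python
import torch

def rotate_data_embeddings(embeddings, is_data, rotation):
    """Rotate the embeddings of data tokens, leaving instruction tokens untouched.
    embeddings: (B, T, D); is_data: (B, T) bool mask; rotation: (D, D) orthogonal."""
    rotated = embeddings @ rotation.T
    return torch.where(is_data.unsqueeze(-1), rotated, embeddings)

D = 64
Q, _ = torch.linalg.qr(torch.randn(D, D))     # a random orthogonal matrix (illustrative)
emb = torch.randn(2, 10, D)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, 5:] = True                            # pretend the last 5 tokens are data
out = rotate_data_embeddings(emb, mask, Q)
```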
Authors:Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
Abstract:
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/
中文: AudioX作为一种统一的扩散变换器模型,通过多模态掩码训练策略和构建大规模数据集,实现了任意内容到音频的高质量生成,既能处理文本、视频等多种输入,又超越了专业模型的性能表现。
English: AudioX is a unified Diffusion Transformer model that overcomes limitations in audio and music generation by enabling anything-to-audio conversion with flexible natural language control and robust cross-modal learning, while newly curated datasets address data scarcity issues.
Authors:Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, Aliaksandr Hubin
Abstract:
Modeling natural phenomena with artificial neural networks (ANNs) often provides highly accurate predictions. However, ANNs often suffer from over-parameterization, complicating interpretation and raising uncertainty issues. Bayesian neural networks (BNNs) address the latter by representing weights as probability distributions, allowing for predictive uncertainty evaluation. Latent binary Bayesian neural networks (LBBNNs) further handle structural uncertainty and sparsify models by removing redundant weights. This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded, simplifying networks and clarifying input impacts on predictions. Ultimately, a linear model or even a constant can be found to be optimal for a specific problem at hand. Furthermore, the input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones, while still maintaining high predictive accuracy and uncertainty measurement. For example, on MNIST, we reached 97% accuracy with well-calibrated uncertainty using just 935 weights, achieving state-of-the-art compression of neural networks. Furthermore, the proposed method accurately identifies the true covariates and adjusts for system non-linearity. The main contribution is the introduction of active paths, enhancing directly designed global and local explanations within the LBBNN framework, which have theoretical guarantees and do not require post hoc external tools for explanations.
中文: 本文提出一种改进的潜在二元贝叶斯神经网络(LBBNN),允许协变量跨层连接或排除,在保持高精度和不确定性量化的同时实现超过99%的网络稀疏度,并提供无需外部工具的理论可解释性。
English: The article introduces an enhanced Latent Binary Bayesian Neural Network (LBBNN) that allows covariates to skip layers or be excluded, achieving over 99% network sparsity while maintaining high accuracy and uncertainty quantification, and providing built-in theoretical explanations without external tools.
Authors:Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
Abstract:
This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning.
Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1.
Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization.
Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
中文: 本文介绍Light-R1开源框架,它采用经济高效的课程学习方法和公开数据训练长推理模型,在数学推理上达到先进水平并展现强大跨领域泛化能力。
English: This paper presents Light-R1, an open-source framework for training long reasoning models using a cost-effective curriculum approach with public data, achieving state-of-the-art performance in math reasoning and strong cross-domain generalization.
Authors:Matteo Gambella, Fabrizio Pittorino, Manuel Roveri
Abstract:
Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A$^2$M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60% on CIFAR-10, +4.60% on CIFAR-100, and +3.64% on ImageNet16-120, demonstrating its superior effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.
中文: 本文提出架构感知最小化(A²M)这一创新框架,通过引导梯度朝向架构空间中的平坦最小值来增强神经架构搜索,在多个基准数据集上显著提升了泛化性能。
English: This paper introduces Architecture-Aware Minimization (A²M), a novel framework that enhances neural architecture search by guiding gradients toward flat minima in architecture space, significantly improving generalization across benchmark datasets.
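A²M is analytically derived in the paper; the sketch below only conveys the general idea of biasing architecture parameters toward flat regions with a SAM-style two-step update on the DARTS alphas. The perturbation radius `rho` and the `arch_loss_fn` closure are assumptions, and this is not the paper's exact update rule.

```python
import torch

def flatness_aware_arch_step(alphas, arch_loss_fn, optimizer, rho=0.05):
    """One sharpness-aware update of architecture parameters `alphas`
    (list of tensors requiring grad, registered in `optimizer`).
    `arch_loss_fn()` recomputes the validation loss with the current alphas."""
    loss = arch_loss_fn()
    grads = torch.autograd.grad(loss, alphas)
    norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():                      # climb to the nearby worst case
        for a, e in zip(alphas, eps):
            a.add_(e)
    optimizer.zero_grad()
    arch_loss_fn().backward()                  # gradient at the perturbed point
    with torch.no_grad():                      # undo the perturbation
        for a, e in zip(alphas, eps):
            a.sub_(e)
    optimizer.step()                           # descend with the sharpness-aware gradient
```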
Authors:Zhen Zhang, Meihan Liu, Bingsheng He
Abstract:
Graph domain adaptation has emerged as a promising approach to facilitate knowledge transfer across different domains. Recently, numerous models have been proposed to enhance their generalization capabilities in this field. However, there is still no unified library that brings together existing techniques and simplifies their implementation. To fill this gap, we introduce PyGDA, an open-source Python library tailored for graph domain adaptation. As the first comprehensive library in this area, PyGDA covers more than 20 widely used graph domain adaptation methods together with different types of graph datasets. Specifically, PyGDA offers modular components, enabling users to seamlessly build custom models with a variety of commonly used utility functions. To handle large-scale graphs, PyGDA includes support for features such as sampling and mini-batch processing, ensuring efficient computation. In addition, PyGDA also includes comprehensive performance benchmarks and well-documented user-friendly API for both researchers and practitioners. To foster convenient accessibility, PyGDA is released under the MIT license at https://github.com/pygda-team/pygda, and the API documentation is https://pygda.readthedocs.io/en/stable/.
中文: PyGDA作为首个开源图域自适应综合库,集成了20多种主流方法并提供模块化组件与可扩展功能,填补了该领域统一工具库的空白。
English: PyGDA is an open-source Python library that unifies over 20 graph domain adaptation methods with modular components and scalable features to bridge the gap in standardized implementations for cross-domain knowledge transfer.
Authors:Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
Abstract:
Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To this end, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
中文: Gumiho是一种混合推测解码模型,结合了串行与并行头部,通过为早期令牌使用复杂Transformer提升准确性,后期令牌采用轻量级MLP提高效率,从而在性能上超越现有方法。
English: Gumiho is a hybrid speculative decoding model that combines serial and parallel heads, using a sophisticated Transformer for early tokens to boost accuracy and lightweight MLPs for later tokens to enhance efficiency, outperforming existing methods.
Authors:Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, Daniel Seita
Abstract:
Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g., brushing a soft pillow) to more dangerous (e.g., toppling a glass vase). Due to this diversity, it is difficult to characterize which contacts may be acceptable or unacceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach uses the VLM's outputs to produce a dense 3D "cost map" that encodes contact tolerances and seamlessly integrates with standard motion planners. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3620 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Supplementary material is available at https://impact-planning.github.io/.
Authors:Jiawei Zhang, Ziyuan Liu, Leon Yan, Gen Li, Yuantao Gu
Abstract:
Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
中文摘要:本文提出了一种规范形式来统一基于扩散的逆问题算法,并设计了可学习线性外推(LLE)方法,能够普遍提升这些算法在有限步数下的性能表现。
English Summary: This paper introduces a canonical form to unify diffusion-based inverse algorithms and proposes the Learnable Linear Extrapolation (LLE) method to universally enhance their performance with limited computational steps.
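A minimal sketch of the linear-extrapolation idea: keep the last few intermediate estimates produced by a diffusion-based inverse solver and combine them with learnable coefficients, initialized to reproduce the newest estimate. The window size and parameterization are assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class LearnableLinearExtrapolation(nn.Module):
    """Learnable linear combination of the last k solver estimates."""
    def __init__(self, k=3):
        super().__init__()
        w = torch.zeros(k)
        w[-1] = 1.0                  # start as the identity on the newest estimate
        self.weights = nn.Parameter(w)

    def forward(self, history):
        # history: list of k tensors (oldest first), all of the same shape.
        stacked = torch.stack(history, dim=0)                    # (k, ...)
        w = self.weights.view(-1, *([1] * (stacked.dim() - 1)))  # broadcastable weights
        return (w * stacked).sum(dim=0)
```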
Authors:Bharat Srikishan, Daniel O'Malley, Mohamed Mehana, Nicholas Lubbers, Nikhil Muralidhar
Abstract:
Modeling the evolution of physical systems is critical to many applications in science and engineering. As the evolution of these systems is governed by partial differential equations (PDEs), there are a number of computational simulations which resolve these systems with high accuracy. However, because these simulations incur high computational costs, they are infeasible to employ for large-scale analysis. A popular alternative to simulators are neural network surrogates which are trained in a data-driven manner and are much more computationally efficient. However, these surrogate models suffer from high rollout error when used autoregressively, especially when confronted with training data paucity. Existing work proposes to reduce surrogate rollout error by either including physical loss terms directly in the optimization of the model or incorporating computational simulators as "differentiable layers" in the neural network. Both of these approaches have their challenges, with physical loss functions suffering from slow convergence for stiff PDEs and simulator layers requiring gradients which are not always available, especially in legacy simulators. We propose the Hybrid PDE Predictor with Reinforcement Learning (HyPER) model: a model-agnostic, RL based, cost-aware model which combines a neural surrogate, RL decision model, and a physics simulator (with or without gradients) to reduce surrogate rollout error significantly. In addition to reducing in-distribution rollout error by 47%-78%, HyPER learns an intelligent policy that is adaptable to changing physical conditions and resistant to noise corruption. Code available at https://github.com/scailab/HyPER.
中文:HyPER模型通过结合神经代理、强化学习和物理模拟器,显著降低了物理系统建模中的滚动误差,实现了47%-78%的误差减少,并能适应变化条件及抵抗噪声干扰。
English: The HyPER model combines neural surrogates with reinforcement learning and physics simulators to significantly reduce rollout errors in physical system modeling, achieving 47%-78% error reduction while adapting to changing conditions and resisting noise.
Authors:Shiwon Kim, Dongjun Hwang, Sungwon Woo, Rita Singh
Abstract:
Class-incremental learning (CIL) aims to adapt to continuously emerging new classes while preserving knowledge of previously learned ones. Few-shot class-incremental learning (FSCIL) presents a greater challenge that requires the model to learn new classes from only a limited number of samples per class. While incremental learning typically assumes restricted access to past data, it often remains available in many real-world scenarios. This raises a practical question: should one retrain the model on the full dataset (i.e., joint training), or continue updating it solely with new data? In CIL, joint training is considered an ideal benchmark that provides a reference for evaluating the trade-offs between performance and computational cost. However, in FSCIL, joint training becomes less reliable due to severe imbalance between base and incremental classes. This results in the absence of a practical baseline, making it unclear which strategy is preferable for practitioners. To this end, we revisit joint training in the context of FSCIL by incorporating imbalance mitigation techniques, and suggest a new imbalance-aware joint training benchmark for FSCIL. We then conduct extensive comparisons between this benchmark and FSCIL methods to analyze which approach is most suitable when prior data is accessible. Our analysis offers realistic insights and guidance for selecting training strategies in real-world FSCIL scenarios. Code is available at: https://github.com/shiwonkim/Joint_FSCIL
中文: 本研究通过引入不平衡缓解技术重新审视了少样本类增量学习中的联合训练,提出了新的基准来比较策略,并为实际场景中可获取历史数据时提供实用指导。
English: The study revisits joint training in few-shot class-incremental learning by incorporating imbalance mitigation techniques, proposing a new benchmark to compare strategies and provide practical guidance when prior data is accessible.
Authors:Zijian Zhao, Xuming Zhang, Jiayu Wen, Mingwen Liu, Xiaoteng Ma
Abstract:
In financial trading, return prediction is one of the foundations of a successful trading system. With the rapid development of deep learning in areas such as graphics and natural language processing, it has also demonstrated a significant edge in handling financial data. However, while the success of deep learning relies on huge amounts of labeled samples, labeling each time point or event as profitable or unprofitable under transaction costs, especially in the high-frequency trading world, suffers from a serious label imbalance issue. In this paper, we adopt a rigorous end-to-end deep learning framework with comprehensive label imbalance adjustment methods and succeed in predicting high-frequency returns in the Chinese futures market. The code for our method is publicly available at https://github.com/RS2002/Label-Unbalance-in-High-Frequency-Trading.
中文: 本文采用严谨的端到端深度学习框架,结合全面的标签不平衡调整方法,成功实现了中国期货市场高频收益预测,解决了交易成本下标签不平衡的难题。
English: This paper introduces a rigorous end-to-end deep learning framework with comprehensive label imbalance adjustment methods to successfully predict high-frequency returns in the Chinese futures market, addressing the challenge of label imbalance under transaction costs.
Authors:Allison Andreyev
Abstract:
Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. Next, this study quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open-source LibriSpeech dataset, this paper evaluates the word error rate (WER) and latency of whispercpp under three quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45%, while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and edge device deployment possibilities. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
中文: 本研究分析了三种Whisper语音识别模型,发现量化技术在保持精度的同时显著降低了延迟和模型体积,为边缘设备部署提供了可行方案。
English: This study analyzes three Whisper ASR models, revealing that quantization techniques significantly reduce latency and model size while maintaining accuracy, offering practical solutions for edge device deployment.
Authors:Héctor Laria, Alexandra Gomez-Villa, Jiang Qin, Muhammad Atif Butt, Bogdan Raducanu, Javier Vazquez-Corral, Joost van de Weijer, Kai Wang
Abstract:
Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
中文摘要:ColorWave是一种无需训练的新方法,通过重构交叉注意力机制在扩散模型中实现精确的RGB色彩控制,无需微调即可超越现有方法的准确性和适用性。
English Summary: ColorWave is a training-free method that enables precise RGB color control in diffusion models by rewiring cross-attention mechanisms, outperforming previous approaches in accuracy and versatility without requiring fine-tuning.
Authors:Benjamin Towle, Xin Chen, Ke Zhou
Abstract:
Pre-trained segmentation models are a powerful and flexible tool for segmenting images. Recently, this trend has extended to medical imaging. Yet, often these methods only produce a single prediction for a given image, neglecting inherent uncertainty in medical images, due to unclear object boundaries and errors caused by the annotation tool. Multiple Choice Learning is a technique for generating multiple masks, through multiple learned prediction heads. However, this cannot readily be extended to produce more outputs than specified by its initial pre-training hyperparameters, as the sparse, winner-takes-all loss function makes it easy for one prediction head to become overly dominant, thus not guaranteeing the clinical relevancy of each mask produced. We introduce SeqSAM, a sequential, RNN-inspired approach to generating multiple masks, which uses a bipartite matching loss for ensuring the clinical relevancy of each mask, and can produce an arbitrary number of masks. We show notable improvements in the quality of each mask produced across two publicly available datasets. Our code is available at https://github.com/BenjaminTowle/SeqSAM.
中文: SeqSAM提出了一种基于循环神经网络启发的顺序方法,通过双向匹配损失生成多个临床相关的医学图像分割掩码,在两个公开数据集上显著提升了掩码质量。
English: SeqSAM introduces a sequential, RNN-inspired method that generates multiple clinically relevant masks for medical image segmentation using a bipartite matching loss, demonstrating improved mask quality across datasets.
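To illustrate the bipartite matching loss, the sketch below matches predicted masks to reference masks with the Hungarian algorithm on a soft-Dice cost and averages the matched costs; the cost choice is an assumption, since the abstract does not specify it.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_mask_loss(pred_masks, gt_masks, eps=1e-6):
    """Hungarian-matched soft-Dice loss.
    pred_masks: (P, H, W) probabilities; gt_masks: (G, H, W) binary masks."""
    P = pred_masks.shape[0]
    rows_cost = []
    for i in range(P):
        inter = (pred_masks[i].unsqueeze(0) * gt_masks).sum(dim=(1, 2))
        denom = pred_masks[i].sum() + gt_masks.sum(dim=(1, 2)) + eps
        rows_cost.append(1.0 - 2.0 * inter / denom)    # soft-Dice distance to every GT mask
    cost = torch.stack(rows_cost)                      # (P, G) cost matrix
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return cost[rows, cols].mean()                     # average cost over matched pairs
```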
Authors:William L. Tong, Cengiz Pehlevan
Abstract:
Equality reasoning is ubiquitous and purely abstract: sameness or difference may be evaluated no matter the nature of the underlying objects. As a result, same-different (SD) tasks have been extensively studied as a starting point for understanding abstract reasoning in humans and across animal species. With the rise of neural networks that exhibit striking apparent proficiency for abstractions, equality reasoning in these models has also gained interest. Yet despite extensive study, conclusions about equality reasoning vary widely and with little consensus. To clarify the underlying principles in learning SD tasks, we develop a theory of equality reasoning in multi-layer perceptrons (MLP). Following observations in comparative psychology, we propose a spectrum of behavior that ranges from conceptual to perceptual outcomes. Conceptual behavior is characterized by task-specific representations, efficient learning, and insensitivity to spurious perceptual details. Perceptual behavior is characterized by strong sensitivity to spurious perceptual details, accompanied by the need for exhaustive training to learn the task. We develop a mathematical theory to show that an MLP's behavior is driven by learning richness. Rich-regime MLPs exhibit conceptual behavior, whereas lazy-regime MLPs exhibit perceptual behavior. We validate our theoretical findings in vision SD experiments, showing that rich feature learning promotes success by encouraging hallmarks of conceptual behavior. Overall, our work identifies feature learning richness as a key parameter modulating equality reasoning, and suggests that equality reasoning in humans and animals may similarly depend on learning richness in neural circuits.
中文: 本研究提出一个理论,表明多层感知机在相等性推理中的行为受特征学习丰富度调控,可分为概念性行为和感知性行为,富学习模式实现概念化推理而惰性模式停留于感知层面,并通过视觉实验验证了这一发现。
English: Our study develops a theory showing that multi-layer perceptrons' equality reasoning behavior spans from conceptual to perceptual outcomes, driven by feature learning richness, with rich-regime models achieving conceptual behavior and lazy-regime ones remaining perceptual, validated through vision experiments.
Authors:Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai
Abstract:
This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules (attention and expert modules), leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen
中文: MoE-Gen提出模块化批处理系统,通过动态启动大批次操作优化GPU利用率,在MoE模型上实现比现有推理系统高8-31倍的吞吐量。
English: MoE-Gen introduces a module-based batching system that optimizes GPU utilization by dynamically launching large batches, achieving 8-31x higher throughput than existing inference systems on MoE models.
Authors:David P. Hofmeyr
Abstract:
In this paper we introduce a simple and intuitive adaptive k nearest neighbours classifier, and explore its utility within the context of bootstrap aggregating ("bagging"). The approach is based on finding discriminant subspaces which are computationally efficient to compute, and are motivated by enhancing the discrimination of classes through nearest neighbour classifiers. This adaptiveness promotes diversity of the individual classifiers fit across different bootstrap samples, and so further leverages the variance reducing effect of bagging. Extensive experimental results are presented documenting the strong performance of the proposed approach in comparison with Random Forest classifiers, as well as other nearest neighbours based ensembles from the literature, plus other relevant benchmarks. Code to implement the proposed approach is available in the form of an R package from https://github.com/DavidHofmeyr/BOPNN.
Chinese: 本文提出了一种简单直观的自适应k近邻分类器,通过计算高效的判别子空间增强类别区分度,在自助聚合中提升分类器多样性以超越随机森林及其他基准方法,并提供了可用的R软件包实现。
English: This paper presents a simple adaptive k-nearest neighbors classifier that enhances class discrimination through computationally efficient discriminant subspaces, promoting classifier diversity in bootstrap aggregation to outperform Random Forest and other benchmarks, with an available R package for implementation.
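For context, the sketch below shows plain bagging of k-nearest-neighbour classifiers with random feature subspaces using scikit-learn (version 1.2 or later for the `estimator` argument); the adaptive discriminant subspaces that are the paper's actual contribution are not reproduced here, so this is only a baseline illustrating the bagging side of the approach.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagged kNN: each bootstrap replicate sees a random half of the features,
# which promotes diversity among the individual nearest-neighbour classifiers.
model = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=50,
    max_features=0.5,
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(model, X, y, cv=5).mean())
```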
Authors:Nannan Wu, Zhuo Kuang, Zengqiang Yan, Ping Wang, Li Yu
Abstract:
Despite the potential of federated learning in medical applications, inconsistent imaging quality across institutions, stemming from lower-quality data from a minority of clients, biases federated models toward more common high-quality images. This raises significant fairness concerns. Existing fair federated learning methods have demonstrated some effectiveness in solving this problem by aligning a single 0th- or 1st-order state of convergence (e.g., training loss or sharpness). However, we argue in this work that fairness based on such a single state is still not an adequate surrogate for fairness during testing, as these single metrics fail to fully capture the convergence characteristics, making them suboptimal for guiding fair learning. To address this limitation, we develop a generalized framework. Specifically, we propose assessing convergence using multiple states, defined as sharpness or perturbed loss computed at varying search distances. Building on this comprehensive assessment, we propose promoting fairness for these states across clients to achieve our ultimate fairness objective. This is accomplished through the proposed method, FedISM+. In FedISM+, the search distance evolves over time, progressively focusing on different states. We then incorporate two components in local training and global aggregation to ensure cross-client fairness for each state. This gradually makes convergence equitable for all states, thereby improving fairness during testing. Our empirical evaluations, performed on the well-known RSNA ICH and ISIC 2019 datasets, demonstrate the superiority of FedISM+ over existing state-of-the-art methods for fair federated learning. The code is available at https://github.com/wnn2000/FFL4MIA.
中文: 提出的FedISM+框架通过评估多个收敛状态并动态调整搜索距离来解决联邦学习中的公平性问题,在医学影像数据集上相比现有方法展现出更优性能。
English: The proposed FedISM+ framework addresses fairness limitations in federated learning by evaluating multiple convergence states across clients and dynamically adjusting search distances during training, demonstrating superior performance on medical imaging datasets compared to existing methods.
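To illustrate what assessing convergence through multiple states can look like, the sketch below measures the loss increase under gradient-direction parameter perturbations at several radii; each radius yields one perturbed-loss state. The radii are fixed here for simplicity, whereas the method described above evolves the search distance over training, and the cross-client fairness machinery is not shown.

```python
import torch

def perturbed_loss_states(model, loss_fn, batch, radii=(0.01, 0.05, 0.1)):
    """Return {radius: perturbed_loss - base_loss} for gradient-direction
    perturbations of the model parameters (a simple sharpness proxy)."""
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]
    base = loss_fn(model(x), y)
    grads = torch.autograd.grad(base, params)
    gnorm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    states = {}
    for r in radii:
        eps = [r * g / gnorm for g in grads]
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.add_(e)                               # step to the perturbed point
            states[r] = (loss_fn(model(x), y) - base).item()
            for p, e in zip(params, eps):
                p.sub_(e)                               # restore the original weights
    return states
```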
Authors:Philippe Chlenski, Kaizhu Du, Dylan Satow, Raiyan R. Khan, Itsik Pe'er
Abstract:
We present Manify, an open-source Python library for non-Euclidean representation learning. Leveraging manifold learning techniques, Manify provides tools for learning embeddings in (products of) non-Euclidean spaces, performing classification and regression with data that lives in such spaces, estimating the curvature of a manifold, and more. Manify aims to advance research and applications in machine learning by offering a comprehensive suite of tools for manifold-based data analysis. Our source code, examples, and documentation are available at https://github.com/pchlenski/manify.
中文: Manify 是一个开源 Python 库,利用流形学习技术在非欧几里得空间中进行嵌入学习、分类、回归和曲率估计等任务,旨在推动机器学习的研究与应用。
English: Manify is an open-source Python library that uses manifold learning to create embeddings in non-Euclidean spaces and perform tasks like classification, regression, and curvature estimation, advancing machine learning research and applications.
Authors:Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov
Abstract:
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms
中文: 块扩散语言模型通过支持灵活长度生成和提升推理效率,弥补了扩散与自回归模型间的不足,在语言建模基准测试中达到了最新最优性能。
English: Block diffusion language models bridge the gap between diffusion and autoregressive approaches by enabling flexible-length generation and enhanced inference efficiency, achieving state-of-the-art performance in language modeling.
Authors:Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen
Abstract:
Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Additionally, we further extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs. Our code can be found in https://github.com/ziyuwan/ReMA-public
中文: ReMA框架通过多智能体强化学习将元思考与推理执行分离,有效提升大语言模型在复杂任务中的表现,并通过分层协作机制增强泛化能力。
English: The ReMA framework introduces a multi-agent reinforcement learning approach to enhance large language models' reasoning by decoupling meta-thinking and execution, significantly improving performance on complex tasks.
Authors:Nazanin Moradinasab, Saurav Sengupta, Jiebei Liu, Sana Syed, Donald E. Brown
Abstract:
Healthcare relies on multiple types of data, such as medical images, genetic information, and clinical records, to improve diagnosis and treatment. However, missing data is a common challenge due to privacy restrictions, cost, and technical issues, making many existing multi-modal models unreliable. To address this, we propose a new multimodal model called Mixture of Experts, Symmetric Aligning, and Reconstruction (MoSARe), a deep learning framework that handles incomplete multimodal data while maintaining high accuracy. MoSARe integrates expert selection, cross-modal attention, and contrastive learning to improve feature representation and decision-making. Our results show that MoSARe outperforms existing models in situations where the data is complete. Furthermore, it provides reliable predictions even when some data are missing. This makes it especially useful in real-world healthcare settings, including resource-limited environments. Our code is publicly available at https://github.com/NazaninMn/MoSARe.
中文: 提出的MoSARe模型通过专家选择和跨模态学习,能有效处理不完整的多模态医疗数据,在数据缺失时仍保持高精度并优于现有方法。
English: The proposed MoSARe model effectively handles incomplete multimodal healthcare data through expert selection and cross-modal learning, maintaining high accuracy and outperforming existing methods even with missing information.
Authors:Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Abstract:
Heterogeneous tabular data poses unique challenges in generative modelling due to its fundamentally different underlying data structure compared to homogeneous modalities, such as images and text. Although previous research has sought to adapt the successes of generative modelling in homogeneous modalities to the tabular domain, defining an effective generator for tabular data remains an open problem. One major reason is that the evaluation criteria inherited from other modalities often fail to adequately assess whether tabular generative models effectively capture or utilise the unique structural information encoded in tabular data. In this paper, we carefully examine the limitations of the prevailing evaluation framework and introduce $\textbf{TabStruct}$, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. Specifically, TabStruct evaluates the alignment of causal structures in real and synthetic data, providing a direct measure of how effectively tabular generative models learn the structure of tabular data. Through extensive experiments using generators from eight categories on seven datasets with expert-validated causal graphical structures, we show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension. Our findings highlight the importance of tabular data structure and offer practical guidance for developing more effective and robust tabular generative models. Code is available at https://github.com/SilenceX12138/TabStruct.
中文: 本文提出TabStruct这一新型评估基准,通过衡量真实数据与合成数据间因果结构的对齐程度来评估表格生成模型的结构保真度,解决了现有评估框架的局限性。
English: This paper introduces TabStruct, a novel evaluation benchmark that assesses structural fidelity in tabular generative models by measuring the alignment of causal structures between real and synthetic data, addressing the limitations of existing evaluation frameworks.
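One simple way to quantify the alignment of causal structures is the structural Hamming distance between adjacency matrices of graphs learned from real and synthetic data. The snippet below is a generic sketch of that metric (counting an edge reversal as a single edit), not necessarily the exact score TabStruct reports.

```python
import numpy as np

def structural_hamming_distance(a_real, a_synth):
    """Edits (insertions, deletions, reversals) needed to turn the synthetic-data
    causal graph into the real-data one; inputs are directed adjacency matrices."""
    a_real = np.asarray(a_real, dtype=bool)
    a_synth = np.asarray(a_synth, dtype=bool)
    diff = a_real ^ a_synth                          # edges present in only one graph
    reversed_edges = a_real & a_synth.T & ~a_synth   # i->j in real, j->i in synth
    # Without the correction, each reversal would count twice (missing + extra edge).
    return int(diff.sum() - reversed_edges.sum())

# Toy example: real graph 0->1->2; synthetic graph reverses 0->1 and drops 1->2.
real = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
synth = np.array([[0, 0, 0], [1, 0, 0], [0, 0, 0]])
print(structural_hamming_distance(real, synth))  # 2: one reversal + one missing edge
```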
Authors:Claudius Kienle, Benjamin Alt, Finn Schneider, Tobias Pertlwieser, Rainer Jäkel, Rania Rayyes
Abstract:
Despite the widespread adoption of industrial robots in automotive assembly, wire harness installation remains a largely manual process, as it requires precise and flexible manipulation. To address this challenge, we design a novel AI-based framework that automates cable connector mating by integrating force control with deep visuotactile learning. Our system optimizes search-and-insertion strategies using first-order optimization over a multimodal transformer architecture trained on visual, tactile, and proprioceptive data. Additionally, we design a novel automated data collection and optimization pipeline that minimizes the need for machine learning expertise. The framework optimizes robot programs that run natively on standard industrial controllers, permitting human experts to audit and certify them. Experimental validations on a center console assembly task demonstrate significant improvements in cycle times and robustness compared to conventional robot programming approaches. Videos are available under https://claudius-kienle.github.io/AppMuTT.
Authors:Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel
Abstract:
Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
Chinese: ForAug是一种新颖的数据增强方法,通过分离和重组前景与背景来提升图像多样性并减少Vision Transformer的偏差,显著提高了其在ImageNet及下游任务中的准确性和鲁棒性。
English: ForAug is a novel data augmentation method that enhances image diversity and mitigates biases in Vision Transformers, significantly boosting their accuracy and robustness on tasks like ImageNet and downstream applications.
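The core compositing step, recombining a segmented foreground with a different background, can be sketched as a simple alpha blend. The toy code below uses random arrays and a square mask; it is not ForAug's pipeline, which obtains masks from pretrained foundation models.

```python
import numpy as np

def recombine(foreground, fg_mask, background):
    """Paste a segmented foreground object onto a new background.
    `foreground`/`background` are HxWx3 uint8 images, `fg_mask` is HxW in [0, 1]."""
    alpha = fg_mask.astype(np.float32)[..., None]
    out = alpha * foreground.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)

# Toy usage with random "images" and a centered square object mask.
h, w = 64, 64
fg = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
bg = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
mask = np.zeros((h, w), dtype=np.float32)
mask[16:48, 16:48] = 1.0
sample = recombine(fg, mask, bg)
```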
Authors:Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Abstract:
Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly distributed from all training datapoints. However, if the data to unlearn is dominant in one group, we empirically show that performance for this group degrades, leading to fairness issues. This work tackles the overlooked problem of non-uniformly distributed forget sets, which we call group-robust machine unlearning, by presenting a simple, effective strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness. Source code available at https://github.com/tdemin16/group-robust_machine_unlearning.
中文摘要:机器遗忘旨在消除特定训练数据对模型的影响同时保留整体知识,本研究提出MIU方法,通过互信息最小化和样本重加权来解决非均匀遗忘集导致的公平性问题。
English Summary: Machine unlearning addresses removing specific training data's influence while preserving overall model knowledge, with this work introducing MIU to mitigate fairness issues from non-uniform forget sets by using mutual information minimization and sample reweighting.
Authors:Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, Tinne Tuytelaars
Abstract:
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
中文:DAVE基准数据集通过要求两种模态共同参与正确回答,并分解评估至原子子类别,解决了视听理解中的视觉偏见和综合评分问题,从而精准识别模型缺陷。
English: The DAVE benchmark dataset addresses visual bias and aggregate scoring issues in audio-visual understanding by requiring both modalities for correct answers and providing decoupled evaluation across subcategories to identify specific model weaknesses.
Authors:Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Abstract:
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
Chinese: DitHub采用模块化框架,将专家模块作为分支进行灵活适配,在开放词汇目标检测中实现了ODinW-13和ODinW-O等基准测试的最先进性能。
English: DitHub introduces a modular framework using expert modules as branches for flexible adaptation in open-vocabulary object detection, achieving state-of-the-art results on benchmarks like ODinW-13 and ODinW-O.
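Treating expert modules as branches that can be fetched and merged might look roughly like the sketch below, which averages the parameters of selected adapter state dicts. The merge rule, module names, and shapes are assumptions, not DitHub's implementation.

```python
import torch

def fetch_and_merge(library, names, weights=None):
    """Merge a subset of expert adaptation modules (parameter dicts) by weighted
    averaging, loosely mirroring a 'fetch branches and merge' workflow."""
    selected = [library[n] for n in names]
    if weights is None:
        weights = [1.0 / len(selected)] * len(selected)
    return {key: sum(w * sd[key] for w, sd in zip(weights, selected))
            for key in selected[0]}

# Toy library of two hypothetical expert modules with identical parameter shapes.
library = {
    "aquarium": {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)},
    "thermal":  {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)},
}
merged = fetch_and_merge(library, ["aquarium", "thermal"])
```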
Authors:Wei He, Shangzhi Zhang, Chun-Guang Li, Xianbiao Qi, Rong Xiao, Jun Guo
Abstract:
Spectral clustering, as a popular tool for data clustering, requires an eigen-decomposition step on a given affinity matrix to obtain the spectral embedding. Nevertheless, such a step suffers from the lack of generalizability and scalability. Moreover, the obtained spectral embeddings can hardly provide a good approximation to the ground-truth partition and thus a k-means step is adopted to quantize the embedding. In this paper, we propose a simple yet effective scalable and generalizable approach, called Neural Normalized Cut (NeuNcut), to learn the clustering membership for spectral clustering directly. In NeuNcut, we properly reparameterize the unknown cluster membership via a neural network, and train the neural network via stochastic gradient descent with a properly relaxed normalized cut loss. As a result, our NeuNcut enjoys a desired generalization ability to directly infer clustering membership for out-of-sample unseen data and hence brings us an efficient way to handle clustering tasks with ultra large-scale data. We conduct extensive experiments on both synthetic data and benchmark datasets and experimental results validate the effectiveness and the superiority of our approach. Our code is available at: https://github.com/hewei98/NeuNcut.
Chinese: 本文提出神经归一化割(NeuNcut),一种通过神经网络直接学习聚类成员的可扩展和泛化方法,无需特征分解和k均值步骤,能高效处理大规模数据。
English: The paper introduces Neural Normalized Cut (NeuNcut), a scalable and generalizable method that uses a neural network to directly learn clustering membership, eliminating the need for eigen-decomposition and k-means steps while efficiently handling large-scale data.
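A common differentiable relaxation of the normalized cut for soft assignments, which the snippet below implements on top of a small network, conveys the basic training signal; NeuNcut's exact relaxed loss and training pipeline may differ.

```python
import torch

def soft_normalized_cut(Y, W, eps=1e-8):
    """Relaxed normalized-cut loss for soft cluster assignments.
    Y: (n, k) row-stochastic memberships (e.g. softmax outputs of a network).
    W: (n, n) symmetric non-negative affinity matrix."""
    d = W.sum(dim=1)                                  # node degrees
    assoc = torch.einsum('nk,nm,mk->k', Y, W, Y)      # within-cluster association
    vol = Y.t() @ d                                   # cluster volumes
    return Y.shape[1] - (assoc / (vol + eps)).sum()

# Toy usage: a small network mapping features directly to soft memberships.
n, dim, k = 32, 10, 3
X = torch.randn(n, dim)
W = torch.exp(-torch.cdist(X, X) ** 2)                # RBF affinity
net = torch.nn.Sequential(torch.nn.Linear(dim, k), torch.nn.Softmax(dim=1))
loss = soft_normalized_cut(net(X), W)
loss.backward()
```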
Authors:Xiuwen Fang, Mang Ye, Bo Du
Abstract:
This paper studies a challenging robust federated learning task with model-heterogeneous and data-corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment, drastically crippling the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models on various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during the collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments. Our code and models are publicly available at https://github.com/FangXiuwen/RAHFL.
中文摘要:本文提出了一种新颖的鲁棒非对称异构联邦学习框架,通过多样性增强的对比学习和非对称协作策略,有效解决了联邦系统中模型异构和数据损坏的挑战。
English Summary: This paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework that employs diversity-enhanced contrastive learning and asymmetric collaboration strategies to address model heterogeneity and data corruption in federated systems.
Authors:David P. Hofmeyr
Abstract:
A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/CNS
中文: 本文提出了一种新的聚类方法,通过非参数平滑估计聚类隶属度分布,能够自动确定模型灵活性和聚类数量,并在实验中展现出优于现有方法的性能。
English: This paper introduces a novel clustering formulation that estimates cluster membership distributions through nonparametric smoothing, automatically determining both model flexibility and cluster count while demonstrating superior performance in experiments.
Authors:Jin Li, Ziqiang He, Anwei Luo, Jian-Fang Hu, Z. Jane Wang, Xiangui Kang
Abstract:
Imperceptible adversarial attacks aim to fool DNNs by adding imperceptible perturbation to the input data. Previous methods typically improve the imperceptibility of attacks by integrating common attack paradigms with specifically designed perception-based losses or the capabilities of generative models. In this paper, we propose Adversarial Attacks in Diffusion (AdvAD), a novel modeling framework distinct from existing attack paradigms. AdvAD innovatively conceptualizes attacking as a non-parametric diffusion process by theoretically exploring a basic modeling approach rather than using the denoising or generation abilities of regular diffusion models requiring neural networks. At each step, much subtler yet effective adversarial guidance is crafted using only the attacked model without any additional network, which gradually leads the end of the diffusion process from the original image to a desired imperceptible adversarial example. Grounded in a solid theoretical foundation of the proposed non-parametric diffusion process, AdvAD achieves high attack efficacy and imperceptibility with intrinsically lower overall perturbation strength. Additionally, an enhanced version AdvAD-X is proposed to evaluate the extreme of our novel framework under an ideal scenario. Extensive experiments demonstrate the effectiveness of the proposed AdvAD and AdvAD-X. Compared with state-of-the-art imperceptible attacks, AdvAD achieves an average of 99.9$\%$ (+17.3$\%$) ASR with 1.34 (-0.97) $l_2$ distance, 49.74 (+4.76) PSNR and 0.9971 (+0.0043) SSIM against four prevalent DNNs with three different architectures on the ImageNet-compatible dataset. Code is available at https://github.com/XianguiKang/AdvAD.
Chinese: 本文提出AdvAD,一种新颖的非参数扩散框架,通过理论建模而非神经网络实现难以察觉的对抗攻击,在更低扰动强度下获得更高攻击成功率和视觉隐蔽性。
English: The paper introduces AdvAD, a novel non-parametric diffusion framework for imperceptible adversarial attacks that crafts subtle perturbations without neural networks, achieving superior attack success and imperceptibility with lower perturbation strength.
Authors:Byeongchan Lee, Sehyun Lee
Abstract:
In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation invariance by bringing representations of positive pairs closer together, but they are prone to collapsing into a degenerate solution. Contrastive learning addresses this issue with a contrastive loss that prevents collapse by pushing representations of negative pairs apart. However, algorithms that rely on negative sampling are not robust to a reduction in the number of negative samples, which has motivated algorithms that do not use negative pairs at all. Many such positive-only algorithms adopt an asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting this asymmetric architecture, we introduce a methodology that implicitly incorporates the idea of contrastive learning. As its implementation, we present a novel method, guided stop-gradient. We apply our method to the benchmark algorithms SimSiam and BYOL and show that it stabilizes training and boosts performance. We also show that algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at https://github.com/bych-lee/gsg.
Chinese: 本文提出了一种引导式停止梯度方法,通过非对称网络架构隐式融入对比学习思想,无需负样本或大批量即可提升SimSiam和BYOL等自监督学习算法的训练稳定性与性能表现。
English: This paper introduces a guided stop-gradient method that enhances self-supervised learning by implicitly incorporating contrastive principles through asymmetric network architectures, improving training stability and performance in algorithms like SimSiam and BYOL without requiring negative samples or large batch sizes.
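For context, the baseline objective that guided stop-gradient modifies is the symmetrized negative cosine similarity with stop-gradient on the target branch, as used by SimSiam. The sketch below shows that baseline loss, not the proposed guidance mechanism itself.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized negative cosine similarity with stop-gradient on the targets,
    the positive-only baseline objective that guided stop-gradient builds on."""
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)

# p1, p2: predictor outputs for the two augmented views; z1, z2: encoder outputs.
p1, p2, z1, z2 = (torch.randn(16, 128, requires_grad=True) for _ in range(4))
loss = simsiam_loss(p1, p2, z1, z2)
loss.backward()
```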
Authors:Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Abstract:
Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there's been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers, which is a path of neurons from the model input to output that impacts the model inference most significantly. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome. And we further provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer trying to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate the superiority of our method finding the most influential neuron path along which the information flows, over the existing baseline solutions. Additionally, the neuron paths have illustrated that vision Transformers exhibit some specific inner working mechanism for processing the visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showcasing that the found neuron paths have already preserved the model capability on downstream tasks, which may also shed some lights on real-world applications like model pruning. The project website including implementation code is available at https://foundation-model-research.github.io/NeuronPath/.
Authors:Dikai Liu, Tianwei Zhang, Jianxiong Yin, Simon See
Abstract:
Quadrupeds have advanced rapidly in their capability to traverse complex terrains. The adoption of deep Reinforcement Learning (RL), transformers and various knowledge transfer techniques can greatly reduce the sim-to-real gap. However, the classical teacher-student framework commonly used in existing locomotion policies requires a pre-trained teacher and leverages privileged information to guide the student policy. With the implementation of large-scale models in robotics controllers, especially transformer-based ones, this knowledge distillation technique starts to show its weakness in efficiency, due to the requirement of multiple supervised stages. In this paper, we propose Unified Locomotion Transformer (ULT), a new transformer-based framework to unify the processes of knowledge transfer and policy optimization in a single network while still taking advantage of privileged information. The policies are optimized with reinforcement learning, next state-action prediction, and action imitation, all in just one training stage, to achieve zero-shot deployment. Evaluation results demonstrate that with ULT, optimal teacher and student policies can be obtained at the same time, greatly easing the difficulty in knowledge transfer, even with complex transformer-based models.
Authors:Paolo Torrado, Joshua Levin, Markus Grotz, Joshua Smith
Abstract:
Warehouse robotic systems equipped with vacuum grippers must reliably grasp a diverse range of objects from densely packed shelves. However, these environments present significant challenges, including occlusions, diverse object orientations, stacked and obstructed items, and surfaces that are difficult to suction. We introduce TetraGrip, a novel vacuum-based grasping strategy featuring four suction cups mounted on linear actuators. Each actuator is equipped with an optical time-of-flight (ToF) proximity sensor, enabling reactive grasping.
We evaluate TetraGrip in a warehouse-style setting, demonstrating its ability to manipulate objects in stacked and obstructed configurations. Our results show that our RL-based policy improves picking success in stacked-object scenarios by 22.86\% compared to a single-suction gripper. Additionally, we demonstrate that TetraGrip can successfully grasp objects in scenarios where a single-suction gripper fails due to physical limitations, specifically in two cases: (1) picking an object occluded by another object and (2) retrieving an object in a complex scenario. These findings highlight the advantages of multi-actuated, suction-based grasping in unstructured warehouse environments. The project website is available at: \href{https://tetragrip.github.io/}{https://tetragrip.github.io/}.
Authors:Joao D. S. Marques, Arlindo L. Oliveira
Abstract:
Pulmonary embolism is a leading cause of out-of-hospital cardiac arrest that requires fast diagnosis. While computed tomography pulmonary angiography is the standard diagnostic tool, it is not always accessible. Electrocardiography is an essential tool for diagnosing multiple cardiac anomalies, as it is affordable, fast and available in many settings. However, the availability of public ECG datasets, especially for PE, is limited and, in practice, these datasets tend to be small, making it essential to optimize learning strategies. In this study, we investigate the performance of multiple neural networks in order to assess the impact of various approaches. Moreover, we check whether these practices enhance model generalization when transfer learning is used to translate information learned in larger ECG datasets, such as PTB-XL, CPSC18 and MedalCare-XL, to a smaller, more challenging dataset for PE. By leveraging transfer learning, we analyze the extent to which we can improve learning efficiency and predictive performance on limited data. Code available at https://github.com/joaodsmarques/Are-ECGs-enough-Deep-Learning-Classifiers.
中文: 本研究通过从大型心电图数据集进行迁移学习,评估神经网络在有限数据条件下对肺栓塞的诊断性能,旨在提升学习效率与预测表现。
English: This study evaluates neural networks using transfer learning from large ECG datasets to improve pulmonary embolism diagnosis on limited data, aiming to enhance learning efficiency and predictive performance.
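A minimal transfer-learning recipe of the kind studied here freezes a pretrained feature extractor and retrains a small head on the PE dataset. The tiny model, the commented checkpoint name, and the label counts below are placeholders, not the paper's architecture or data.

```python
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    """Tiny 1D-CNN stand-in for a model pretrained on a large ECG corpus."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(12, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.fc(self.features(x))

model = ECGNet(n_classes=71)              # e.g. pretrained on PTB-XL statement labels
# model.load_state_dict(torch.load("ptbxl_pretrained.pt"))  # hypothetical checkpoint

# Transfer to the small PE dataset: freeze the feature extractor, replace the head.
for p in model.features.parameters():
    p.requires_grad = False
model.fc = nn.Linear(32, 2)               # binary PE vs. non-PE head
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

x = torch.randn(8, 12, 1000)              # batch of 12-lead ECG windows
loss = nn.CrossEntropyLoss()(model(x), torch.randint(0, 2, (8,)))
loss.backward(); optimizer.step()
```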
Authors:Hrishikesh Viswanath, Md Ashiqur Rahman, Chi Lin, Damon Conover, Aniket Bera
Abstract:
Accurate and efficient 3D mapping of large-scale outdoor environments from LiDAR measurements is a fundamental challenge in robotics, particularly towards ensuring smooth and artifact-free surface reconstructions. Although the state-of-the-art methods focus on memory-efficient neural representations for high-fidelity surface generation, they often fail to produce artifact-free manifolds, with artifacts arising due to noisy and sparse inputs. To address this issue, we frame surface mapping as a physics-informed energy optimization problem, enforcing surface smoothness by optimizing an energy functional that penalizes sharp surface ridges. Specifically, we propose a deep learning based approach that learns the signed distance field (SDF) of the surface manifold from raw LiDAR point clouds using a physics-informed loss function that optimizes the $L_2$-Hessian energy of the surface. Our learning framework includes a hierarchical octree based input feature encoding and a multi-scale neural network to iteratively refine the signed distance field at different scales of resolution. Lastly, we introduce a test-time refinement strategy to correct topological inconsistencies and edge distortions that can arise in the generated mesh. We propose a \texttt{CUDA}-accelerated least-squares optimization that locally adjusts vertex positions to enforce feature-preserving smoothing. We evaluate our approach on large-scale outdoor datasets and demonstrate that our approach outperforms current state-of-the-art methods in terms of improved accuracy and smoothness. Our code is available at \href{https://github.com/HrishikeshVish/HessianForge/}{https://github.com/HrishikeshVish/HessianForge/}
Chinese: 本研究提出了一种基于物理信息的深度学习方法,通过优化L2-Hessian能量从LiDAR点云重建平滑无伪影的3D表面,在精度和网格质量上超越了现有技术。
English: This study introduces a physics-informed deep learning method that optimizes the L2-Hessian energy to reconstruct smooth, artifact-free 3D surfaces from LiDAR point clouds, outperforming existing techniques in accuracy and mesh quality.
Authors:Anand Menon, Samit S Miftah, Shamik Kundu, Souvik Kundu, Amisha Srivastava, Arnab Raha, Gabriel Theodor Sonnenschein, Suvadeep Banerjee, Deepak Mathaikutty, Kanad Basu
Abstract:
Hardware verification is crucial in modern SoC design, consuming around 70% of development time. SystemVerilog assertions ensure correct functionality. However, existing industrial practices rely on manual efforts for assertion generation, which becomes increasingly untenable as hardware systems become complex. Recent research shows that Large Language Models (LLMs) can automate this process. However, proprietary SOTA models like GPT-4o often generate inaccurate assertions and require expensive licenses, while smaller open-source LLMs need fine-tuning to manage HDL code complexities. To address these issues, we introduce **VERT**, an open-source dataset designed to enhance SystemVerilog assertion generation using LLMs. VERT enables researchers in academia and industry to fine-tune open-source models, outperforming larger proprietary ones in both accuracy and efficiency while ensuring data privacy through local fine-tuning and eliminating costly licenses. The dataset is curated by systematically augmenting variables from open-source HDL repositories to generate synthetic code snippets paired with corresponding assertions. Experimental results demonstrate that fine-tuned models like Deepseek Coder 6.7B and Llama 3.1 8B outperform GPT-4o, achieving up to 96.88% improvement over base models and 24.14% over GPT-4o on platforms including OpenTitan, CVA6, OpenPiton and Pulpissimo. VERT is available at https://github.com/AnandMenon12/VERT.
中文: VERT是一个开源数据集,通过微调开源大语言模型来自动生成SystemVerilog断言,在精度和效率上显著超越GPT-4o等专有模型,同时无需授权费用。
English: VERT is an open-source dataset that enables fine-tuning of open-source LLMs to automate SystemVerilog assertion generation, significantly outperforming proprietary models like GPT-4o in accuracy and efficiency while eliminating licensing costs.
Authors:Minsu Kim, Rodrigo Mira, Honglie Chen, Stavros Petridis, Maja Pantic
Abstract:
In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performances on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90 % accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model's flexibility and effectiveness, allowing us to use either cue during inference, or combine both for improved performance. Samples and code available on https://miraodasilva.github.io/cse-project-page .
中文: 本文提出了一种新颖的上下文语音提取方法,仅利用文本对话历史即可提取目标语音,仅需两个对话轮次就能达到90%以上的准确率,同时保持结合注册语音线索的灵活性。
English: This paper introduces a novel Contextual Speech Extraction method that uses only textual dialogue history to extract target speech, achieving over 90% accuracy with just two dialogue turns while maintaining flexibility to incorporate enrollment cues.
Authors:Nithin Parsan, David J. Yang, John J. Yang
Abstract:
Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high-dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open-source our code, dataset, pretrained models https://github.com/johnyang101/reticular-sae , and visualizer https://sae.reticular.ai .
中文: 本研究将稀疏自编码器扩展至ESM2-3B模型,首次实现了蛋白质结构预测的可解释性分析,并通过改进的套娃式稀疏自编码器在性能上超越传统架构,同时展示了调控结构预测的实际应用能力。
English: This study scales sparse autoencoders to the ESM2-3B model, enabling interpretable protein structure prediction and introducing Matryoshka SAEs that outperform standard architectures while demonstrating practical applications in steering structural predictions.
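The defining idea, nested groups of latents that must each reconstruct the input on their own, can be sketched as below. The group sizes, dimensions, and L1 sparsity penalty are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    """Sparse autoencoder whose nested latent prefixes each reconstruct the input.
    A minimal sketch of the idea; the paper's architecture and penalty may differ."""
    def __init__(self, d_in, d_latent, group_sizes=(64, 256, 1024)):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)
        self.group_sizes = group_sizes

    def forward(self, x, l1_coef=1e-3):
        z = torch.relu(self.enc(x))
        loss = l1_coef * z.abs().mean()
        for m in self.group_sizes:           # each nested prefix reconstructs x alone
            z_m = torch.cat([z[:, :m], torch.zeros_like(z[:, m:])], dim=1)
            loss = loss + ((self.dec(z_m) - x) ** 2).mean()
        return loss

sae = MatryoshkaSAE(d_in=2560, d_latent=1024)   # 2560 = a representative hidden size
loss = sae(torch.randn(4, 2560))
loss.backward()
```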
Authors:Wenyi Wu, Hao Zhang, Zhisen Wei, Xiao-Yuan Jing, Qinghua Zhang, Songsong Wu
Abstract:
Source-free domain adaptation (SFDA) has been exploited for cross-domain bearing fault diagnosis without access to source data. Current methods select partial target samples with reliable pseudo-labels for model adaptation, which is sub-optimal due to the ignored target samples. We argue that every target sample can contribute to model adaptation, and accordingly propose in this paper a novel SFDA-based approach for bearing fault diagnosis that exploits both reliable and unreliable pseudo-labels. We develop a data-augmentation-based label voting strategy to divide the target samples into reliable and unreliable ones. We propose to explore the underlying relation between feature space and label space by using the reliable pseudo-labels as ground-truth labels, meanwhile, alleviating negative transfer by maximizing the entropy of the unreliable pseudo-labels. The proposed method achieves a good balance between discriminability and diversity by taking advantage of reliable and unreliable pseudo-labels. Extensive experiments are conducted on two bearing fault benchmarks, demonstrating that our approach achieves significant performance improvements against existing SFDA-based bearing fault diagnosis methods. Our code is available at https://github.com/BdLab405/SDALR.
中文: 本文提出了一种新的无源域自适应轴承故障诊断方法,通过标签投票策略和熵最大化同时利用可靠与不可靠伪标签,在基准测试中实现了优于现有方法的性能表现。
English: This paper introduces a novel source-free domain adaptation method for bearing fault diagnosis that leverages both reliable and unreliable pseudo-labels through a label voting strategy and entropy maximization, achieving superior performance on benchmark datasets.
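The two complementary loss terms described above, cross-entropy on reliably pseudo-labelled samples and entropy maximization on the unreliable ones, can be combined as in this sketch; the trade-off weight `lam` and the toy inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def adaptation_loss(logits, pseudo_labels, reliable_mask, lam=1.0, eps=1e-8):
    """CE on reliably pseudo-labelled samples plus entropy maximization on the rest."""
    probs = F.softmax(logits, dim=1)
    ce = F.cross_entropy(logits[reliable_mask], pseudo_labels[reliable_mask])
    entropy = -(probs[~reliable_mask] * torch.log(probs[~reliable_mask] + eps)).sum(dim=1).mean()
    return ce - lam * entropy      # minimizing -entropy maximizes prediction entropy

logits = torch.randn(16, 4, requires_grad=True)
pseudo = torch.randint(0, 4, (16,))
reliable = torch.tensor([True] * 8 + [False] * 8)   # e.g. from the label-voting split
adaptation_loss(logits, pseudo, reliable).backward()
```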
Authors:Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
Abstract:
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
中文: 研究发现CLIP的潜在空间存在固有几何限制,无法同时处理多种语义任务,并提出密集余弦相似度映射作为可解释的解决方案,通过保持图像-文本拓扑结构提升了各项基准测试的性能。
English: The study reveals that CLIP's latent space has inherent geometric limitations preventing it from simultaneously handling multiple semantic tasks, and proposes Dense Cosine Similarity Maps as an interpretable solution that improves performance across benchmarks by preserving image-text topology.
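A dense cosine similarity map in this spirit scores every patch-token pair instead of a single pooled image-text pair. The sketch below computes the map itself; the final pooling into a scalar score is an illustrative choice, not the paper's scoring rule.

```python
import torch
import torch.nn.functional as F

def dense_cosine_similarity_map(patch_emb, token_emb):
    """Cosine similarity between every image patch and every text token,
    preserving the spatial/sequential topology rather than pooling first.
    patch_emb: (P, D) patch embeddings, token_emb: (T, D) token embeddings."""
    patches = F.normalize(patch_emb, dim=-1)
    tokens = F.normalize(token_emb, dim=-1)
    return patches @ tokens.t()                      # (P, T) similarity map

# Toy usage with CLIP-like sizes: 196 patches, 12 tokens, 512-d embeddings.
sim_map = dense_cosine_similarity_map(torch.randn(196, 512), torch.randn(12, 512))
score = sim_map.max(dim=0).values.mean()             # one simple way to pool the map
```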
Authors:Yan Hu, Ahmad Chaddad
Abstract:
This study introduces the SHAP-integrated convolutional diagnostic network (SICDN), an interpretable feature selection method designed for limited datasets, to address the challenge posed by data privacy regulations that restrict access to medical datasets. The SICDN model was tested on classification tasks using pneumonia and breast cancer datasets, demonstrating over 97% accuracy and surpassing four popular CNN models. We also integrated a historical weighted moving average technique to enhance feature selection. The SICDN shows potential in medical image prediction, with the code available on https://github.com/AIPMLab/SICDN.
中文摘要:SICDN模型作为一种可解释的特征选择方法,在有限医疗数据集上实现了超过97%的分类准确率,有效应对了数据隐私保护带来的挑战。
English Summary: The SICDN model is an interpretable feature selection method that achieves over 97% accuracy on medical datasets while addressing data privacy constraints through effective performance with limited data.
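The historical weighted moving average of feature importances might be implemented roughly as below, exponentially down-weighting older epochs of mean |SHAP| values; the decay factor and the selection step are assumptions rather than SICDN's exact procedure.

```python
import numpy as np

def weighted_moving_average(importance_history, decay=0.7):
    """Exponentially decayed average of per-epoch feature-importance vectors
    (e.g. mean |SHAP| values), giving more weight to recent epochs."""
    importance_history = np.asarray(importance_history)    # (epochs, n_features)
    n = len(importance_history)
    weights = decay ** np.arange(n - 1, -1, -1)            # oldest epoch weighted least
    weights /= weights.sum()
    return weights @ importance_history

history = np.abs(np.random.randn(5, 20))        # 5 epochs of importances, 20 features
smoothed = weighted_moving_average(history)
top_features = np.argsort(smoothed)[::-1][:8]   # keep the 8 most important features
```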
Authors:Viktor Moskvoretskii, Chris Biemann, Irina Nikishina
Abstract:
Although large language models (LLMs) have achieved remarkable performance across various tasks, they remain prone to errors. A key challenge is enabling them to self-correct. While prior research has relied on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task demonstrate that STaSC effectively learns self-correction, leading to significant performance improvements. Our analysis further provides insights into the mechanisms of self-correction and the impact of different design choices on learning dynamics and overall performance. To support future research, we release our user-friendly codebase and lightweight models.
Chinese: 本研究提出自我教学式修正(STaSC)算法,通过仅使用自生成数据进行迭代微调,使小型语言模型实现自我纠错,在问答任务上显著提升性能,并为自我修正机制及设计选择的影响提供了深入见解。
English: This study introduces the Self-Taught Self-Correction (STaSC) algorithm, enabling small language models to self-correct through iterative fine-tuning with self-generated data, significantly improving performance on question-answering tasks while providing insights into self-correction mechanisms.
Authors:Tobias Kreiman, Aditi S. Krishnapriyan
Abstract:
Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. In order to characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. We then propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. Our second strategy improves representations for out-of-distribution systems at test-time by taking gradient steps using an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of and can move towards modeling diverse chemical spaces, but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at https://tkreiman.github.io/projects/mlff_distribution_shifts/.
Authors:Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan
Abstract:
Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of ``cat'' can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE. Code is available at: https://github.com/LAMDA-CL/ICCV25-ENGINE
Chinese: 本文提出ENGINE方法,通过双分支调优和重排序机制将外部知识注入CLIP模型,在类增量学习中实现了最先进的性能,使模型能够更好地捕捉演化数据中的上下文特征。
English: This paper introduces ENGINE, a method that enhances Class-Incremental Learning (CIL) by injecting external knowledge into CLIP through dual-branch tuning and re-ranking, achieving state-of-the-art performance by capturing richer contextual features for evolving data streams.
Authors:Yuncheng Guo, Xiaodong Gu
Abstract:
Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders--where dataset-specific features are more prominent--while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
中文摘要:提出的多模态表征学习(MMRL)框架通过构建共享表征空间,在优化表征和类别特征的同时与冻结预训练知识对齐,有效解决了小样本视觉语言任务中的过拟合问题,实现了任务适应性与泛化能力的平衡。
English Summary: The proposed Multi-Modal Representation Learning (MMRL) framework addresses overfitting in few-shot vision-language tasks by creating a shared representation space that optimizes both representation and class features while preserving generalization through alignment with frozen pre-trained knowledge.
Authors:Liang Yu, Lai Tu, Xiang Bai
Abstract:
Multivariate time-series forecasting holds immense value across diverse applications, requiring methods to effectively capture complex temporal and inter-variable dynamics. A key challenge lies in uncovering the intrinsic patterns that govern predictability, beyond conventional designs, focusing on network architectures to explore latent relationships or temporal dependencies. Inspired by signal decomposition, this paper posits that time series predictability is derived from periodic characteristics at different frequencies. Consequently, we propose a novel time series forecasting method based on multi-frequency reference series correlation analysis. Through spectral analysis on long-term training data, we identify dominant spectral components and their harmonics to design base-pattern reference series. Unlike signal decomposition, which represents the original series as a linear combination of basis signals, our method uses a transformer model to compute cross-attention between the original series and reference series, capturing essential features for forecasting. Experiments on major open and synthetic datasets show state-of-the-art performance. Furthermore, by focusing on attention with a small number of reference series rather than pairwise variable attention, our method ensures scalability and broad applicability. The source code is available at: https://github.com/yuliang555/MFRS
中文摘要:本文提出了一种新颖的时间序列预测方法,通过多频参考序列和基于Transformer的交叉注意力机制来捕捉关键时序特征,在实现优异预测性能的同时保证了方法的可扩展性。
English Summary: This paper introduces a novel time series forecasting method that uses multi-frequency reference series and transformer-based cross-attention to capture essential temporal patterns, achieving state-of-the-art performance with enhanced scalability.
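Building base-pattern reference series from dominant spectral components can be sketched with a plain FFT, as below. The number of components and the sinusoidal form are illustrative; the paper's reference-series construction and harmonic handling may differ.

```python
import numpy as np

def build_reference_series(train_series, horizon, n_components=3):
    """Pick dominant frequencies from the training series via the FFT and
    synthesize sinusoidal reference series covering history plus forecast horizon."""
    x = train_series - train_series.mean()
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x))                       # cycles per time step
    top = np.argsort(np.abs(spectrum))[::-1][:n_components]
    t = np.arange(len(x) + horizon)
    refs = [np.cos(2 * np.pi * freqs[k] * t + np.angle(spectrum[k])) for k in top]
    return np.stack(refs)                                 # (n_components, len + horizon)

series = np.sin(2 * np.pi * np.arange(500) / 24) + 0.1 * np.random.randn(500)
references = build_reference_series(series, horizon=96)
```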
Authors:Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
Abstract:
Large language models have been widely adopted across different tasks, but their auto-regressive generation nature often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models. The code is publicly available at https://github.com/FerranAgulloLopez/vLLMBatchingMemoryGap.
中文摘要:本研究揭示了大批量语言模型推理因DRAM带宽饱和仍受内存限制,提出了批处理配置顾问来优化内存使用,并通过模型复制实现并发工作负载以提升资源利用率。
English Summary: This study reveals that large-batch inference for language models remains memory-bound due to DRAM bandwidth saturation, and proposes a Batching Configuration Advisor to optimize memory usage while enabling concurrent workloads through model replication.
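The memory pressure behind this argument is easy to see from KV-cache arithmetic alone: the cache grows linearly in batch size and sequence length and quickly dominates DRAM traffic. The configuration below is a representative 8B-class model in fp16 and is illustrative, not a measurement from the paper.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Memory consumed by the KV cache alone: two tensors (K and V) per layer,
    each of shape (batch, kv_heads, seq_len, head_dim)."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 8 KV heads of dim 128 (grouped-query attention), fp16,
# serving a batch of 64 requests with 4k-token contexts.
gib = kv_cache_bytes(64, 4096, 32, 8, 128, 2) / 2**30
print(f"{gib:.0f} GiB of KV cache")   # 32 GiB, before weights and activations
```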
Authors:Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He
Abstract:
Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.
中文: RouteSAE通过结合路由机制和共享稀疏自编码器,有效提取大语言模型中跨多层的特征,在相同稀疏度下比基线方法提取更多特征且具有更高的可解释性。
English: RouteSAE introduces a routing mechanism with a shared sparse autoencoder to efficiently extract and interpret features across multiple layers in large language models, achieving higher feature count and interpretability scores than baseline methods.
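The routing idea, softly weighting activations from different layers before a shared SAE, can be sketched as below; the router design, dimensions, and sparsity penalty are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class RoutedSAE(nn.Module):
    """Shared sparse autoencoder fed by a learned soft routing over per-layer
    activations. A minimal sketch of the routing idea, not the paper's model."""
    def __init__(self, d_model, d_latent):
        super().__init__()
        self.router = nn.Linear(d_model, 1)           # scores each layer's activation
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, layer_acts):                     # (n_layers, batch, d_model)
        scores = self.router(layer_acts).squeeze(-1)   # (n_layers, batch)
        w = torch.softmax(scores, dim=0).unsqueeze(-1)
        routed = (w * layer_acts).sum(dim=0)           # weighted mix across layers
        z = torch.relu(self.enc(routed))
        recon = self.dec(z)
        return ((recon - routed) ** 2).mean() + 1e-3 * z.abs().mean()

acts = torch.randn(16, 8, 2048)                        # 16 layers, batch 8, 2048-d states
loss = RoutedSAE(d_model=2048, d_latent=8192)(acts)
loss.backward()
```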
Authors:DongHeun Han, Byungmin Kim, RoUn Lee, KyeongMin Kim, Hyoseok Hwang, HyeongYeop Kang
Abstract:
Realistic Hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on kinematic approach or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users' intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user's grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios-randomizing object shapes, wrist movements, and trigger input flows-to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods. Demo videos are available as supplementary material and the code is provided at https://han-dongheun.github.io/ForceGrip.
Authors:Sanghyuk Chun, Sangdoo Yun
Abstract:
Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at a scale, the ProLIP models cannot handle long context texts longer than 64 context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning. We also observe a trade-off between the long context understanding (measured by Urban-1k) and general zero-shot capability (measured by evaluation datasets by DataComp). Code is available at https://github.com/naver-ai/prolip
中文: 本文提出了一种针对概率语言-图像预训练(ProLIP)的微调策略,使其能处理长达256个标记的文本序列,在Urban-1k和DataComp基准测试中验证了该方法在提升长文本理解能力的同时,尽可能保持了模型的通用零样本能力。
English: This paper introduces a fine-tuning strategy for Probabilistic Language-Image Pre-Training (ProLIP) to handle longer text sequences up to 256 tokens, improving long-context understanding while maintaining general zero-shot capabilities, as validated on Urban-1k and DataComp benchmarks.
Authors:Ying Fu Lim, Jiawen Zhu, Guansong Pang
Abstract:
Log Anomaly Detection (LAD) seeks to identify atypical patterns in log data that are crucial to assessing the security and condition of systems. Although Large Language Models (LLMs) have shown tremendous success in various fields, the use of LLMs in enabling the detection of log anomalies is largely unexplored. This work aims to fill this gap. Due to the prohibitive costs involved in fully fine-tuning LLMs, we explore the use of parameter-efficient fine-tuning techniques (PEFTs) for adapting LLMs to LAD. To have an in-depth exploration of the potential of LLM-driven LAD, we present a comprehensive investigation of leveraging two of the most popular PEFTs -- Low-Rank Adaptation (LoRA) and Representation Fine-tuning (ReFT) -- to tap into three prominent LLMs of varying size, including RoBERTa, GPT-2, and Llama-3, for parameter-efficient LAD. Comprehensive experiments on four public log datasets are performed to reveal important insights into effective LLM-driven LAD in several key perspectives, including the efficacy of these PEFT-based LLM-driven LAD methods, their stability, sample efficiency, robustness w.r.t. unstable logs, and cross-dataset generalization. Code is available at https://github.com/mala-lab/LogADReft.
中文: 本研究探索使用LoRA和ReFT等参数高效微调技术,将大语言模型适配于日志异常检测任务,并在多个数据集上从效能、稳定性等关键维度评估其性能。
English: This study explores parameter-efficient fine-tuning techniques like LoRA and ReFT to adapt large language models for log anomaly detection, evaluating their effectiveness across multiple datasets and key performance aspects.
Authors:Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
Abstract:
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of the DKL loss, we have identified two areas for improvement. First, we address the limitation of the KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property and introducing a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Second, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard -- RobustBench -- and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating its substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
中文: 本文从数学上证明了KL散度损失与解耦KL(DKL)损失的等价性,提出了一种广义KL(GKL)损失,通过改进非对称优化特性和引入类间全局信息,在多个数据集上实现了最先进的对抗鲁棒性和有竞争力的知识蒸馏性能。
English: This paper mathematically demonstrates the equivalence between KL Divergence loss and Decoupled KL (DKL) loss, proposing a Generalized KL (GKL) loss with two enhancements—addressing asymmetric optimization and incorporating class-wise global information—which achieves state-of-the-art adversarial robustness and competitive knowledge distillation performance across multiple datasets.
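To make the decomposition above concrete, here is a minimal PyTorch-style sketch of a distillation loss that combines a weighted MSE term on temperature-scaled logits with a soft-label cross-entropy term, in the spirit of the wMSE + cross-entropy split described in the abstract. The confidence-based weighting and the temperature are placeholders, not the paper's exact GKL formulation.

```python
import torch
import torch.nn.functional as F

def dkl_style_loss(student_logits, teacher_logits, temperature=4.0, alpha=1.0, beta=1.0):
    """Illustrative decoupled distillation loss: weighted MSE + soft-label cross-entropy.

    Follows the spirit of the decomposition described above; the exact weighting
    in the GKL paper differs (see the official repository for the real loss).
    """
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)          # soft labels
    log_p_student = F.log_softmax(student_logits / t, dim=-1)

    # (1) weighted MSE between temperature-scaled logits, weighted by teacher confidence
    w = p_teacher.detach()
    wmse = (w * (student_logits / t - teacher_logits / t) ** 2).sum(dim=-1).mean()

    # (2) cross-entropy with soft labels
    soft_ce = -(p_teacher * log_p_student).sum(dim=-1).mean()

    return alpha * wmse + beta * soft_ce
```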
Authors:Meghna Roy Chowdhury, Wei Xuan, Shreyas Sen, Yixue Zhao, Yi Ding
Abstract:
Mental health issues among college students have reached critical levels, significantly impacting academic performance and overall wellbeing. Predicting and understanding mental health status among college students is challenging due to three main factors: the necessity for large-scale longitudinal datasets, the prevalence of black-box machine learning models lacking transparency, and the tendency of existing approaches to provide aggregated insights at the population level rather than individualized understanding.
To tackle these challenges, this paper presents I-HOPE, the first Interpretable Hierarchical mOdel for Personalized mEntal health prediction. I-HOPE is a two-stage hierarchical model that connects raw behavioral features to mental health status through five defined behavioral categories as interaction labels. We evaluate I-HOPE on the College Experience Study, the longest longitudinal mobile sensing dataset. This dataset spans five years and captures data from both pre-pandemic periods and the COVID-19 pandemic. I-HOPE achieves a prediction accuracy of 91%, significantly surpassing the 60-70% accuracy of baseline methods. In addition, I-HOPE distills complex patterns into interpretable and individualized insights, enabling the future development of tailored interventions and improving mental health support. The code is available at https://github.com/roycmeghna/I-HOPE.
中文: 本文提出I-HOPE可解释分层模型,通过行为分析以91%的预测准确率解决大学生心理健康预测难题,并提供个性化解读。
English: This paper introduces I-HOPE, an interpretable hierarchical model that addresses challenges in predicting college students' mental health by achieving 91% accuracy and providing individualized insights through behavioral analysis.
Authors:Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen
Abstract:
Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at https://github.com/yangzhao1230/TACO.
中文: 本文提出了一种结合强化学习的生成方法,通过微调自回归模型并整合模拟转录因子结合位点修饰的生物先验知识,成功设计了适用于酵母和人类细胞的高适应性顺式调控元件。
English: This paper introduces a generative approach using reinforcement learning to fine-tune an autoregressive model for designing high-fitness cis-regulatory elements by incorporating biological priors that simulate transcription factor binding site modifications, demonstrating effectiveness across yeast and human cell applications.
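As a rough flavor of how a TFBS-informed reward could be scored during RL fine-tuning, the snippet below counts activator and repressor motif occurrences in a generated sequence. The motif strings and weights are hypothetical placeholders; the paper derives its rewards from computational inference rather than raw substring counts.

```python
def tfbs_reward(sequence: str,
                activator_motifs=("TGACTC", "CACGTG"),   # hypothetical activator TFBSs
                repressor_motifs=("GGGGGG",),             # hypothetical repressor TFBS
                w_act: float = 1.0,
                w_rep: float = 1.0) -> float:
    """Toy reward: encourage activator motifs, penalize repressor motifs."""
    def count(seq, motif):
        # count (possibly overlapping) occurrences of a motif
        return sum(1 for i in range(len(seq) - len(motif) + 1) if seq[i:i + len(motif)] == motif)

    act = sum(count(sequence, m) for m in activator_motifs)
    rep = sum(count(sequence, m) for m in repressor_motifs)
    return w_act * act - w_rep * rep

print(tfbs_reward("ATGCACGTGTTTGACTCGGGGGG"))  # 2 activator hits - 1 repressor hit = 1.0
```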
Authors:Jiahao Xu, Zikai Zhang, Rui Hu
Abstract:
The distributed nature of training makes Federated Learning (FL) vulnerable to backdoor attacks, where malicious model updates aim to compromise the global model's performance on specific tasks. Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. AlignIns looks into the direction of each model update through a direction alignment inspection process. Specifically, it examines the alignment of model updates with the overall update direction and analyzes the distribution of the signs of their significant parameters, comparing them with the principal sign across all model updates. Model updates that exhibit an unusual degree of alignment are considered malicious and are thus filtered out. We provide a theoretical analysis of the robustness of AlignIns and its propagation error in FL. Our empirical results on both independent and identically distributed (IID) and non-IID datasets demonstrate that AlignIns achieves higher robustness compared to state-of-the-art defense methods. The code is available at https://github.com/JiiahaoXU/AlignIns.
Chinese: AlignIns是一种新颖的防御方法,通过检查模型更新的方向一致性,根据其与整体更新方向和参数符号分布的偏差来过滤恶意更新,从而保护联邦学习免受后门攻击。
English: AlignIns is a novel defense method that protects Federated Learning from backdoor attacks by inspecting the direction alignment of model updates and filtering out malicious ones based on their deviation from the overall update direction and parameter sign distribution.
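A rough sketch of direction-alignment inspection along the lines described above: each flattened client update is scored by its cosine similarity to the mean update and by how well the signs of its largest-magnitude coordinates agree with the elementwise majority sign, and updates whose scores are statistical outliers are dropped. The z-score thresholding here is a simplification, not the paper's exact filtering rule.

```python
import torch

def filter_updates(updates: list[torch.Tensor], top_k_frac: float = 0.1, z_thresh: float = 2.0):
    """Toy direction-alignment inspection for a list of flattened client updates."""
    U = torch.stack(updates)                                   # (n_clients, dim)
    mean_dir = U.mean(dim=0)
    cos = torch.nn.functional.cosine_similarity(U, mean_dir.unsqueeze(0), dim=1)

    # sign agreement of each update's largest-magnitude coordinates with the majority sign
    majority_sign = torch.sign(torch.sign(U).sum(dim=0))
    k = max(1, int(top_k_frac * U.shape[1]))
    topk_idx = U.abs().topk(k, dim=1).indices
    agree = torch.stack([
        (torch.sign(U[i, topk_idx[i]]) == majority_sign[topk_idx[i]]).float().mean()
        for i in range(U.shape[0])
    ])

    # flag clients whose alignment scores deviate strongly from the cohort (either direction)
    def z(x):
        return (x - x.mean()) / (x.std() + 1e-8)

    keep = (z(cos).abs() < z_thresh) & (z(agree).abs() < z_thresh)
    return [u for u, ok in zip(updates, keep) if ok]
```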
Authors:James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
Abstract:
How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
中文: 本文提出了VidDiff任务,用于识别相同动作视频间的细微差异,并创建了VidDiffBench基准数据集,该数据集对现有先进模型构成挑战;同时设计了三阶段智能工作流方法,以解决动作定位和细粒度帧比较两大核心难题。
English: This paper introduces VidDiff, a novel task for identifying subtle differences between videos of the same action, and proposes VidDiffBench, a benchmark dataset that challenges state-of-the-art models, along with a three-stage agentic workflow method to address localization and fine-grained comparison difficulties.
Authors:Elvis Kimara, Mozhgan Hadadi, Jackson Godbersen, Aditya Balu, Talukder Jubery, Yawei Li, Adarsh Krishnamurthy, Patrick S. Schnable, Baskar Ganapathysubramanian
Abstract:
The development of artificial intelligence (AI) and machine learning (ML) based tools for 3D phenotyping, especially for maize, has been limited due to the lack of large and diverse 3D datasets. 2D image datasets fail to capture essential structural details such as leaf architecture, plant volume, and spatial arrangements that 3D data provide. To address this limitation, we present MaizeField3D (https://baskargroup.github.io/MaizeField3D/), a curated dataset of 3D point clouds of field-grown maize plants from a diverse genetic panel, designed to be AI-ready for advancing agricultural research. Our dataset includes 1,045 high-quality point clouds of field-grown maize collected using a terrestrial laser scanner (TLS). Point clouds of 520 plants from this dataset were segmented and annotated using a graph-based segmentation method to isolate individual leaves and stalks, ensuring consistent labeling across all samples. This labeled data was then used for fitting procedural models that provide a structured parametric representation of the maize plants. The leaves of the maize plants in the procedural models are represented using Non-Uniform Rational B-Spline (NURBS) surfaces that were generated using a two-step optimization process combining gradient-free and gradient-based methods. We conducted rigorous manual quality control on all datasets, correcting errors in segmentation, ensuring accurate leaf ordering, and validating metadata annotations. The dataset also includes metadata detailing plant morphology and quality, alongside multi-resolution subsampled point cloud data (100k, 50k, 10k points), which can be readily used for different downstream computational tasks. MaizeField3D will serve as a comprehensive foundational dataset for AI-driven phenotyping, plant structural analysis, and 3D applications in agricultural research.
中文: MaizeField3D数据集通过提供1045个带标注的田间玉米三维点云数据,解决了三维表型分析数据匮乏的问题,其分割的叶片茎秆结构和多分辨率数据为农业人工智能研究奠定了基础。
English: The MaizeField3D dataset addresses the scarcity of diverse 3D maize data by providing 1,045 annotated point clouds with segmented leaves and stalks, enabling AI-driven agricultural research through structured parametric models and multi-resolution data.
Authors:Kwanyoung Kim, Byeongsu Sim
Abstract:
Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. They also rely on heuristic approaches that require identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they previously struggled. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution. See our project page: https://cubeyoung.github.io/pladis-proejct/
Authors:Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang
Abstract:
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://github.com/DDVD233/climb.
中文:临床大规模综合多模态基准(CLIMB)通过整合多样化的临床数据,为全面评估患者健康提供了基础,研究表明多任务预训练能显著提升超声和心电图等薄弱领域的性能及模型泛化能力。
English: The Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB) introduces a comprehensive dataset integrating diverse clinical data to advance holistic patient assessments, demonstrating that multitask pretraining significantly enhances performance and generalization in understudied domains like ultrasound and ECG analysis.
Authors:Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin
Abstract:
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at https://github.com/WissingChen/CRA-GQA.
中文摘要:本文提出的跨模态因果关联对齐(CRA)框架通过高斯平滑定位、跨模态对齐和显式因果干预三大模块,有效消除视频问答任务中的伪相关性,显著提升了时序定位精度与问答推理的鲁棒性。
English Summary: The proposed Cross-modal Causal Relation Alignment (CRA) framework addresses spurious correlations in Video Question Grounding by integrating Gaussian smoothing, cross-modal alignment, and explicit causal intervention to improve temporal localization and reasoning robustness.
Authors:Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou
Abstract:
We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.
中文:风格匹配评分(SMS)方法将图像风格化重新定义为风格分布匹配问题,通过渐进频谱正则化和语义感知梯度优化,有效平衡风格转换与内容保留,性能优于现有技术。
English: The Style Matching Score (SMS) method reframes image stylization as a style distribution matching problem, utilizing Progressive Spectrum Regularization and Semantic-Aware Gradient Refinement to effectively balance style transfer with content preservation, outperforming existing approaches.
Authors:Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, Shao-Yuan Lo
Abstract:
Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images with the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. The link to our project page: https://bardisafa.github.io/PreSel
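A simplified sketch of the selection recipe described above: allocate a per-task image budget proportional to an (assumed given) task-importance score, then cluster each task's image features and keep the images closest to the cluster centers. The importance scores and feature extractor are placeholders, and the clustering details differ from the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def preselect(features_by_task: dict, importance: dict, total_budget: int, seed: int = 0):
    """features_by_task: {task: (n_i, d) array}; importance: {task: float} (assumed given)."""
    tasks = list(features_by_task)
    weights = np.array([importance[t] for t in tasks], dtype=float)
    budgets = np.maximum(1, np.round(total_budget * weights / weights.sum()).astype(int))

    selected = {}
    for task, budget in zip(tasks, budgets):
        X = features_by_task[task]
        k = min(int(budget), len(X))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # keep the most representative image per cluster (closest to its centroid)
        idx = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
            idx.append(int(members[d.argmin()]))
        selected[task] = sorted(idx)
    return selected
```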
Authors:Zhao-Heng Yin, Changhao Wang, Luis Pineda, Krishna Bodduluri, Tingfan Wu, Pieter Abbeel, Mustafa Mukadam
Abstract:
We introduce Geometric Retargeting (GeoRT), an ultrafast and principled neural hand retargeting algorithm for teleoperation, developed as part of our recent Dexterity Gen (DexGen) system. GeoRT converts human finger keypoints to robot hand keypoints at 1KHz, achieving state-of-the-art speed and accuracy with significantly fewer hyperparameters. This high-speed capability enables flexible postprocessing, such as leveraging a foundational controller for action correction like DexGen. GeoRT is trained in an unsupervised manner, eliminating the need for manual annotation of hand pairs. The core of GeoRT lies in novel geometric objective functions that capture the essence of retargeting: preserving motion fidelity, ensuring configuration space (C-space) coverage, maintaining a uniform response through high flatness, preserving pinch correspondence, and preventing self-collisions. This approach is free from intensive test-time optimization, offering a more scalable and practical solution for real-time hand retargeting.
Authors:Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
Abstract:
Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck; however, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily input query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: https://github.com/abdelfattah-lab/TokenButler
中文: TokenButler 提出了一种轻量级、查询感知的预测器,能动态识别KV缓存中的关键令牌,相比现有方法,在效率和准确率上提升了超过8%。
English: TokenButler introduces a lightweight, query-aware predictor that dynamically identifies critical tokens in the KV-Cache, significantly improving efficiency and accuracy over existing methods by over 8%.
Authors:Jimmy Gammell, Anand Raghunathan, Abolfazl Hashemi, Kaushik Roy
Abstract:
While cryptographic algorithms such as the ubiquitous Advanced Encryption Standard (AES) are secure, *physical implementations* of these algorithms in hardware inevitably 'leak' sensitive data such as cryptographic keys. A particularly insidious form of leakage arises from the fact that hardware consumes power and emits radiation in a manner that is statistically associated with the data it processes and the instructions it executes. Supervised deep learning has emerged as a state-of-the-art tool for carrying out *side-channel attacks*, which exploit this leakage by learning to map power/radiation measurements throughout encryption to the sensitive data operated on during that encryption. In this work we develop a principled deep learning framework for determining the relative leakage due to measurements recorded at different points in time, in order to inform *defense* against such attacks. This information is invaluable to cryptographic hardware designers for understanding *why* their hardware leaks and how they can mitigate it (e.g. by indicating the particular sections of code or electronic components which are responsible). Our framework is based on an adversarial game between a family of classifiers trained to estimate the conditional distributions of sensitive data given subsets of measurements, and a budget-constrained noise distribution which probabilistically erases individual measurements to maximize the loss of these classifiers. We demonstrate our method's efficacy and ability to overcome limitations of prior work through extensive experimental comparison with 8 baseline methods using 3 evaluation metrics and 6 publicly-available power/EM trace datasets from AES, ECC and RSA implementations. We provide an open-source PyTorch implementation of these experiments.
中文: 本研究开发了一种深度学习框架,用于识别加密硬件何时通过功耗和电磁辐射泄露敏感数据,从而帮助设计者精确定位漏洞并加强针对侧信道攻击的防护能力。
English: This research develops a deep learning framework to identify when cryptographic hardware leaks sensitive data through power consumption and electromagnetic radiation, enabling designers to pinpoint vulnerabilities and strengthen defenses against side-channel attacks.
Authors:Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low
Abstract:
Machine learning models often have uneven performance among subpopulations (a.k.a., groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then refines the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.
中文摘要:机器学习模型常在不同数据子群体间表现不均,本研究提出的群体鲁棒样本重加权(GSR)方法,通过两阶段策略利用有限标注数据优化未标注样本权重,以更高效的标签使用显著提升模型对群体分布变化的鲁棒性,性能优于现有最佳方法。
English summary: Machine learning models often struggle with performance disparities across subpopulations, and this study introduces Group-robust Sample Reweighting (GSR), a lightweight two-stage method that enhances robustness to group shifts by efficiently utilizing limited labeled data to reweight unlabeled samples, outperforming prior approaches with equal or fewer labels.
Authors:Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
Abstract:
Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at https://github.com/Breezelled/COMODO .
中文摘要:COMODO是一种跨模态自监督蒸馏框架,通过将视频语义知识迁移至IMU传感器,无需标注数据即可实现高效且保护隐私的人类活动识别,并展现出卓越性能。
English Summary: COMODO is a self-supervised framework that transfers semantic knowledge from video to IMU sensors, enabling efficient and privacy-preserving human activity recognition without labeled data while achieving superior performance.
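The distillation objective described above can be sketched as matching the IMU student's similarity distribution over a queue of teacher features to the frozen video teacher's own distribution. The snippet below is a minimal, assumption-laden version; temperatures, queue maintenance, and normalization details differ in the released code.

```python
import torch
import torch.nn.functional as F

def cross_modal_distill_loss(imu_emb, video_emb, queue, t_student=0.1, t_teacher=0.05):
    """imu_emb: (B, d) student features; video_emb: (B, d) frozen teacher features;
    queue: (K, d) past teacher features. All are L2-normalized below."""
    imu = F.normalize(imu_emb, dim=-1)
    vid = F.normalize(video_emb, dim=-1).detach()
    q = F.normalize(queue, dim=-1).detach()

    # similarity distribution of each sample over the queue, for teacher and student
    p_teacher = F.softmax(vid @ q.T / t_teacher, dim=-1)
    log_p_student = F.log_softmax(imu @ q.T / t_student, dim=-1)

    # cross-entropy between teacher and student distributions (soft distillation)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()
```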
Authors:Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Abstract:
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as the NFEs and model parameters increase, eMIGM achieves performance comparable to state-of-the-art continuous diffusion models while requiring less than 40% of the NFEs. Additionally, on ImageNet 512x512, with only about 60% of the NFEs, eMIGM outperforms state-of-the-art continuous diffusion models. Code is available at https://github.com/ML-GSAI/eMIGM.
中文: eMIGM模型将掩码图像生成与扩散模型统一起来,在ImageNet基准测试中以更少的函数评估次数实现了优于现有方法的效率与性能表现。
English: The eMIGM model unifies masked image generation and diffusion approaches, demonstrating superior efficiency and performance on ImageNet benchmarks by requiring significantly fewer function evaluations than existing methods.
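Masked generative sampling of the kind unified above typically starts from an all-masked canvas and commits a growing fraction of tokens each step according to model confidence. The sketch below shows that generic, MaskGIT-style loop; it is not eMIGM's exact sampler, and `model` is assumed to map token ids to per-position logits.

```python
import math
import torch

@torch.no_grad()
def masked_sampling(model, seq_len, mask_id, steps=16, device="cpu"):
    """Generic iterative unmasking loop; `model(tokens) -> (1, seq_len, vocab_size)` logits."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)                                  # (1, L, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)         # per-position confidence, argmax

        is_masked = tokens == mask_id
        # tentatively fill every masked position with its argmax prediction
        filled = torch.where(is_masked, pred, tokens)

        # cosine schedule: how many positions should remain masked after this step
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        n_mask = int(frac * seq_len)
        if n_mask == 0:
            return filled

        # re-mask the least confident of the newly filled positions; committed tokens stay
        conf = torch.where(is_masked, conf, torch.full_like(conf, float("inf")))
        remask = conf.argsort(dim=-1)[:, :n_mask]               # lowest confidence first
        filled.scatter_(1, remask, mask_id)
        tokens = filled
    return tokens
```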
Authors:Spyros Kondylatos, Nikolaos Ioannis Bountos, Dimitrios Michail, Xiao Xiang Zhu, Gustau Camps-Valls, Ioannis Papoutsis
Abstract:
Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field. Code and weights are available at: https://github.com/Orion-AI-Lab/EOUncertaintyGeneralization.
中文: 计算机视觉的最新进展通过预训练表示不确定性实现了零样本不确定性估计,该技术在地球观测任务中展现出跨领域和地理位置的强大泛化能力,同时与任务特定不确定性及实际噪声敏感性保持一致。
English: Recent advances in computer vision enable zero-shot uncertainty estimation through pretrained representation uncertainty, which shows strong generalization in Earth Observation tasks across domains and geographic locations while aligning with task-specific uncertainties and real-world noise sensitivity.
Authors:Xiaotian Han, Tianlong Chen, Kaixiong Zhou, Zhimeng Jiang, Zhangyang Wang, Xia Hu
Abstract:
Deep neural networks are prone to various bias issues, jeopardizing their applications for high-stake decision-making. Existing fairness methods typically offer a fixed accuracy-fairness trade-off, since the weight of the well-trained model is a fixed point (fairness-optimum) in the weight space. Nevertheless, more flexible accuracy-fairness trade-offs at inference time are practically desired since: 1) stakes of the same downstream task can vary for different individuals, and 2) different regions have diverse laws or regulations for fairness. If using the previous fairness methods, we have to train multiple models, each offering a specific level of accuracy-fairness trade-off. This is often computationally expensive, time-consuming, and difficult to deploy, making it less practical for real-world applications. To address this problem, we propose You Only Debias Once (YODO) to achieve in-situ flexible accuracy-fairness trade-offs at inference time, using a single model that is trained only once. Instead of pursuing one individual fixed point (fairness-optimum) in the weight space, we aim to find a "line" in the weight space that connects the accuracy-optimum and fairness-optimum points using a single model. Points (models) on this line implement varying levels of accuracy-fairness trade-offs. At inference time, by manually selecting the specific position on the learned "line", our proposed method can achieve arbitrary accuracy-fairness trade-offs for different end-users and scenarios. Experimental results on tabular and image datasets show that YODO achieves flexible trade-offs between model accuracy and fairness, at ultra-low overheads. For example, if we need 100 levels of trade-off on the ACS dataset, YODO takes 3.53 seconds while training 100 fixed models consumes 425 seconds. The code is available at https://github.com/ahxt/yodo.
Chinese: 提出的“一次去偏”方法通过单一模型在推理时实现灵活的准确性与公平性权衡,无需训练多个模型,通过在权重空间中找到连接准确性和公平性最优点的直线来实现这一目标。
English: The proposed You Only Debias Once (YODO) method enables flexible accuracy-fairness trade-offs during inference using a single model, eliminating the need for multiple models by finding an optimal line in the weight space that connects accuracy and fairness optima.
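At inference time, picking a point on the "line" described above reduces to a convex combination of two sets of weights. In YODO the line is learned within a single training run; the sketch below only illustrates the inference-time step, with the two endpoint state dicts assumed to be given.

```python
import torch

def interpolate_state_dicts(sd_acc: dict, sd_fair: dict, lam: float) -> dict:
    """Weights at position `lam` on the line from the accuracy-optimum (lam=0.0)
    to the fairness-optimum (lam=1.0). Non-float buffers are copied unchanged."""
    assert 0.0 <= lam <= 1.0
    return {
        k: (1.0 - lam) * v + lam * sd_fair[k] if v.is_floating_point() else v
        for k, v in sd_acc.items()
    }

# usage sketch: model.load_state_dict(interpolate_state_dicts(sd_acc, sd_fair, lam=0.3))
```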
Authors:Mohammed Mahfoud, Ghait Boukachab, Michał Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin
Abstract:
Building predictive models for tabular data presents fundamental challenges, notably in scaling consistently, i.e., more resources translating to better performance, and generalizing systematically beyond the training data distribution. Designing decision tree models remains especially challenging given the intractably large search space, and most existing methods rely on greedy heuristics, while deep learning inductive biases expect a temporal or spatial structure not naturally present in tabular data. We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data, formulating decision tree construction as a sequential planning problem. We train a deep reinforcement learning (GFlowNet) policy to solve this problem, yielding a generative model that samples decision trees from the Bayesian posterior. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks derived from real-world data, robustness to distribution shifts, and anomaly detection, all while yielding interpretable models with shorter description lengths. Samples from the trained DT-GFN model can be ensembled to construct a random forest, and we further show that performance scales consistently with ensemble size, yielding ensembles of predictors that continue to generalize systematically.
中文摘要:DT-GFN采用深度强化学习方法将决策树构建转化为序列规划问题,在分类性能、鲁棒性和可解释性上超越现有方法,并能通过集成学习实现持续的性能提升。
English Summary: DT-GFN introduces a deep reinforcement learning approach to construct decision trees as sequential planning problems, outperforming existing methods in classification, robustness, and interpretability while scaling effectively with ensemble size.
Authors:Zenghao Guan, Yucan Zhou, Xiaoyan Gu
Abstract:
Traditional Federated Learning (FL) necessitates numerous rounds of communication between the server and clients, posing significant challenges including high communication costs, connection drop risks and susceptibility to privacy attacks. One-shot FL has become a compelling learning paradigm to overcome these drawbacks by enabling the training of a global server model via a single communication round. However, existing one-shot FL methods suffer from expensive computation costs on the server or clients and cannot deal with non-IID (Independent and Identically Distributed) data stably and effectively. To address these challenges, this paper proposes FedCGS, a novel Federated Learning algorithm that Captures Global feature Statistics leveraging pre-trained models. With global feature statistics, we achieve training-free and heterogeneity-resistant one-shot FL. Furthermore, we extend its application to the personalization scenario, where clients only need to execute one extra communication round with the server to download global statistics. Extensive experimental results demonstrate the effectiveness of our methods across diverse data heterogeneity settings. Code is available at https://github.com/Yuqin-G/FedCGS.
中文: 传统联邦学习存在高通信成本和隐私风险,单轮联邦学习虽能克服这些缺点,但面临计算开销大和非独立同分布数据处理的挑战;本文提出FedCGS算法,利用预训练模型捕获全局特征统计量,实现无需训练、抗异构的单轮联邦学习,并可扩展至个性化场景,仅需额外一轮通信。
English: Traditional Federated Learning faces high communication costs and privacy risks, which one-shot FL addresses but struggles with computational expenses and non-IID data; this paper introduces FedCGS, a method that uses global feature statistics from pre-trained models for efficient, heterogeneity-resistant one-shot FL and personalization with minimal extra communication.
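The training-free idea above can be sketched as follows: each client passes its data through a frozen pre-trained encoder and reports per-class feature sums and counts; the server aggregates them into global class means and classifies new samples by nearest class mean. This is a simplified stand-in for the statistics FedCGS actually captures, with the encoder assumed given.

```python
import torch

@torch.no_grad()
def client_statistics(encoder, loader, num_classes, feat_dim):
    """One client's contribution: per-class feature sums and counts from a frozen encoder."""
    sums = torch.zeros(num_classes, feat_dim)
    counts = torch.zeros(num_classes)
    for x, y in loader:
        f = encoder(x)                                     # (B, feat_dim)
        sums.index_add_(0, y, f)
        counts.index_add_(0, y, torch.ones_like(y, dtype=torch.float))
    return sums, counts

def aggregate(client_stats):
    """Single-round server aggregation into global class means."""
    total_sums = sum(s for s, _ in client_stats)
    total_counts = sum(c for _, c in client_stats)
    return total_sums / total_counts.clamp(min=1).unsqueeze(1)   # (num_classes, feat_dim)

@torch.no_grad()
def predict(encoder, x, class_means):
    f = encoder(x)                                         # (B, feat_dim)
    return torch.cdist(f, class_means).argmin(dim=1)       # nearest-class-mean classifier
```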
Authors:Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung
Abstract:
Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting that the importance of each block differs depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution, building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VPT methods; e.g., PRO-VPT surpasses VPT by 1.6% average accuracy, leading prompt-based methods to state-of-the-art performance on the VTAB-1k benchmark. The code is available at https://github.com/ckshang/PRO-VPT.
中文摘要:本文提出PRO-VPT框架,通过迭代式提示重定位策略自适应优化提示分布,显著提升视觉提示微调性能,在多个基准测试中达到最先进水平。
English Summary: This paper introduces PRO-VPT, an adaptive prompt distribution optimization framework that iteratively relocates prompts between model blocks to enhance visual prompt tuning performance, achieving state-of-the-art results on multiple benchmarks.
Authors:Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He
Abstract:
Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.
中文: ResMoE是一种创新框架,通过利用Wasserstein质心减少专家参数高达75%,无需重新训练即可提升混合专家Transformer的空间效率,同时保持性能。
English: ResMoE is an innovative framework that improves space efficiency in Mixture-of-Experts Transformers by using Wasserstein barycenter to reduce parameters by up to 75% while maintaining performance, without requiring retraining.
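A rough sketch of the compression idea: take a shared anchor expert (here a simple parameter-wise mean, standing in for the Wasserstein-barycenter expert) and store each expert only as a low-rank approximation of its residual from that anchor. The rank, and the use of a plain mean, are simplifying assumptions rather than ResMoE's actual construction.

```python
import torch

def compress_experts(expert_weights: list[torch.Tensor], rank: int = 16):
    """expert_weights: list of (out, in) matrices for one layer across experts.

    Returns the shared anchor plus low-rank (U, V) residual factors per expert."""
    anchor = torch.stack(expert_weights).mean(dim=0)        # mean stands in for the barycenter
    factors = []
    for W in expert_weights:
        U, S, Vh = torch.linalg.svd(W - anchor, full_matrices=False)
        r = min(rank, S.numel())
        factors.append((U[:, :r] * S[:r], Vh[:r]))          # (out, r), (r, in)
    return anchor, factors

def reconstruct(anchor, factor):
    U, Vh = factor
    return anchor + U @ Vh                                  # approximate original expert weight
```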
Authors:Ta Duc Huy, Sen Kim Tran, Phan Nguyen, Nguyen Hoang Tran, Tran Bao Sam, Anton van den Hengel, Zhibin Liao, Johan W. Verjans, Minh-Son To, Vu Minh Hieu Phan
Abstract:
The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanations by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4.5% across three biomedical datasets. Our code is released at https://github.com/tadeephuy/InteractCSR.
Chinese: 本文提出了基于概念的相似性推理网络(CSR),通过提供具有内在概念解释的补丁级原型和空间交互性,增强了医学影像的可解释性,并在三个生物医学数据集上比现有最优方法性能提升高达4.5%。
English: The paper introduces the Concept-based Similarity Reasoning network (CSR), which provides patch-level prototypes with intrinsic concept interpretation and spatial interactivity, improving interpretability in medical imaging and outperforming prior methods by up to 4.5% on biomedical datasets.
Authors:Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin
Abstract:
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, the Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after the cold start, we propose a Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with a hard-formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of approximately 6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI o1. The datasets and code will be released at: https://github.com/Osilly/Vision-R1.
中文: Vision-R1通过构建无需人工标注的20万规模多模态思维链数据集,并采用渐进式训练策略,有效提升了多模态数学推理能力,在多个基准测试中表现优异。
English: Vision-R1 is a multimodal reasoning model that enhances reasoning capabilities by creating a high-quality 200K multimodal CoT dataset without human annotation and employing progressive training strategies, achieving significant improvements on math reasoning benchmarks.
Authors:Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract:
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples and a 0.5B model trained on the full data can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average improvement of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.
中文: PFDial数据集基于440个UML流程图构建了12,705条中文对话指令,实验表明仅用800样本训练的7B模型和全量训练的0.5B模型均能突破90%准确率,在流程驱动对话任务中最高可超越GPT-4o达43.88%。
English: The PFDial dataset, comprising 12,705 Chinese dialogue instructions derived from 440 UML flowcharts, enables small models like 7B and 0.5B to achieve over 90% accuracy in process-driven dialogue tasks, even surpassing GPT-4o by up to 43.88%.
Authors:AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian Shen, Chengshi Shi, Mingkang Shi, Modi Shi, Chonghao Sima, Jianheng Song, Huijie Wang, Wenhao Wang, Dafeng Wei, Chengen Xie, Guo Xu, Junchi Yan, Cunbiao Yang, Lei Yang, Shukai Yang, Maoqing Yao, Jia Zeng, Chi Zhang, Qinglin Zhang, Bin Zhao, Chengyue Zhao, Jiaqi Zhao, Jianchao Zhu
Abstract:
We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.
中文: 本研究推出了AgiBot World大规模机器人数据集平台和GO-1通用策略,通过利用海量高质量数据显著提升了机器人操作性能,在复杂任务中表现优异。
English: This research introduces AgiBot World, a large-scale platform with over 1 million robot trajectories, and GO-1, a generalist policy that significantly improves robotic manipulation performance by leveraging extensive, high-quality data.
Authors:Yu Zhou, Bingyan Liu
Abstract:
Federated Learning (FL) enables multiple clients to collaboratively develop a global model while maintaining data privacy. However, online FL deployment faces challenges due to distribution shifts and evolving test samples. Personalized Federated Learning (PFL) tailors the global model to individual client distributions, but struggles with Out-Of-Distribution (OOD) samples during testing, leading to performance degradation. In real-world scenarios, balancing personalization and generalization during online testing is crucial and existing methods primarily focus on training-phase generalization. To address the test-time trade-off, we introduce a new scenario: Test-time Generalization for Internal and External Distributions in Federated Learning (TGFL), which evaluates adaptability under Internal Distribution (IND) and External Distribution (EXD). We propose BTFL, a Bayesian-based test-time generalization method for TGFL, which balances generalization and personalization at the sample level during testing. BTFL employs a two-head architecture to store local and global knowledge, interpolating predictions via a dual-Bayesian framework that considers both historical test data and current sample characteristics with theoretical guarantee and faster speed. Our experiments demonstrate that BTFL achieves improved performance across various datasets and models with less time cost. The source codes are made publicly available at https://github.com/ZhouYuCS/BTFL .
中文摘要:联邦学习面临分布偏移和测试时适应性挑战,BTFL方法通过贝叶斯框架在测试阶段平衡个性化与泛化能力,在不同数据集上实现了更好性能。
English Summary: Federated Learning faces challenges with distribution shifts and test-time adaptability, which the proposed BTFL method addresses by balancing personalization and generalization using a Bayesian framework for improved performance across datasets.
Authors:Jinmyeong An, Sangwon Ryu, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Abstract:
Online grooming is a severe social threat where sexual predators gradually entrap child victims with subtle and gradual manipulation. Therefore, timely intervention for online grooming is critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., they jump to conclusions) because they rely on chat-level risk labels, which provide only weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL) (the code and supplementary materials are available at https://github.com/jinmyeongAN/SCoRL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator's turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on the turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in the previously used chat-level metrics. Experimental results show that SCoRL effectively preempted online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.
中文摘要:本研究提出的SCoRL方法通过引入话轮级风险评估和速度控制奖励机制,能主动识别网络诱骗的最佳干预时机,有效解决了以往基于聊天级标签方法导致的监管滞后问题。
English Summary: The proposed SCoRL method uses reinforcement learning with turn-level risk assessment to proactively identify optimal intervention moments in online grooming, overcoming previous methods' reliance on chat-level labels that caused delayed detection.
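The speed-accuracy trade-off described above can be illustrated with a toy per-turn reward: a correct intervention earns a bonus that decays with how many risky turns have already passed, while false alarms and missed risky turns are penalized. The coefficients and the decay schedule below are hypothetical, not the paper's reward function.

```python
def speed_control_reward(intervened: bool, turn_is_risky: bool, risky_turns_so_far: int,
                         correct_bonus: float = 1.0, decay: float = 0.1,
                         false_alarm_penalty: float = 1.0, miss_penalty: float = 0.5) -> float:
    """Toy turn-level reward balancing early intervention against accuracy."""
    if intervened and turn_is_risky:
        # reward shrinks the longer risky behaviour has gone unaddressed
        return max(0.0, correct_bonus - decay * risky_turns_so_far)
    if intervened and not turn_is_risky:
        return -false_alarm_penalty          # intervened too early / wrongly
    if not intervened and turn_is_risky:
        return -miss_penalty                 # risky turn left unaddressed
    return 0.0
```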
Authors:Hasan Abed Al Kader Hammoud, Bernard Ghanem
Abstract:
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
中文摘要:DiffCLIP通过在CLIP模型中引入差分注意力机制,以极少的计算开销显著提升了图像-文本理解任务的性能,同时保持了高效性。
English Summary: DiffCLIP enhances the CLIP model by integrating a differential attention mechanism, which improves performance on image-text tasks with minimal computational cost and no efficiency loss.
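Differential attention, as adopted here from the language-model literature, computes two softmax attention maps and subtracts a learned multiple of the second from the first, cancelling common-mode noise. Below is a minimal single-head sketch of that core operation; the full formulation includes additional normalization and a learnable reparameterization of lambda, omitted here, and DiffCLIP applies it inside CLIP's image and text encoders.

```python
import math
import torch
import torch.nn as nn

class DifferentialAttention(nn.Module):
    """Single-head differential attention: softmax(Q1 K1^T) - lambda * softmax(Q2 K2^T)."""

    def __init__(self, dim: int, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        self.q = nn.Linear(dim, 2 * head_dim, bias=False)
        self.k = nn.Linear(dim, 2 * head_dim, bias=False)
        self.v = nn.Linear(dim, head_dim, bias=False)
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.head_dim = head_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, N, dim)
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        v = self.v(x)
        scale = 1.0 / math.sqrt(self.head_dim)
        a1 = torch.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = torch.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        return (a1 - self.lam * a2) @ v                        # noise-cancelled attention output
```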
Authors:Ruchi Bhatt, Shreya Bansal, Amanpreet Chander, Rupinder Kaur, Malya Singh, Mohan Kankanhalli, Abdulmotaleb El Saddik, Mukesh Kumar Saini
Abstract:
Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We proposed a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluated the crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at https://github.com/mriglab/GroMo-Plant-Growth-Modeling-with-Multiview-Images.
中文: GroMo挑战赛通过提供多视角图像数据集,提出植物年龄预测和叶片计数两项任务以推动农业表型研究,其MVVT模型实现了较低误差率。
English: The GroMo Challenge introduces a multiview image dataset and tasks for predicting plant age and leaf count to advance agricultural phenotyping, with a proposed MVVT model achieving low error rates.
Authors:Shinnosuke Matsuo, Riku Togashi, Ryoma Bise, Seiichi Uchida, Masahiro Nomura
Abstract:
Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost. This code is available at https://github.com/matsuo-shinnosuke/ISOAL.
Chinese: ISO框架通过动态选择实例及其标注级别,利用优化的价值成本比和多样性,以更低成本实现了更高的分类准确率。
English: The ISO framework enhances active learning by dynamically selecting both instances and their annotation levels, achieving superior accuracy at lower cost through optimized value-to-cost ratios and diversity.
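To make the value-to-cost-ratio criterion concrete, here is a small greedy-selection sketch under a fixed budget. The per-instance value and cost estimates, the diversity bonus, and the acquisition rule are assumptions for illustration; they are not ISO's actual optimization criterion.
```python
import numpy as np

# Hypothetical greedy selection by value-to-cost ratio (VCR) with a
# diversity bonus, assuming precomputed value/cost estimates per
# (instance, supervision level) pair.

def select_by_vcr(values, costs, features, budget, div_weight=0.1):
    """values/costs: (n, L) over L supervision levels; features: (n, d)."""
    n, L = values.shape
    chosen, spent = [], 0.0
    while True:
        best, best_score = None, -np.inf
        for i in range(n):
            if any(i == j for j, _ in chosen):
                continue
            for lvl in range(L):
                if spent + costs[i, lvl] > budget:
                    continue
                vcr = values[i, lvl] / costs[i, lvl]
                # diversity: distance to the nearest already-selected instance
                if chosen:
                    d = min(np.linalg.norm(features[i] - features[j])
                            for j, _ in chosen)
                else:
                    d = 1.0
                score = vcr + div_weight * d
                if score > best_score:
                    best, best_score = (i, lvl), score
        if best is None:
            break
        chosen.append(best)
        spent += costs[best]
    return chosen  # list of (instance index, supervision level)


rng = np.random.default_rng(0)
vals, csts = rng.random((20, 2)), rng.uniform(0.2, 1.0, (20, 2))
feats = rng.random((20, 8))
print(select_by_vcr(vals, csts, feats, budget=3.0))
```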
Authors:Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Abstract:
Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts, such as missing entities, attribute binding errors, and incorrect relationships, remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing only the constraints extracted from the text, without any unnecessary additions. These constraints are formulated as losses (entity missing, entity mixing, attribute binding, and spatial relationships) integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by the initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied in a second generation phase. Experimental results demonstrate that our method, relying solely on the proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at https://github.com/hadi-hosseini/noise-refinement.
中文摘要:本文提出一种无需训练的方法,通过将文本约束整合为统一损失函数并采用反馈驱动的噪声优化系统,显著提升了文本到图像生成的组合能力与空间关系准确性。
English Summary: This paper introduces a training-free method that enhances text-to-image generation by integrating textual constraints as unified losses and employing a feedback-driven noise refinement system, significantly improving compositionality and spatial accuracy.
Authors:Yu Jin, Jingming Liu, Zhexu Luo, Yifei Peng, Ziang Qin, Wang-Zhou Dai, Yao-Xiang Ding, Kun Zhou
Abstract:
Visual generative abductive learning studies jointly training a symbol-grounded neural visual generator and inducing logic rules from data, such that after learning, the visual generation process is guided by the induced logic rules. A major challenge for this task is to reduce the time cost of logic abduction during learning, an essential step when the logic symbol set is large and the logic rule to induce is complicated. To address this challenge, we propose a pre-training method for obtaining a meta-rule selection policy for the recently proposed visual generative learning approach AbdGen [Peng et al., 2023], aiming at significantly reducing the candidate meta-rule set and pruning the search space. The selection model is built based on the embedding representation of both symbol grounding of cases and meta-rules, which can be effectively integrated with both the neural model and the logic reasoning system. The pre-training process is done on pure symbol data, not involving symbol grounding learning of raw visual inputs, making the entire learning process low-cost. An additional interesting observation is that the selection policy can rectify symbol grounding errors unseen during pre-training, which results from the memorization ability of the attention mechanism and the relative stability of symbolic patterns. Experimental results show that our method is able to effectively address the meta-rule selection problem for visual abduction, boosting the efficiency of visual generative abductive learning. Code is available at https://github.com/future-item/metarule-select.
中文摘要:本研究提出一种预训练方法,通过元规则选择策略缩减逻辑规则搜索空间,在无需视觉输入的情况下低成本提升视觉生成溯因学习的效率,并能修正符号基础错误。
English Summary: This study introduces a pre-training method to enhance visual generative abductive learning by reducing logic rule search space through meta-rule selection, significantly improving efficiency without compromising performance.
Authors:Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Abstract:
Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.
中文: 大语言模型在训练过程中先习得各语言特有知识,随后建立跨语言关联,并从词汇层面学习逐步过渡到掌握更高层次的抽象概念。
English: Large Language Models initially develop language-specific knowledge and then learn cross-linguistic patterns, progressing from token-level information to higher-level abstract concepts during training.
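The analysis above relies on sparse autoencoders trained on hidden states at multiple checkpoints. The sketch below shows the standard form of such an autoencoder (overcomplete dictionary plus an L1 sparsity penalty); the dimensions and penalty weight are placeholders rather than the paper's settings.
```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder of the kind commonly trained on LLM hidden
# states for interpretability; hyperparameters here are illustrative.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))      # sparse feature activations
        return self.decoder(z), z


def sae_loss(recon, h, z, l1_weight=1e-3):
    return ((recon - h) ** 2).mean() + l1_weight * z.abs().mean()


sae = SparseAutoencoder(d_model=768, d_dict=4 * 768)
h = torch.randn(256, 768)                    # residual-stream activations
recon, z = sae(h)
loss = sae_loss(recon, h, z)
loss.backward()
print(loss.item())
```
Training one such autoencoder per checkpoint and comparing which dictionary features activate on which inputs is the basic workflow the abstract describes.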
Authors:Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Prakash Panangaden, Christopher G. Lucas, David Abel, Stefano V. Albrecht
Abstract:
Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we explore the principles that underlie effective representations for the actor and for the critic in on-policy algorithms. We focus our study on understanding whether the actor and critic will benefit from separate, rather than shared, representations. Our primary finding is that when separated, the representations for the actor and critic systematically specialise in extracting different types of information from the environment -- the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. We conduct a rigorous empirical study to understand how different representation learning approaches affect the actor and critic's specialisations and their downstream performance, in terms of sample efficiency and generation capabilities. Finally, we discover that a separated critic plays an important role in exploration and data collection during training. Our code, trained models and data are accessible at https://github.com/francelico/deac-rep.
中文: 本研究探讨了行动者-评论家算法中分离表征的效益,发现独立表征使行动者专注动作相关信息而评论家专精价值和动态信息,从而提升样本效率和探索能力。
English: This study investigates whether actor-critic algorithms benefit from separate representations for the actor and critic, finding that specialized representations enable the actor to focus on action-relevant information while the critic encodes value and dynamics, improving sample efficiency and exploration.
Authors:Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan, Wei Wu, Xi Jin
Abstract:
General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or operand reuse strategies. However, considering the interaction between matrix multiplication and multiply-accumulators (MACs) offers greater optimization potential. This work introduces a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of MACs. We propose a finer-grained TPE notation using matrix triple loops as an example, introducing new methods for designing and optimizing PE microarchitectures. Based on this notation and its transformations, we propose four optimization techniques that improve timing, area, and power consumption. Implementing our design in RTL using the SMIC-28nm process, we evaluate its effectiveness across four classic TPE architectures: systolic array, 3D-Cube, multiplier-adder tree, and 2D-Matrix. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively. Applied to a bit-slice architecture, our approach achieves a 12.10x improvement in energy efficiency and 2.85x in area efficiency compared to Laconic. Our Verilog HDL code, along with timing, area, and power reports, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines
中文摘要:本研究通过聚焦矩阵乘法中的比特权重维度,提出了一种新的张量处理器硬件优化方法,在多种架构上实现了面积和能效的显著提升。
English Summary: This work introduces a novel hardware optimization approach for tensor processing engines by focusing on bit-weight dimensions in matrix multiplication, achieving significant improvements in area and energy efficiency across multiple architectures.
Authors:Mohit Pandey, Gopeshh Subbaraj, Artem Cherkasov, Martin Ester, Emmanuel Bengio
Abstract:
Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. We pretrain A-GFN on a subset of the ZINC dataset, and by employing robust evaluation metrics we show the effectiveness of our approach when compared to other relevant baseline methods for a wide range of drug design tasks. The code is accessible at https://github.com/diamondspark/AGFN.
Chinese: 原子生成流网络(A-GFNs)提出了一种基础生成模型,以单个原子为构建单元,通过基于代理奖励的无监督预训练和目标导向微调,实现了对类药化学空间更全面的探索。
English: Atomic GFlowNets (A-GFNs) introduce a foundational generative model that uses individual atoms as building blocks, enabling broader exploration of drug-like chemical space through unsupervised pre-training with proxy rewards and goal-conditioned fine-tuning for specific properties.
Authors:Thomas Winninger, Boussad Addad, Katarzyna Kapusta
Abstract:
Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces - sets of feature vectors that do not trigger the model's refusal mechanisms - then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, compared to existing techniques that often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
中文摘要:本研究提出了一种新颖的白盒对抗攻击方法,通过结合机制可解释性与梯度优化,将模型嵌入从拒绝子空间高效重定向至接受子空间,在先进大语言模型上实现80-95%的攻击成功率,仅需数分钟且显著降低计算成本。
English Summary: This study introduces a novel white-box adversarial attack method that combines mechanistic interpretability with gradient optimization to efficiently reroute model embeddings from refusal to acceptance subspaces, achieving high success rates of 80-95% on advanced LLMs within minutes while significantly reducing computational costs.
Authors:Aditya Shankar, Lydia Y. Chen, Arie van Deursen, Rihan Hai
Abstract:
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
中文: WaveStitch是一种基于扩散的新方法,通过同时结合元数据和部分观测信号进行双源调节,采用混合架构提升适应性,并在保持时间连贯性的前提下实现快速并行生成,从而在精度和速度上均优于现有技术。
English: WaveStitch is a novel diffusion-based method that overcomes key limitations in temporal data generation by conditioning on both metadata and observed signals, employing a hybrid architecture for adaptability, and enabling fast parallel generation while maintaining temporal coherence, achieving superior accuracy and speed.
Authors:Li weile, Liu Xiao
Abstract:
Time series models face significant challenges in scaling to handle large and complex datasets, akin to the scaling achieved by large language models (LLMs). The unique characteristics of time series data and the computational demands of model scaling necessitate innovative approaches. While researchers have explored various architectures such as Transformers, LSTMs, and GRUs to address these challenges, we propose a novel solution using RWKV-7, which incorporates meta-learning into its state update mechanism. By integrating RWKV-7's time mix and channel mix components into the transformer-based time series model Timer, we achieve a substantial performance improvement of approximately 1.13x to 43.3x and a 4.5x reduction in training time, while using only 1/23 of the parameters. Our code and model weights are publicly available for further research and development at https://github.com/Alic-Li/BlackGoose_Rimer.
中文: 研究者将RWKV-7的元学习架构整合到Timer时序模型中,在减少参数量的同时实现了最高43倍性能提升和4.5倍训练加速。
English: Researchers propose integrating RWKV-7's meta-learning architecture into the Timer model, achieving up to 43x performance gains and 4.5x faster training with significantly fewer parameters.
Authors:Hoang-Thang Ta, Anh Tran
Abstract:
Kolmogorov-Arnold Networks (KANs) have inspired numerous works exploring their applications across a wide range of scientific problems, with the potential to replace Multilayer Perceptrons (MLPs). While many KANs are designed using basis and polynomial functions, such as B-splines, ReLU-KAN utilizes a combination of ReLU functions to mimic the structure of B-splines and take advantage of ReLU's speed. However, ReLU-KAN is not built for multiple inputs, and its limitations stem from ReLU's handling of negative values, which can restrict feature extraction. To address these issues, we introduce Activation Function-Based Kolmogorov-Arnold Networks (AF-KAN), expanding ReLU-KAN with various activations and their function combinations. This novel KAN also incorporates parameter reduction methods, primarily attention mechanisms and data normalization, to enhance performance on image classification datasets. We explore different activation functions, function combinations, grid sizes, and spline orders to validate the effectiveness of AF-KAN and determine its optimal configuration. In the experiments, AF-KAN significantly outperforms MLP, ReLU-KAN, and other KANs with the same parameter count. It also remains competitive even when using 6 to 10 times fewer parameters while maintaining the same network structure. However, AF-KAN requires a longer training time and consumes more FLOPs. The repository for this work is available at https://github.com/hoangthangta/All-KAN.
Chinese: AF-KAN提出了一种新型的科尔莫戈罗夫-阿诺德网络,采用多种激活函数和参数精简方法,在图像分类任务中以更少参数显著优于MLP和其他KAN,但需要更长的训练时间。
English: AF-KAN introduces a novel Kolmogorov-Arnold Network that utilizes various activation functions and parameter reduction methods, significantly outperforming MLPs and other KANs in image classification while requiring fewer parameters but longer training times.
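To give a flavor of activation-based KAN layers, the sketch below represents each learnable edge function as a weighted sum of a chosen activation applied over a grid of shifts. The grid size, activation choice, and the absence of attention-based parameter reduction are simplifications; this is not AF-KAN's exact layer.
```python
import torch
import torch.nn as nn

# A loose, illustrative KAN-style layer in the spirit of ReLU-KAN/AF-KAN:
# edge functions are mixtures of a base activation evaluated at learned shifts.

class ActivationKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, grid_size=8,
                 act=torch.nn.functional.silu):
        super().__init__()
        self.act = act
        # one shift per grid point, shared across inputs
        self.shifts = nn.Parameter(torch.linspace(-1, 1, grid_size))
        # coefficients mixing the basis responses into each output
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, grid_size) * 0.1)

    def forward(self, x):                                  # x: (B, in_dim)
        basis = self.act(x.unsqueeze(-1) - self.shifts)    # (B, in_dim, G)
        return torch.einsum("big,oig->bo", basis, self.coef)


layer = ActivationKANLayer(in_dim=16, out_dim=8)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```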
Authors:Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li
Abstract:
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
中文: SmartBench是首个针对中文移动场景下设备端大语言模型的评估基准,涵盖五大功能类别并提供自动化评估标准,旨在弥补实际应用场景中的评估空白。
English: SmartBench is the first benchmark designed to evaluate on-device LLMs in Chinese mobile contexts, covering five key functional categories and providing automated evaluation criteria to address the gap in practical usage scenarios.
Authors:Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Abstract:
Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
中文摘要:本研究通过提出免训练的拉普拉斯视觉提示技术,从预训练模型中提取多层深度信息以解决透明场景中的深度模糊问题,并基于新基准验证了其在三维空间理解中的有效性。
English Summary: This research introduces a training-free Laplacian Visual Prompting technique to address depth ambiguity in transparent scenes by extracting multi-layer depth from pre-trained models, validated through a new benchmark for robust 3D spatial understanding.
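The core idea, passing a Laplacian-filtered copy of the RGB image through a frozen depth model alongside the standard RGB pass, can be sketched as follows. The specific Laplacian kernel, the rescaling step, and the `depth_model` stand-in are assumptions for illustration, not the paper's exact pipeline.
```python
import torch
import torch.nn.functional as F

# Illustrative Laplacian Visual Prompting: filter the RGB input, then query a
# frozen, pretrained depth estimator with both the original and filtered image.

def laplacian_prompt(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: (B, 3, H, W) in [0, 1]; returns the Laplacian-transformed image."""
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)
    kernel = kernel.repeat(3, 1, 1, 1)                  # depthwise over RGB
    lap = F.conv2d(rgb, kernel, padding=1, groups=3)
    # rescale to [0, 1] so the frozen model sees a valid image range
    lap = (lap - lap.amin()) / (lap.amax() - lap.amin() + 1e-8)
    return lap


def multilayer_depth(rgb, depth_model):
    d_rgb = depth_model(rgb)                    # standard single-layer estimate
    d_lvp = depth_model(laplacian_prompt(rgb))  # depth hinted by the spectral prompt
    return torch.stack([d_rgb, d_lvp], dim=1)   # (B, 2, H, W) multi-layer depth


x = torch.rand(1, 3, 64, 64)
print(laplacian_prompt(x).shape)  # torch.Size([1, 3, 64, 64])
```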
Authors:Nils Graef, Andrew Wasielewski
Abstract:
Slim attention shrinks the context memory size by 2x for transformer models with MHA (multi-head attention), which can speed up inference by up to 2x for large context windows.
Slim attention is an exact, mathematically identical implementation of the standard attention mechanism and therefore doesn't compromise model accuracy. In other words, slim attention losslessly compresses the context memory by a factor of 2.
For encoder-decoder transformers, the context memory size can be reduced even further: for the Whisper models, for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x at batch size 64.
For the T5-11B model, the memory can even be reduced by 32x because its MHA projection dimension is larger than the embedding dimension.
See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks, and https://www.youtube.com/watch?v=uVtk3B6YO4Y for this paper's YouTube video.
Chinese: Slim注意力是一种内存高效的多头注意力实现,可将标准Transformer的上下文内存减少2倍,特定模型最多减少32倍,在保持完全准确性的同时显著加速推理。
English: Slim attention is a memory-efficient implementation of multi-head attention that reduces context memory size by up to 2x for standard transformers and up to 32x for certain models while maintaining full accuracy and accelerating inference.
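The identity that makes this lossless is worth spelling out: for standard MHA, K = X W_K and V = X W_V, so when W_K is square and invertible, V can be recovered from K alone as V = K (W_K^{-1} W_V), and the cache only needs K. The sketch below checks this numerically; shapes are illustrative and the actual implementation details (per-head handling, fused kernels) are omitted.
```python
import torch

# Numerical check of V = K @ (W_K^{-1} @ W_V) for square projections.
torch.manual_seed(0)
d_model = 64
X = torch.randn(10, d_model, dtype=torch.float64)    # token representations
W_K = torch.randn(d_model, d_model, dtype=torch.float64)
W_V = torch.randn(d_model, d_model, dtype=torch.float64)

K = X @ W_K                                  # the only tensor that gets cached
V = X @ W_V                                  # what we avoid caching

W_KV = torch.linalg.solve(W_K, W_V)          # W_K^{-1} @ W_V, precomputed once
V_recovered = K @ W_KV

print(torch.allclose(V, V_recovered, atol=1e-6))  # True (up to float error)
```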
Authors:Yiming Li, Kaiying Yan, Shuo Shao, Tongqing Zhai, Shu-Tao Xia, Zhan Qin, Dacheng Tao
Abstract:
With the increasing adoption of deep learning in speaker verification, large-scale speech datasets have become valuable intellectual property. To audit and prevent the unauthorized usage of these valuable released datasets, especially in commercial or open-source scenarios, we propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW), enabling dataset owners to determine whether a suspicious third-party model has been trained on a protected dataset under a black-box setting. The CBW method consists of two key stages: dataset watermarking and ownership verification. During watermarking, we implant multiple trigger patterns in the dataset to make similar samples (measured by their feature similarities) close to the same trigger while dissimilar samples are near different triggers. This ensures that any model trained on the watermarked dataset exhibits specific misclassification behaviors when exposed to trigger-embedded inputs. To verify dataset ownership, we design a hypothesis-test-based framework that statistically evaluates whether a suspicious model exhibits the expected backdoor behavior. We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks. The code for reproducing main experiments is available at https://github.com/Radiant0726/CBW
中文: 本文提出了一种基于聚类的后门水印方法,通过在数据集中植入触发模式并利用假设检验来验证黑盒场景下第三方模型是否未经授权使用了受保护数据集。
English: This paper introduces a clustering-based backdoor watermark (CBW) method for verifying dataset ownership in speaker verification, which implants triggers during dataset watermarking and uses hypothesis testing to detect unauthorized model usage under black-box settings.
Authors:Yihang Wu, Ahmad Chaddad, Christian Desrosiers, Tareef Daqqaq, Reem Kateb
Abstract:
Despite the remarkable performance of vision language models (VLMs) such as Contrastive Language Image Pre-training (CLIP), the large size of these models is a considerable obstacle to their use in federated learning (FL) systems where the parameters of local client models need to be transferred to a global server for aggregation. Another challenge in FL is the heterogeneity of data from different clients, which affects the generalization performance of the solution. In addition, VLMs pre-trained on natural images exhibit poor generalization ability on medical datasets, suggesting that a domain gap exists. To solve these issues, we introduce a novel method for the Federated Adversarial Adaptation (FAA) of CLIP. Our method, named FAA-CLIP, handles the large communication costs of CLIP using a light-weight feature adaptation module (FAM) for aggregation, effectively adapting this VLM to each client's data while greatly reducing the number of parameters to transfer. By keeping CLIP frozen and only updating the FAM parameters, our method is also computationally efficient. Unlike existing approaches, our FAA-CLIP method directly addresses the problem of domain shifts across clients via a domain adaptation (DA) module. This module employs a domain classifier to predict whether a given sample is from the local client or the global server, allowing the model to learn domain-invariant representations. Extensive experiments on six different datasets containing both natural and medical images demonstrate that FAA-CLIP can generalize well on both natural and medical datasets compared to recent FL approaches. Our codes are available at https://github.com/AIPMLab/FAA-CLIP.
Chinese: FAA-CLIP通过轻量级特征适配模块降低联邦学习中的通信成本,并利用域适配模块处理数据异构性和领域差异,在自然与医学数据集上均实现了优异的泛化性能。
English: FAA-CLIP introduces a lightweight feature adaptation module to reduce communication costs in federated learning and employs a domain adaptation module to handle data heterogeneity and domain shifts, achieving superior generalization on both natural and medical datasets.
Authors:Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mostafa Rifat Tazwar, Md Jobayer, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Abstract:
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains, largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning approach that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with Fisher matrix regularization. Using MIMIC-CXR and Open-I datasets, our model, CSTRL - Context-driven Sequential TRansfer Learning - achieved state-of-the-art performance, showing a 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROUGE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at https://github.com/fahmidahossain/Report_Summarization.
中文: 本研究提出CSTRL模型,通过Fisher矩阵正则化的顺序迁移学习方法解决医学报告摘要中的专业术语和临床语境难题,在MIMIC-CXR和Open-I数据集上实现了最先进的性能,各项评估指标显著提升。
English: This study introduces CSTRL, a context-driven sequential transfer learning model that overcomes challenges in medical report summarization by using Fisher matrix regularization to prevent knowledge loss, achieving state-of-the-art performance on MIMIC-CXR and Open-I datasets with significant improvements in BLEU and ROUGE scores.
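For context, Fisher-matrix regularization of the kind mentioned above is commonly implemented in the EWC style: estimate a diagonal Fisher after one training stage, then anchor high-Fisher parameters to their old values in the next stage. The sketch below follows that standard recipe; CSTRL's exact regularizer and hyperparameters may differ.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diagonal_fisher(model, batches, loss_fn):
    """Estimate the diagonal Fisher from squared gradients over `batches`."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(batches), 1) for n, f in fisher.items()}


def fisher_penalty(model, old_params, fisher, lam=1.0):
    """Quadratic penalty anchoring important parameters to their old values."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam * loss


# Toy usage: estimate Fisher after stage one, then regularize stage two.
model = nn.Sequential(nn.Linear(10, 4))
batches = [(torch.randn(8, 10), torch.randint(0, 4, (8,))) for _ in range(3)]
fisher = diagonal_fisher(model, batches, F.cross_entropy)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# stage-two objective: total_loss = task_loss + fisher_penalty(model, old_params, fisher)
```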
Authors:Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Ioannis Patras
Abstract:
Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm, 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only model parameters more sensitive to bias and less sensitive to domain shift are updated. Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.
中文: 本文提出AIM-Fair方法,通过使用上下文感知提示生成的高质量合成数据对偏置模型进行选择性参数微调,在保持模型效用的同时显著提升了算法公平性。
English: This paper introduces AIM-Fair, a method that enhances algorithmic fairness by fine-tuning biased models with high-quality synthetic data generated via context-aware prompts and a selective parameter update scheme, effectively improving fairness without compromising utility.
Authors:Prashant K. Jha
Abstract:
This focused review explores a range of neural operator architectures for approximating solutions to parametric partial differential equations (PDEs), emphasizing high-level concepts and practical implementation strategies. The study covers foundational models such as Deep Operator Networks (DeepONet), Principal Component Analysis-based Neural Networks (PCANet), and Fourier Neural Operators (FNO), providing comparative insights into their core methodologies and performance. These architectures are demonstrated on two classical linear parametric PDEs: the Poisson equation and linear elastic deformation. Beyond forward problem-solving, the review delves into applying neural operators as surrogates in Bayesian inference problems, showcasing their effectiveness in accelerating posterior inference while maintaining accuracy. The paper concludes by discussing current challenges, particularly in controlling prediction accuracy and generalization. It outlines emerging strategies to address these issues, such as residual-based error correction and multi-level training. This review can be seen as a comprehensive guide to implementing neural operators and integrating them into scientific computing workflows.
中文: 本综述探讨了用于求解参数偏微分方程的神经算子架构,比较了DeepONet和FNO等方法在经典方程及贝叶斯推断中的应用,并针对精度和泛化挑战提出了残差修正等多层次解决策略。
English: This review examines neural operator architectures for solving parametric PDEs, comparing methods like DeepONet and FNO on classical equations and their use in Bayesian inference, while addressing challenges in accuracy and generalization with emerging strategies.
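Since DeepONet is one of the review's foundational models, a minimal version is easy to sketch: a branch network encodes the input function sampled at m sensor points, a trunk network encodes the query coordinate, and the operator output is their inner product. Layer sizes below are illustrative.
```python
import torch
import torch.nn as nn

# Minimal DeepONet: G(u)(y) ~ sum_k branch_k(u) * trunk_k(y) + bias.

class DeepONet(nn.Module):
    def __init__(self, m_sensors=100, coord_dim=1, p=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m_sensors, 128), nn.Tanh(),
                                    nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.Tanh(),
                                   nn.Linear(128, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors, y):
        """u_sensors: (B, m) samples of the input function; y: (B, coord_dim)."""
        b = self.branch(u_sensors)
        t = self.trunk(y)
        return (b * t).sum(dim=-1, keepdim=True) + self.bias


net = DeepONet()
u = torch.randn(8, 100)       # e.g. a source term f sampled at 100 points
y = torch.rand(8, 1)          # query locations
print(net(u, y).shape)        # torch.Size([8, 1])
```
Trained on pairs of sampled input functions and solution values, such a network approximates the PDE solution operator rather than a single solution, which is what makes it useful as a surrogate in the Bayesian inference setting the review discusses.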
Authors:Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Abstract:
Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks spurious features (a.k.a. implicit noise). While previous works have explored spurious features in LLMs, they are limited to specific features (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we statistically confirm the presence of spurious features in the RAG paradigm, a robustness problem caused by the sensitivity of LLMs to semantic-agnostic features. Moreover, we provide a comprehensive taxonomy of spurious features and empirically quantify their impact through controlled experiments. Further analysis reveals that not all spurious features are harmful and they can even be beneficial sometimes. Extensive evaluation results across multiple LLMs suggest that spurious features are a widespread and challenging problem in the field of RAG. We release all code and data to facilitate future research at: https://github.com/maybenotime/RAG-SpuriousFeatures.
中文: 本研究通过系统分类和实证评估,揭示了RAG系统中普遍存在的虚假特征问题,并发现这些特征具有既可能有害也可能有益的双重性质。
English: This study identifies spurious features as a widespread robustness issue in RAG systems, revealing their dual nature of being both harmful and beneficial through comprehensive taxonomy and empirical evaluation.
Authors:Raphael Trumpp, Ansgar Schäfftlein, Mirco Theile, Marco Caccamo
Abstract:
As image-based deep reinforcement learning tackles more challenging tasks, increasing model size has become an important factor in improving performance. Recent studies achieved this by focusing on the parameter efficiency of scaled networks, typically using Impala-CNN, a 15-layer ResNet-inspired network, as the image encoder. However, while Impala-CNN evidently outperforms older CNN architectures, potential advancements in network design for deep reinforcement learning-specific image encoders remain largely unexplored. We find that replacing the flattening of output feature maps in Impala-CNN with global average pooling leads to a notable performance improvement. This approach outperforms larger and more complex models in the Procgen Benchmark, particularly in terms of generalization. We call our proposed encoder model Impoola-CNN. A decrease in the network's translation sensitivity may be central to this improvement, as we observe the most significant gains in games without agent-centered observations. Our results demonstrate that network scaling is not just about increasing model size - efficient network design is also an essential factor. We make our code available at https://github.com/raphajaner/impoola.
中文摘要:该研究提出Impoola-CNN这一改进的深度强化学习图像编码器,通过在Impala-CNN中用全局平均池化替代展平操作,以更高效的网络设计而非单纯扩大模型规模,在Procgen基准测试中实现了更优的性能和泛化能力。
English Summary: The study introduces Impoola-CNN, an improved image encoder for deep reinforcement learning that replaces flattening with global average pooling in Impala-CNN, achieving superior performance and generalization on the Procgen Benchmark through more efficient network design rather than simply increasing model size.
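The reported change is architecturally small, so a sketch makes it concrete: keep a convolutional stack but replace the flatten of the final feature map with global average pooling before the linear head. The conv stack below is a simplified stand-in, not the full 15-layer Impala-CNN.
```python
import torch
import torch.nn as nn

# Illustrative encoder: global average pooling instead of flatten.

class GAPEncoder(nn.Module):
    def __init__(self, in_channels=3, channels=(16, 32, 32), out_dim=256):
        super().__init__()
        layers, c_in = [], in_channels
        for c in channels:
            layers += [nn.Conv2d(c_in, c, 3, stride=2, padding=1), nn.ReLU()]
            c_in = c
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Linear(channels[-1], out_dim)   # head size no longer depends
                                                     # on the input resolution

    def forward(self, x):
        h = self.pool(self.conv(x)).flatten(1)       # (B, C) instead of (B, C*H*W)
        return self.fc(h)


enc = GAPEncoder()
print(enc(torch.randn(4, 3, 64, 64)).shape)  # torch.Size([4, 256])
```
Pooling discards the spatial position of features in the final map, which is consistent with the paper's observation that reduced translation sensitivity helps most in games without agent-centered observations.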
Authors:Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng
Abstract:
Linear Sequence Modeling (LSM) like linear attention, state space models and linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparsely activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) Modeling subsystem, which provides a unified framework supporting all instances of LSM. and 2) Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers with its Sequence Parallelism to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE.
中文: 本文提出Linear-MoE系统,通过整合线性序列建模的线性复杂度优势与混合专家的稀疏激活特性,在多种基准测试中实现了高效训练与优异性能。
English: This paper introduces Linear-MoE, a production-level system that integrates Linear Sequence Modeling for linear-complexity sequence processing and Mixture-of-Experts for sparse activation, achieving efficient training and competitive performance across various benchmarks.
Authors:Run He, Di Fang, Yicheng Xu, Yawen Cui, Ming Li, Cen Chen, Ziqian Zeng, Huiping Zhuang
Abstract:
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual-projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate a classifier reconstruction process. This reconstruction exploits the previously stored covariance and prototype of each class, calibrated with the estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, on various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods. Our codes are available at https://github.com/RHe502/ICML25-DPCR.
中文: 提出的双投影偏移估计与分类器重构(DPCR)方法通过双投影估计语义偏移并利用岭回归重构分类器,有效解决了无范例类增量学习中的语义偏移和决策偏差问题,在多个数据集上实现了最优性能。
English: The proposed Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach effectively addresses semantic shift and decision bias in exemplar-free class-incremental learning by estimating shifts through dual projections and reconstructing classifiers using ridge regression, achieving superior performance across multiple datasets.
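To illustrate the closed-form, ridge-regression style of classifier reconstruction mentioned above, the sketch below accumulates second-order feature statistics per task and solves for a linear classifier analytically. The statistics used, the ridge value, and the omission of the shift-calibration step are simplifications; this is not DPCR's exact procedure.
```python
import numpy as np

# Illustrative ridge-regression classifier reconstruction from accumulated
# statistics: G holds feature covariance, C holds feature-label correlations.

def update_statistics(G, C, feats, labels, num_classes):
    """feats: (n, d) features; labels: (n,) int class ids."""
    Y = np.eye(num_classes)[labels]          # one-hot targets
    return G + feats.T @ feats, C + feats.T @ Y


def reconstruct_classifier(G, C, ridge=1.0):
    d = G.shape[0]
    return np.linalg.solve(G + ridge * np.eye(d), C)   # (d, num_classes)


rng = np.random.default_rng(0)
d, num_classes = 32, 5
G, C = np.zeros((d, d)), np.zeros((d, num_classes))
for task_classes in ([0, 1], [2, 3, 4]):                # two incremental tasks
    feats = rng.normal(size=(200, d))
    labels = rng.choice(task_classes, size=200)
    G, C = update_statistics(G, C, feats, labels, num_classes)
W = reconstruct_classifier(G, C)
print(W.shape)  # (32, 5)
```
Because the classifier is rebuilt from statistics of all classes seen so far rather than trained only on the newest data, it avoids the decision bias toward new tasks that the abstract describes.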
Authors:Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, Sim Kuan Goh
Abstract:
Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlooked the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that does not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decomposed task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability. Leveraging this decomposition, we introduced NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrated that NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlighted the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Our project is available at https://ZzzitaoFang.github.io/projects/NeuroMerging/.
Authors:Chengqi Zheng, Haiyan Yin, Jianda Chen, Terence Ng, Yew-Soon Ong, Ivor Tsang
Abstract:
Continual Reinforcement Learning (CRL) is essential for developing agents that can learn, adapt, and accumulate knowledge over time. However, a fundamental challenge persists as agents must strike a delicate balance between plasticity, which enables rapid skill acquisition, and stability, which ensures long-term knowledge retention while preventing catastrophic forgetting. In this paper, we introduce SSDE, a novel structure-based approach that enhances plasticity through a fine-grained allocation strategy with Structured Sparsity and Dormant-guided Exploration. SSDE decomposes the parameter space into forward-transfer (frozen) parameters and task-specific (trainable) parameters. Crucially, these parameters are allocated by an efficient co-allocation scheme under sparse coding, ensuring sufficient trainable capacity for new tasks while promoting efficient forward transfer through frozen parameters. However, structure-based methods often suffer from rigidity due to the accumulation of non-trainable parameters, limiting exploration and adaptability. To address this, we further introduce a sensitivity-guided neuron reactivation mechanism that systematically identifies and resets dormant neurons, which exhibit minimal influence in the sparse policy network during inference. This approach effectively enhances exploration while preserving structural efficiency. Extensive experiments on the CW10-v1 Continual World benchmark demonstrate that SSDE achieves state-of-the-art performance, reaching a success rate of 95%, surpassing prior methods significantly in both plasticity and stability trade-offs (code is available at: https://github.com/chengqiArchy/SSDE).
中文: SSDE是一种新颖的持续强化学习方法,通过结构化稀疏性和休眠神经元激活机制在保持稳定性的同时增强可塑性,在基准测试中达到了95%的最优成功率。
English: SSDE is a novel continual reinforcement learning method that enhances plasticity through structured sparsity and dormant neuron reactivation while maintaining stability, achieving state-of-the-art 95% success rate on benchmarks.
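Dormant-neuron reactivation is easiest to see on a single layer. The sketch below uses the common definition (a unit is dormant when its normalized mean activation falls below a threshold) and resets such units; SSDE's sensitivity-guided variant and its sparse masks are omitted, so treat this as a generic illustration.
```python
import torch
import torch.nn as nn

# Generic dormant-unit detection and reset for one linear layer.

@torch.no_grad()
def reset_dormant_units(layer: nn.Linear, next_layer: nn.Linear,
                        activations: torch.Tensor, tau: float = 0.025):
    """activations: (N, out_features) post-activation outputs of `layer`."""
    score = activations.abs().mean(dim=0)
    score = score / (score.mean() + 1e-8)            # normalized activity
    dormant = score < tau
    if dormant.any():
        # re-initialize incoming weights of dormant units ...
        fresh = torch.empty_like(layer.weight)
        nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
        layer.weight[dormant] = fresh[dormant]
        layer.bias[dormant] = 0.0
        # ... and zero their outgoing weights so the policy output is unchanged
        next_layer.weight[:, dormant] = 0.0
    return int(dormant.sum())


l1, l2 = nn.Linear(64, 128), nn.Linear(128, 10)
acts = torch.relu(l1(torch.randn(512, 64)))
print(reset_dormant_units(l1, l2, acts), "units reset")
```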
Authors:Baris Yilmaz, Erdem Akagündüz, Salih Tileylioglu
Abstract:
This study explores the use of deep learning for predicting the time-averaged shear wave velocity in the top 30 m of the subsurface ($V_{s30}$) at strong motion recording stations in Türkiye. $V_{s30}$ is a key parameter in site characterization and, as a result, for seismic hazard assessment. However, it is often unavailable due to the lack of direct measurements and is therefore estimated using empirical correlations. Such correlations, however, are commonly inadequate in capturing complex, site-specific variability, and this motivates the need for data-driven approaches. In this study, we employ a hybrid deep learning model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture both spatial and temporal dependencies in strong motion records. Furthermore, we explore how using different parts of the signal influences our deep learning model. Our results suggest that the hybrid approach effectively learns complex, nonlinear relationships within seismic signals. We observed that an improved P-wave arrival time model increased the prediction accuracy of $V_{s30}$. We believe the study provides valuable insights into improving $V_{s30}$ predictions using a CNN-LSTM framework, demonstrating its potential for improving site characterization for seismic studies. Our codes are available via this repo: https://github.com/brsylmz23/CNNLSTM_DeepEQ
中文摘要:本研究开发了一种结合卷积神经网络和长短期记忆网络的混合深度学习模型,通过改进P波分析并捕捉复杂信号关系,有效预测了土耳其地震记录中的地下剪切波速度(Vs30)。
English Summary: This study develops a hybrid deep learning model combining CNNs and LSTMs to predict subsurface shear wave velocity (Vs30) from seismic records in Türkiye, demonstrating improved accuracy through enhanced P-wave analysis and capturing complex signal relationships.
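A generic CNN-LSTM regressor of the kind described is sketched below: 1D convolutions extract local waveform features from a three-component record, an LSTM aggregates them over time, and a linear head outputs the scalar prediction. Channel counts and window lengths are illustrative, not the paper's configuration.
```python
import torch
import torch.nn as nn

# Illustrative CNN-LSTM regressor for a scalar site parameter such as Vs30.

class CNNLSTMRegressor(nn.Module):
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (B, 3, T) accelerograms
        h = self.cnn(x).transpose(1, 2)        # (B, T', 64)
        _, (h_n, _) = self.lstm(h)
        return self.head(h_n[-1])              # (B, 1) predicted Vs30


model = CNNLSTMRegressor()
print(model(torch.randn(2, 3, 6000)).shape)    # torch.Size([2, 1])
```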
Authors:Yifan Liu, Yu Fang, Zhouhan Lin
Abstract:
Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters. Code and weights can be found at https://github.com/PussyCat0700/DiVISe.
中文摘要:DiVISe是一种端到端的视频语音合成模型,仅从无声视频直接生成梅尔频谱图,无需声学提示即可在保持说话人特征的同时实现卓越的语音可懂度。
English Summary: DiVISe is an end-to-end video-to-speech model that directly generates Mel-spectrograms from silent video, achieving superior acoustic intelligibility and speaker characteristic preservation without requiring acoustic hints.
Authors:Yunkai Gao, Jiaming Guo, Fan Wu, Rui Zhang
Abstract:
Offline reinforcement learning (RL) aims to optimize a policy using pre-collected datasets to maximize cumulative rewards. However, offline reinforcement learning suffers from challenges due to the distributional shift between the learned and behavior policies, leading to errors when computing Q-values for out-of-distribution (OOD) actions. To mitigate this issue, policy constraint methods aim to constrain the learned policy's distribution with the distribution of the behavior policy or confine action selection within the support of the behavior policy. However, current policy constraint methods tend to exhibit excessive conservatism, hindering the policy from further surpassing the behavior policy's performance. In this work, we present Only Support Constraint (OSC), which is derived from maximizing the total probability of the learned policy within the support of the behavior policy, to address the conservatism of policy constraints. OSC presents a regularization term that only restricts policies to the support without imposing extra constraints on actions within the support. Additionally, to fully harness the performance of the new policy constraints, OSC utilizes a diffusion model to effectively characterize the support of behavior policies. Experimental evaluations across a variety of offline RL benchmarks demonstrate that OSC significantly enhances performance, alleviating the challenges associated with distributional shifts and mitigating the conservatism of policy constraints. Code is available at https://github.com/MoreanP/OSC.
中文摘要:离线强化学习面临分布偏移和策略约束过度保守的问题,本文提出的仅支持约束(OSC)方法通过将策略限制在行为策略支持集内并利用扩散模型,有效缓解了这些问题,在多个基准测试中显著提升了性能。
English Summary: Offline reinforcement learning faces challenges from distributional shifts and excessive conservatism in policy constraints, which the proposed Only Support Constraint (OSC) method addresses by restricting policies to behavior policy support using a diffusion model, significantly improving performance across benchmarks.
Authors:Linqi Ye, Rankun Li, Xiaowen Hu, Jiayi Li, Boyang Xing, Yan Peng, Bin Liang
Abstract:
This paper introduces Unity RL Playground, an open-source reinforcement learning framework built on top of Unity ML-Agents. Unity RL Playground automates the process of training mobile robots to perform various locomotion tasks such as walking, running, and jumping in simulation, with the potential for seamless transfer to real hardware. Key features include one-click training for imported robot models, universal compatibility with diverse robot configurations, multi-mode motion learning capabilities, and extreme performance testing to aid in robot design optimization and morphological evolution. The attached video can be found at https://linqi-ye.github.io/video/iros25.mp4 and the code is coming soon.
Authors:Wenhao Wang, Zijie Yu, Rui Ye, Jianqing Zhang, Siheng Chen, Yanfeng Wang
Abstract:
Mobile agents have attracted tremendous research participation recently. Traditional approaches to mobile agent training rely on centralized data collection, leading to high cost and limited scalability. Distributed training utilizing federated learning offers an alternative by harnessing real-world user data, providing scalability and reducing costs. However, pivotal challenges, including the absence of standardized benchmarks, hinder progress in this field.
To tackle the challenges, we introduce FedMABench, the first benchmark for federated training and evaluation of mobile agents, specifically designed for heterogeneous scenarios. FedMABench features 6 datasets with 30+ subsets, 8 federated algorithms, 10+ base models, and over 800 apps across 5 categories, providing a comprehensive framework for evaluating mobile agents across diverse environments. Through extensive experiments, we uncover several key insights: federated algorithms consistently outperform local training; the distribution of specific apps plays a crucial role in heterogeneity; and, even apps from distinct categories can exhibit correlations during training. FedMABench is publicly available at: https://github.com/wwh0411/FedMABench with the datasets at: https://huggingface.co/datasets/wwh0411/FedMABench.
中文摘要:为解决移动智能体联邦学习领域缺乏标准化基准的问题,FedMABench作为首个综合性评估框架被提出,其通过多组实验证实联邦算法优于本地训练,并揭示了应用分布对异构环境的关键影响及跨类别应用间的训练关联性。
English Summary: To address the lack of standardized benchmarks in federated learning for mobile agents, FedMABench is introduced as the first comprehensive evaluation framework, featuring diverse datasets and algorithms that demonstrate federated training's superiority over local methods while revealing key insights about app distribution and inter-category correlations.
Authors:Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
Abstract:
Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding the SFT setting by ~2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to an instruct model often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
中文摘要:本报告首次在20亿参数的非SFT多模态模型上复现了DeepSeek R1式推理能力的涌现,通过在SAT数据上的强化学习实现了准确率大幅提升,同时揭示了指令模型和简单长度奖励在激发推理能力方面的局限性。
English Summary: This report presents the first replication of DeepSeek R1-style emergent reasoning in a 2B non-SFT multimodal model, achieving significant accuracy gains through reinforcement learning on SAT data while revealing the limitations of instruct models and naive length rewards for eliciting reasoning.
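Rule-based incentives of the kind mentioned above are usually a combination of a format reward and an accuracy reward. The sketch below shows one hypothetical version; the tag names, weights, and exact matching rule are assumptions, not the report's actual reward for the SAT/CVBench setup.
```python
import re

# Hypothetical R1-style rule-based reward: format check plus exact-match accuracy.

def rule_based_reward(response: str, ground_truth: str) -> float:
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                               response, flags=re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    accuracy = float(answer.lower() == ground_truth.strip().lower())
    return 0.5 * format_ok + 1.0 * accuracy


resp = "<think>The object on the left is closer.</think><answer>left</answer>"
print(rule_based_reward(resp, "left"))   # 1.5
```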
Authors:Shibo Feng, Wanjin Feng, Xingyu Gao, Peilin Zhao, Zhiqi Shen
Abstract:
Spiking Neural Networks (SNNs) offer a promising, biologically inspired approach for processing spatiotemporal data, particularly for time series forecasting. However, conventional neuron models like the Leaky Integrate-and-Fire (LIF) struggle to capture long-term dependencies and effectively process multi-scale temporal dynamics. To overcome these limitations, we introduce the Temporal Segment Leaky Integrate-and-Fire (TS-LIF) model, featuring a novel dual-compartment architecture. The dendritic and somatic compartments specialize in capturing distinct frequency components, providing functional heterogeneity that enhances the neuron's ability to process both low- and high-frequency information. Furthermore, the newly introduced direct somatic current injection reduces information loss during intra-neuronal transmission, while dendritic spike generation improves multi-scale information extraction. We provide a theoretical stability analysis of the TS-LIF model and explain how each compartment contributes to distinct frequency response characteristics. Experimental results show that TS-LIF outperforms traditional SNNs in time series forecasting, demonstrating better accuracy and robustness, even with missing data. TS-LIF advances the application of SNNs in time-series forecasting, providing a biologically inspired approach that captures complex temporal dynamics and offers potential for practical implementation in diverse forecasting scenarios. The source code is available at https://github.com/kkking-kk/TS-LIF.
Chinese: TS-LIF模型采用双区室架构,通过树突和胞体分别处理不同频率信息,显著提升了脉冲神经网络处理多尺度时间动态的能力,在时间序列预测中展现出优于传统模型的精度和鲁棒性。
English: The TS-LIF model introduces a dual-compartment architecture that enhances spiking neural networks' ability to process multi-scale temporal dynamics, demonstrating superior accuracy and robustness in time series forecasting compared to traditional models.
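As a rough illustration of the dual-compartment idea described above, here is a toy two-compartment leaky integrate-and-fire update in which a slowly leaking dendrite drives a faster-leaking soma, plus an extra term that injects the input current directly into the soma. The time constants, coupling weights, and reset rule are placeholder assumptions, not the actual TS-LIF equations from the paper or repository.

```python
import numpy as np

def ts_lif_toy(inputs, tau_d=0.9, tau_s=0.8, w_ds=0.5, w_direct=0.3, v_th=1.0):
    """Toy two-compartment LIF: a leaky dendrite feeds a leaky soma,
    with an additional direct somatic current path. Illustrative only."""
    v_d, v_s = 0.0, 0.0
    spikes = []
    for x in inputs:
        v_d = tau_d * v_d + x                          # dendritic compartment (slower leak)
        v_s = tau_s * v_s + w_ds * v_d + w_direct * x  # soma: dendritic drive + direct injection
        s = float(v_s >= v_th)                         # fire when the soma crosses threshold
        v_s -= s * v_th                                # soft reset
        spikes.append(s)
    return np.array(spikes)

print(ts_lif_toy(np.sin(np.linspace(0, 6, 50)) + 0.5))
```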
Authors:Yordan P. Raykov, Hengrui Luo, Justin D. Strait, Wasiur R. KhudaBukhsh
Abstract:
We propose causal effect estimators based on empirical Fréchet means and operator-valued kernels, tailored to functional data spaces. These methods address the challenges of high-dimensionality, sequential ordering, and model complexity while preserving robustness to treatment misspecification. Using structural assumptions, we obtain compact representations of potential outcomes, enabling scalable estimation of causal effects over time and across covariates. We provide both theoretical results, regarding the consistency of functional causal effects, and an empirical comparison of a range of proposed causal effect estimators.
Applications to binary treatment settings with functional outcomes illustrate the framework's utility in biomedical monitoring, where outcomes exhibit complex temporal dynamics. Our estimators accommodate scenarios with registered covariates and outcomes, aligning them to the Fréchet means, as well as cases requiring higher-order representations to capture intricate covariate-outcome interactions. These advancements extend causal inference to dynamic and non-linear domains, offering new tools for understanding complex treatment effects in functional data settings.
中文: 本研究提出基于经验Fréchet均值和算子值核的因果效应估计方法,针对函数型数据的高维性和时序动态特性,在生物医学监测中实现了鲁棒且可扩展的因果推断。
English: This study introduces causal effect estimators using empirical Fréchet means and operator-valued kernels to handle functional data's high dimensionality and temporal dynamics, ensuring robustness and scalability in biomedical applications.
Authors:Zhenghao Peng, Zhizheng Liu, Bolei Zhou
Abstract:
Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to its robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets, carefully crafted reward functions, and face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method PVP4Real that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots -- a legged quadruped and a wheeled delivery robot -- in two mobile robot tasks, one of which even uses raw RGBD images as observations. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at: https://metadriverse.github.io/pvp4real/
Authors:Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, Animesh Garg
Abstract:
Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle with observations outside of the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address those challenges, we propose Adapt3R, a general-purpose 3D observation encoder which synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
Authors:Donghyeok Shin, HeeSun Bae, Gyuwon Sim, Wanmo Kang, Il-Chul Moon
Abstract:
Utilizing a large-scale dataset is essential for training high-performance deep learning models, but it also comes with substantial computation and storage costs. To overcome these challenges, dataset distillation has emerged as a promising solution by compressing the large-scale dataset into a smaller synthetic dataset that retains the essential information needed for training. This paper proposes a novel parameterization framework for dataset distillation, coined Distilling Dataset into Neural Field (DDiF), which leverages the neural field to store the necessary information of the large-scale dataset. Due to the unique nature of the neural field, which takes coordinates as input and outputs the corresponding quantity, DDiF effectively preserves the information and easily generates various shapes of data. We theoretically confirm that DDiF exhibits greater expressiveness than several previous parameterizations when the utilized budget for a single synthetic instance is the same. Through extensive experiments, we demonstrate that DDiF achieves superior performance on several benchmark datasets, extending beyond the image domain to include video, audio, and 3D voxel. We release the code at https://github.com/aailab-kaist/DDiF.
中文: 本文提出DDiF这一新型数据集蒸馏框架,通过神经场将大规模数据集高效压缩为紧凑的合成数据并保留关键信息,在图像、视频、音频和3D体素等多个领域展现出卓越性能。
English: This paper introduces DDiF, a novel dataset distillation framework that uses neural fields to efficiently compress large datasets into compact synthetic forms while preserving essential information, demonstrating superior performance across multiple domains including images, video, audio, and 3D data.
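The neural-field parameterization described above can be sketched with a tiny coordinate MLP that maps (x, y) positions to RGB values, so a synthetic instance is stored as network weights and can be decoded at arbitrary resolution. The layer sizes and activation choices below are illustrative assumptions, not DDiF's actual architecture.

```python
import torch
import torch.nn as nn

class NeuralField(nn.Module):
    """Tiny coordinate MLP: (x, y) in [-1, 1]^2 -> RGB."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):
        return self.net(coords)

def decode(field, h, w):
    """Render the stored instance at an arbitrary resolution."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    return field(coords).reshape(h, w, 3)

field = NeuralField()
print(decode(field, 32, 32).shape)  # torch.Size([32, 32, 3])
```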
Authors:Stephen Chung, Wenyu Du, Jie Fu
Abstract:
Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt
中文:最新研究表明,通过在多轮尝试任务中结合反馈来训练大语言模型,可显著提升其推理准确性和答案优化能力,效果优于单轮训练方法。
English: Recent research demonstrates that training large language models on multi-attempt tasks with feedback significantly enhances their reasoning accuracy and ability to refine responses, outperforming single-turn training methods.
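The multi-attempt setup can be pictured as a simple interaction loop: the model is queried up to k times, and feedback on wrong answers is appended to the context before the next try. The `generate` callable and the feedback wording below are hypothetical placeholders, not the authors' prompt or training harness.

```python
def multi_attempt_rollout(question, answer, generate, max_attempts=2):
    """Run up to `max_attempts` tries, appending feedback after wrong answers.
    `generate(prompt) -> str` is a placeholder for the policy LLM."""
    prompt = question
    for attempt in range(1, max_attempts + 1):
        response = generate(prompt)
        if response.strip() == answer.strip():
            return {"correct": True, "attempts": attempt}
        prompt += (f"\nYour answer '{response}' is incorrect. "
                   "Please reconsider and try again.")
    return {"correct": False, "attempts": max_attempts}

# Toy usage with a fake policy that succeeds on the second try.
fake = iter(["41", "42"])
print(multi_attempt_rollout("What is 6 * 7?", "42", lambda p: next(fake)))
```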
Authors:Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, Marc Lelarge
Abstract:
In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is open source: https://github.com/LLM4Rocq/miniF2F-rocq.
中文: 本研究通过逐步复杂的提示阶段,利用先进的大语言模型成功将488个定理中的478个从MiniF2F翻译为Rocq,并公开了数据集。
English: This study successfully translated 478 out of 488 theorems from MiniF2F to Rocq using advanced LLMs through progressively complex prompting stages, with the dataset made publicly available.
Authors:Hritik Bansal, Pratyush Maini
Abstract:
The rapid advancement in building large language models (LLMs) has intensified competition among big-tech companies and AI startups. In this regard, model evaluations are critical for product and investment-related decision-making. While open evaluation sets like MMLU initially drove progress, concerns around data contamination and data bias have repeatedly called their reliability into question. This has led to the rise of private data curators who have begun conducting hidden evaluations with high-quality self-curated test prompts and their own expert annotators. In this paper, we argue that despite potential advantages in addressing contamination issues, private evaluations introduce inadvertent financial and evaluation risks. In particular, the key concerns include the potential conflict of interest arising from private data curators' business relationships with their clients (leading LLM firms). In addition, we highlight that the subjective preferences of private expert annotators will lead to inherent evaluation bias towards the models trained with the private curators' data. Overall, this paper lays the foundation for studying the risks of private evaluations that can lead to wide-ranging community discussions and policy changes.
Authors:Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Zhenyu Ding, Haoying Wang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Abstract:
The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well established, yet their effective deployment necessitates careful hyperparameter optimization. Although existing methods have explored the influence of hyperparameters on model performance, a principled and generalizable framework across model architectures and data recipes remains absent. In this study, we conduct an unprecedented empirical investigation, training over 3,700 LLMs from scratch across 100 trillion tokens and consuming nearly one million NVIDIA H800 GPU hours, to establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law. We empirically observe that, under fixed model size ($N$) and dataset size ($D$), the hyperparameter landscape exhibits convexity with a broad optimum, substantially reducing the complexity of hyperparameter search. Building on this insight, we formally define and empirically validate the Step Law: The optimal learning rate follows a power-law relationship with $N$ and $D$, while the optimal batch size is primarily influenced by $D$ and remains largely invariant to $N$. Notably, our estimated optima deviate from the global best performance found via exhaustive search by merely 0.094\% on the test set. To the best of our knowledge, Step Law is the first to unify different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and to establish optimal hyperparameter scaling laws across diverse data recipes. We contribute a universal, plug-and-play optimal hyperparameter tool for the community, which is expected to advance efficient LLM training at scale. All experimental code, data and checkpoints are publicly available at https://github.com/step-law/steplaw
中文摘要:本研究提出了Step Law这一适用于大语言模型预训练的超参数优化通用缩放定律,通过揭示超参数空间的凸性特征及学习率与模型/数据规模间的幂律关系,大幅简化了超参数搜索复杂度,其估计最优值与全局最优解的测试集差异仅为0.094%。
English Summary: This study introduces Step Law, a universal scaling law for hyperparameter optimization in LLM pre-training, which simplifies hyperparameter search by revealing convex landscapes and power-law relationships between optimal learning rates and model/data sizes, achieving near-optimal performance with minimal deviation.
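To make the shape of such a rule concrete, the sketch below applies power-law formulas of the form reported in the abstract (learning rate depending on both N and D, batch size depending mainly on D). The coefficients and exponents are placeholders only; the fitted Step Law constants are given in the paper and its released tool.

```python
def step_law_style_optima(N, D, c_lr=1.0, a=-0.7, b=0.3, c_bs=1.0, e=0.55):
    """Power-law-shaped rules: lr* ~ c_lr * N^a * D^b, bs* ~ c_bs * D^e.
    All constants here are placeholders, NOT the fitted Step Law values."""
    lr_opt = c_lr * (N ** a) * (D ** b)
    bs_opt = c_bs * (D ** e)
    return lr_opt, bs_opt

# Example: a 1B-parameter model trained on 100B tokens (placeholder constants).
lr, bs = step_law_style_optima(N=1e9, D=1e11)
print(f"lr* ~ {lr:.2e}, batch-size* ~ {bs:.2e} (illustrative numbers only)")
```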
Authors:Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Abstract:
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .
中文: 我们推出ParaSpeechCaps大规模语音风格标注数据集,结合人工标注与自动扩展数据,显著提升了语音合成模型的风格一致性和自然度表现。
English: We introduce ParaSpeechCaps, a large-scale dataset with rich style annotations for speech, combining human-labeled and automatically scaled data to significantly improve text-to-speech model performance in style consistency and naturalness.
Authors:Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Abstract:
This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.
中文摘要:本文提出InfoMTL多任务学习框架,通过共享信息最大化和任务特定信息最小化原则,提取噪声不变的充分表征,增强预训练语言模型在多任务场景下的理解能力,在多个基准测试中优于现有方法。
English Summary: The paper introduces InfoMTL, a multi-task learning framework that enhances language models by extracting noise-invariant sufficient representations through shared information maximization and task-specific information minimization, outperforming existing methods in various scenarios.
Authors:Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun, Kede Ma, Ying Wei
Abstract:
The advent of the foundation model era has sparked significant research interest in leveraging pre-trained representations for continual learning (CL), yielding a series of top-performing CL methods on standard evaluation benchmarks. Nonetheless, there are growing concerns regarding potential data contamination during the pre-training stage. Furthermore, standard evaluation benchmarks, which are typically static, fail to capture the complexities of real-world CL scenarios, resulting in saturated performance. To address these issues, we describe CL on dynamic benchmarks (CLDyB), a general computational framework based on Markov decision processes for evaluating CL methods reliably. CLDyB dynamically identifies inherently difficult and algorithm-dependent tasks for the given CL methods, and determines challenging task orders using Monte Carlo tree search. Leveraging CLDyB, we first conduct a joint evaluation of multiple state-of-the-art CL methods, leading to a set of commonly challenging and generalizable task sequences where existing CL methods tend to perform poorly. We then conduct separate evaluations of individual CL methods using CLDyB, discovering their respective strengths and weaknesses. The source code and generated task sequences are publicly accessible at https://github.com/szc12153/CLDyB.
中文: 本研究提出CLDyB动态基准框架,通过马尔可夫决策过程和蒙特卡洛树搜索动态识别困难任务序列,可靠评估持续学习方法,揭示其局限性和优势。
English: The study introduces CLDyB, a dynamic benchmark framework using Markov decision processes and Monte Carlo tree search to reliably evaluate continual learning methods by identifying challenging task sequences, revealing their limitations and strengths.
Authors:Jiang Li, Xiaoping Wang
Abstract:
Protein-protein interaction (PPI) prediction is an instrumental means in elucidating the mechanisms underlying cellular operations, holding significant practical implications for the realms of pharmaceutical development and clinical treatment. Presently, the majority of research methods primarily concentrate on the analysis of amino acid sequences, while investigations predicated on protein structures remain in the nascent stages of exploration. Despite the emergence of several structure-based algorithms in recent years, these are still confronted with inherent challenges: (1) the extraction of intrinsic structural information of proteins typically necessitates the expenditure of substantial computational resources; (2) these models are overly reliant on seen protein data, struggling to effectively unearth interaction cues between unknown proteins. To further propel advancements in this domain, this paper introduces a novel PPI prediction method jointing masked reconstruction and contrastive learning, termed JmcPPI. This methodology dissects the PPI prediction task into two distinct phases: during the residue structure encoding phase, JmcPPI devises two feature reconstruction tasks and employs graph attention mechanism to capture structural information between residues; during the protein interaction inference phase, JmcPPI perturbs the original PPI graph and employs a multi-graph contrastive learning strategy to thoroughly mine extrinsic interaction information of novel proteins. Extensive experiments conducted on three widely utilized PPI datasets demonstrate that JmcPPI surpasses existing optimal baseline models across various data partition schemes. The associated code can be accessed via https://github.com/lijfrank-open/JmcPPI.
中文: 本文提出了一种结合掩码重建与对比学习的蛋白质相互作用预测方法JmcPPI,通过分阶段处理结构编码和相互作用推断,在多个数据集上展现出优于现有模型的性能表现。
English: This paper introduces JmcPPI, a novel protein-protein interaction prediction method combining masked reconstruction and contrastive learning, which effectively captures structural features and interaction patterns while demonstrating superior performance over existing models across multiple datasets.
Authors:Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
Abstract:
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this paper, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, the survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
中文摘要:本综述首次提出多模态生成模型的统一框架,系统梳理从二维到四维生成的演进路径,旨在克服现有方法在模态整合方面的局限,推动人工智能通用智能中的真实世界模拟研究。
English Summary: This survey presents a unified framework for multimodal generative models, systematically reviewing the progression from 2D to 4D generation to advance real-world simulation in AGI research while addressing current limitations in modality integration.
Authors:Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
Abstract:
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
中文:LanDiff是一种结合自回归语言模型与扩散模型优势的混合文本到视频框架,通过从粗到细的生成方式克服了两者固有缺陷,在标准及长视频生成基准测试中均实现了最先进的性能表现。
English: LanDiff is a hybrid text-to-video framework that combines autoregressive language models and diffusion models to overcome their individual limitations, achieving state-of-the-art performance in both standard and long video generation benchmarks.
Authors:Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Abstract:
Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the position of layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence demonstrating that HybridNorm improves gradient flow and model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
中文: 本文提出HybridNorm混合归一化策略,通过结合Pre-Norm和Post-Norm的优势来改善深度Transformer的梯度流动和鲁棒性,在多个基准测试中持续优于现有方法。
English: The paper introduces HybridNorm, a hybrid normalization strategy that combines Pre-Norm and Post-Norm to enhance gradient flow and robustness in deep transformers, consistently outperforming existing methods across benchmarks.
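A minimal PyTorch sketch of a block in the spirit of HybridNorm is shown below: LayerNorm is applied to the attention's Q/K/V inputs instead of a Pre-Norm on the block input, and the feed-forward sub-layer uses Post-Norm. Head counts, the exact placement of the QKV normalization, and the choice of LayerNorm over RMSNorm are simplifying assumptions, not the paper's precise design.

```python
import torch
import torch.nn as nn

class HybridNormBlock(nn.Module):
    """Illustrative block: QKV-normalized attention + Post-Norm FFN."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.v_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.post_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention path: normalize the Q/K/V inputs rather than the block input.
        q, k, v = self.q_norm(x), self.k_norm(x), self.v_norm(x)
        attn_out, _ = self.attn(q, k, v, need_weights=False)
        x = x + attn_out
        # FFN path: Post-Norm (normalize after the residual addition).
        x = self.post_norm(x + self.ffn(x))
        return x

x = torch.randn(2, 16, 256)
print(HybridNormBlock()(x).shape)  # torch.Size([2, 16, 256])
```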
Authors:Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
Abstract:
While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) which offers greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Code: https://github.com/dvruette/gidd/
中文: 本文提出了一种新的通用插值离散扩散(GIDD)模型系列,通过灵活的噪声处理设计克服了现有方法的修订限制,实现了最先进的性能并具备了自我纠错能力。
English: This paper introduces a new family of general interpolating discrete diffusion (GIDD) models that overcome the revision limitations of existing approaches by offering flexible noising processes, achieving state-of-the-art performance and enabling self-correction capabilities.
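One way to picture a noising process that interpolates between masking and uniform noise is the toy forward step below, which keeps, masks, or uniformly resamples each token according to a simple schedule. The schedule and mixing weights are illustrative assumptions and do not reproduce GIDD's actual parameterization or ELBO.

```python
import numpy as np

def noise_tokens(x0, t, vocab_size, mask_id, rng=None):
    """Corrupt tokens with a mixture of keep / mask / uniform-resample noise.
    The linear schedule and the 0.7/0.3 split are illustrative choices only."""
    rng = rng or np.random.default_rng(0)
    keep_p = 1.0 - t          # probability of keeping the clean token
    mask_p = 0.7 * t          # probability of replacing it with [MASK]
    u = rng.random(len(x0))
    xt = np.array(x0)
    xt[u >= keep_p] = mask_id                        # mask or uniform branch
    uniform = u >= keep_p + mask_p                   # uniform branch only
    xt[uniform] = rng.integers(0, vocab_size, uniform.sum())
    return xt

x0 = np.array([5, 17, 3, 99, 42, 8])
print(noise_tokens(x0, t=0.5, vocab_size=100, mask_id=100))
```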
Authors:Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, Shiguo Lian
Abstract:
Recent advancements in slow thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, these models often exhibit overthinking (generating redundant reasoning steps for simple problems), leading to excessive computational resource usage. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks that require extended reasoning. This paper introduces Difficulty-Adaptive Slow Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought (CoT) based on problem difficulty. We first propose a Token Length Budget (TLB) metric to quantify difficulty, then leverage budget-aware reward shaping and budget preference optimization to implement DAST. DAST penalizes overlong responses for simple tasks while incentivizing sufficient reasoning for complex problems. Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking (reducing token usage by over 30\% on average) while preserving reasoning accuracy on complex problems. Our codes and models are available at https://github.com/AnonymousUser0520/AnonymousRepo01.
中文摘要:难度自适应慢思考(DAST)框架通过基于问题难度自主调整推理链长度,在保持复杂问题求解精度的同时,有效减少超过30%的冗余推理消耗。
English Summary: The Difficulty-Adaptive Slow Thinking (DAST) framework enables models to autonomously adjust reasoning length based on problem difficulty, effectively reducing overthinking by over 30% in token usage while maintaining accuracy on complex tasks.
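The budget-aware reward shaping can be illustrated with a toy function that rewards correctness and penalizes tokens beyond a difficulty-dependent budget. The budget function and penalty coefficient below are hypothetical placeholders, not the paper's Token Length Budget metric or its reward design.

```python
def budget_aware_reward(correct, n_tokens, difficulty, base_budget=256,
                        per_difficulty=512, overlength_penalty=0.001):
    """Toy reward: +1/-1 for correctness, minus a penalty for exceeding a
    difficulty-scaled token budget. All constants are illustrative."""
    budget = base_budget + per_difficulty * difficulty   # easy -> small budget
    reward = 1.0 if correct else -1.0
    reward -= overlength_penalty * max(0, n_tokens - budget)
    return reward

print(budget_aware_reward(correct=True, n_tokens=900, difficulty=0.2))  # heavily penalized
print(budget_aware_reward(correct=True, n_tokens=900, difficulty=0.9))  # smaller penalty
```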
Authors:Antonio Guillén-Teruel, Marcos Caracena, Jose A. Pardo, Fernando de-la-Gándara, José Palma, Juan A. Botía
Abstract:
This research addresses the challenges of handling unbalanced datasets for binary classification tasks. In such scenarios, standard evaluation metrics are often biased by the disproportionate representation of the minority class. Conducting experiments across seven datasets, we uncovered inconsistencies in evaluation metrics when determining the model that outperforms others for each binary classification problem. This justifies the need for a metric that provides a more consistent and unbiased evaluation across unbalanced datasets, thereby supporting robust model selection. To mitigate this problem, we propose a novel metric, the Unbiased Integration Coefficients (UIC), which exhibits significantly reduced bias ($p < 10^{-4}$) towards the minority class compared to conventional metrics. The UIC is constructed by aggregating existing metrics while penalising those more prone to imbalance. In addition, we introduce the Identical Partitions for Imbalance Problems (IPIP) algorithm for imbalanced ML problems, an ensemble-based approach. Our experimental results show that IPIP outperforms other baseline imbalance-aware approaches using Random Forest and Logistic Regression models in three out of seven datasets as assessed by the UIC metric, demonstrating its effectiveness in addressing imbalanced data challenges in binary classification tasks. This new framework for dealing with imbalanced datasets is materialized in the FILM (Framework for Imbalanced Learning Machines) R Package, accessible at https://github.com/antoniogt/FILM.
Chinese: 本研究针对不平衡二元分类中的评估偏差问题,提出了无偏积分系数(UIC)指标和针对不平衡问题的同质分区(IPIP)算法,通过实验验证了其有效性,并在FILM R软件包中实现了该框架。
English: This study introduces the Unbiased Integration Coefficients (UIC) metric and the Identical Partitions for Imbalance Problems (IPIP) algorithm to address evaluation inconsistencies in imbalanced binary classification, demonstrating their effectiveness through experiments and implementing them in the FILM R package.
Authors:Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Abstract:
Offline optimization has recently emerged as an increasingly popular approach to mitigate the prohibitively expensive cost of online experimentation. The key idea is to learn a surrogate of the black-box function that underlies the target experiment using a static (offline) dataset of its previous input-output queries. Such an approach is, however, fraught with an out-of-distribution issue where the learned surrogate becomes inaccurate outside the offline data regimes. To mitigate this, existing offline optimizers have proposed numerous conditioning techniques to prevent the learned surrogate from being too erratic. Nonetheless, such conditioning strategies are often specific to particular surrogate or search models, which might not generalize to a different model choice. This motivates us to develop a model-agnostic approach instead, which incorporates a notion of model sharpness into the training loss of the surrogate as a regularizer. Our approach is supported by a new theoretical analysis demonstrating that reducing surrogate sharpness on the offline dataset provably reduces its generalized sharpness on unseen data. Our analysis extends existing theories from bounding generalized prediction loss (on unseen data) with loss sharpness to bounding the worst-case generalized surrogate sharpness with its empirical estimate on training data, providing a new perspective on sharpness regularization. Our extensive experimentation on a diverse range of optimization tasks also shows that reducing surrogate sharpness often leads to significant improvement, yielding up to a noticeable 9.6% performance boost. Our code is publicly available at https://github.com/cuong-dm/IGNITE
中文: 离线优化通过静态数据学习代理模型以降低在线实验成本,但存在分布外不准确性问题,我们提出的模型无关锐度正则化方法有效缓解了此问题,理论支持且性能提升高达9.6%。
English: Offline optimization reduces online experimentation costs by learning a surrogate model from static data, but it faces out-of-distribution inaccuracies, which our model-agnostic sharpness regularization method effectively mitigates, supported by theory and up to 9.6% performance gains.
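A generic way to penalize sharpness is a SAM-style two-pass update: perturb the weights toward the local loss ascent direction, recompute the gradient there, and apply it at the original weights. The sketch below is this standard recipe applied to an arbitrary surrogate model, not the paper's exact regularizer or its theoretical estimator.

```python
import torch
import torch.nn.functional as F

def sharpness_aware_step(model, loss_fn, x, y, opt, rho=0.05):
    """One SAM-style step: ascend to a nearby 'sharp' point, then descend
    using the gradient evaluated there. Generic sketch, not IGNITE itself."""
    # First pass: gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in model.parameters() if p.grad is not None))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                    # move to the perturbed point
            eps.append(e)
    opt.zero_grad()
    # Second pass: gradient at the perturbed (worst-case) weights.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                # move back before updating
    opt.step()
    opt.zero_grad()
    return loss.item()

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
print(sharpness_aware_step(model, F.mse_loss, x, y, opt))
```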
Authors:Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu
Abstract:
Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually causes MVL methods designed for specific combinations of views to lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in multi-view unsupervised clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate RML's effectiveness. Code is available at https://github.com/SubmissionsIn/RML.
中文:提出的鲁棒多视角学习方法RML通过基于Transformer的融合网络整合异构视角,并利用模拟扰动对比学习增强表征鲁棒性,在多种无监督和噪声标签任务中验证了其有效性。
English: The proposed robust multi-view learning method, RML, integrates heterogeneous views through a transformer-based fusion network and enhances representation robustness via simulated perturbation contrastive learning, demonstrating effectiveness in various unsupervised and noisy-label tasks.
Authors:Amin Karimi, Charalambos Poullis
Abstract:
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-$5^{i}$ and COCO-$20^{i}$ show that our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at https://github.com/aminpdik/DSV-LFS
中文: 本文提出DSV-LFS框架,通过大语言模型生成语义提示并结合密集像素匹配的视觉提示,显著提升了小样本语义分割在新类别上的泛化能力和性能,在基准测试中达到最优效果。
English: This paper introduces DSV-LFS, a novel few-shot semantic segmentation framework that leverages large language models and dense pixel-wise matching to generate semantic and visual prompts, achieving state-of-the-art performance and superior generalization on benchmark datasets.
Authors:Sungwon Kim, Yoonho Lee, Yunhak Oh, Namkyeong Lee, Sukwon Yun, Junseok Lee, Sein Kim, Carl Yang, Chanyoung Park
Abstract:
Federated Learning (FL) on graphs enables collaborative model training to enhance performance without compromising the privacy of each client. However, existing methods often overlook the mutable nature of graph data, which frequently introduces new nodes and leads to shifts in label distribution. Since they focus solely on performing well on each client's local data, they are prone to overfitting to their local distributions (i.e., local overfitting), which hinders their ability to generalize to unseen data with diverse label distributions. In contrast, our proposed method, FedLoG, effectively tackles this issue by mitigating local overfitting. Our model generates global synthetic data by condensing the reliable information from each class representation and its structural information across clients. Using these synthetic data as a training set, we alleviate the local overfitting problem by adaptively generalizing the absent knowledge within each local dataset. This enhances the generalization capabilities of local models, enabling them to handle unseen data effectively. Our model outperforms baselines in our proposed experimental settings, which are designed to measure generalization power to unseen data in practical scenarios. Our code is available at https://github.com/sung-won-kim/FedLoG
Chinese: FedLoG通过整合各客户端类别表示和结构信息生成全局合成数据,有效缓解联邦图学习中的局部过拟合问题,从而提升模型对未见数据及多样化标签分布的泛化能力。
English: FedLoG addresses local overfitting in federated graph learning by generating global synthetic data from class representations and structural information across clients, thereby enhancing model generalization to unseen data with diverse label distributions.
Authors:Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to enhance both retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms generic-domain MLLMs by a large margin in the diagnosis of retinal diseases on 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT
中文: 多模态大语言模型在通用领域表现出色,但在医学领域,尤其是视网膜图像解读方面存在不足;为此开发的RetinalGPT通过定量分析、疾病诊断和病灶定位,显著提升了临床应用的准确性和可解释性。
English: Multimodal Large Language Models (MLLMs) are advancing but lack specialized capabilities for precise medical diagnostics, particularly in interpreting retinal images, leading to the development of RetinalGPT, which excels in quantitative analysis, disease diagnosis, and lesion localization for enhanced clinical applications.
Authors:Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram, Zachary W. Ulissi
Abstract:
Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems -- such as molecules and materials -- the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representation of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on MP20, QM9 and GEOM-DRUGS datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, obtaining state-of-the-art results on par with molecule- and crystal-specific models. ADiT uses standard Transformers with minimal inductive biases for both the autoencoder and diffusion model, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code: https://github.com/facebookresearch/all-atom-diffusion-transformer
中文:ADiT是一种统一的潜在扩散框架,能够使用同一模型联合生成分子和材料,通过可扩展的Transformer架构实现了最先进的性能。
English: ADiT is a unified latent diffusion framework that jointly generates both molecules and materials using the same model, achieving state-of-the-art performance with scalable transformer architecture.
Authors:Abdullah Mamun, Asiful Arefeen, Susan B. Racette, Dorothy D. Sears, Corrie M. Whisner, Matthew P. Buman, Hassan Ghasemzadeh
Abstract:
Postprandial hyperglycemia, marked by the blood glucose level exceeding the normal range after consuming a meal, is a critical indicator of progression toward type 2 diabetes in people with prediabetes and in healthy individuals. A key metric for understanding blood glucose dynamics after eating is the postprandial area under the curve (AUC). Predicting postprandial AUC in advance based on a person's lifestyle factors, such as diet and physical activity level, and explaining the factors that affect postprandial blood glucose could allow an individual to adjust their lifestyle accordingly to maintain normal glucose levels. In this study, we developed an explainable machine learning solution, GlucoLens, that takes sensor-driven inputs and uses advanced data processing, large language models, and trainable machine learning models to predict postprandial AUC and hyperglycemia from diet, physical activity, and recent glucose patterns. We used data obtained from wearables in a five-week clinical trial of 10 adults who worked full-time to develop and evaluate the proposed computational model that integrates wearable sensing, multimodal data, and machine learning. Our machine learning model takes multimodal data from wearable activity and glucose monitoring sensors, along with food and work logs, and provides an interpretable prediction of the postprandial glucose pattern. Our GlucoLens system achieves a normalized root mean squared error (NRMSE) of 0.123 in its best configuration. On average, the proposed technology provides a 16% better performance level compared to the comparison models. Additionally, our technique predicts hyperglycemia with an accuracy of 73.3% and an F1 score of 0.716 and recommends different treatment options to help avoid hyperglycemia through diverse counterfactual explanations. Code available: https://github.com/ab9mamun/GlucoLens.
中文: 本研究开发了可解释的机器学习系统GlucoLens,通过可穿戴传感器数据和生活方式输入预测餐后血糖水平及高血糖,在提升预测精度的同时提供可操作建议以帮助维持正常血糖水平。
English: This study introduces GlucoLens, an explainable machine learning system that predicts postprandial blood glucose levels and hyperglycemia using wearable sensor data and lifestyle inputs, achieving improved accuracy and providing actionable recommendations to help maintain normal glucose levels.
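Postprandial AUC itself is conventionally computed with the trapezoidal rule over the post-meal glucose curve; the helper below shows both the total and the incremental (above-baseline) variants. Which variant and time window GlucoLens predicts is an assumption here, not stated in the abstract.

```python
def postprandial_auc(times_min, glucose_mgdl, baseline=None):
    """Trapezoidal AUC of the post-meal glucose curve.
    If `baseline` is given, compute incremental AUC above that baseline."""
    if baseline is not None:
        glucose_mgdl = [max(g - baseline, 0.0) for g in glucose_mgdl]
    auc = 0.0
    for (t0, g0), (t1, g1) in zip(zip(times_min, glucose_mgdl),
                                  zip(times_min[1:], glucose_mgdl[1:])):
        auc += 0.5 * (g0 + g1) * (t1 - t0)  # area of one trapezoid
    return auc  # units: mg/dL * min

t = [0, 30, 60, 90, 120]
g = [95, 160, 185, 150, 110]
print(postprandial_auc(t, g), postprandial_auc(t, g, baseline=95))
```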
Authors:Shuhui Zhu, Baoxiang Wang, Sriram Ganapathi Subramanian, Pascal Poupart
Abstract:
The partial alignment and conflict of autonomous agents lead to mixed-motive scenarios in many real-world applications. However, agents may fail to cooperate in practice even when cooperation yields a better outcome. One well known reason for this failure comes from non-credible commitments. To facilitate commitments among agents for better cooperation, we define Markov Commitment Games (MCGs), a variant of commitment games, where agents can voluntarily commit to their proposed future plans. Based on MCGs, we propose a learnable commitment protocol via policy gradients. We further propose incentive-compatible learning to accelerate convergence to equilibria with better social welfare. Experimental results in challenging mixed-motive tasks demonstrate faster empirical convergence and higher returns for our method compared with its counterparts. Our code is available at https://github.com/shuhui-zhu/DCL.
中文: 本研究提出了马尔可夫承诺博弈和可学习的承诺协议,通过可信承诺促进自主智能体间的合作,在混合动机任务中实现了更快的收敛速度和更高的回报。
English: This study introduces Markov Commitment Games and a learnable commitment protocol to enhance cooperation among autonomous agents by enabling credible commitments, achieving faster convergence and higher returns in mixed-motive scenarios.
Authors:Fenglin Liu, Jinge Wu, Hongjian Zhou, Xiao Gu, Soheila Molaei, Anshul Thakur, Lei Clifton, Honghan Wu, David A. Clifton
Abstract:
The application of Large Language Models (LLMs) to various clinical applications has attracted growing research attention. However, real-world clinical decision-making differs significantly from the standardized, exam-style scenarios commonly used in current efforts. In this paper, we present the RiskAgent system to perform a broad range of medical risk predictions, covering over 387 risk scenarios across diverse complex diseases, e.g., cardiovascular disease and cancer. RiskAgent is designed to collaborate with hundreds of clinical decision tools, i.e., risk calculators and scoring systems that are supported by evidence-based medicine. To evaluate our method, we have built the first benchmark MedRisk specialized for risk prediction, including 12,352 questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. The results show that our RiskAgent, with 8 billion model parameters, achieves 76.33% accuracy, outperforming the most recent commercial LLMs, o1, o3-mini, and GPT-4.5, and doubling the 38.39% accuracy of GPT-4o. On rare diseases, e.g., Idiopathic Pulmonary Fibrosis (IPF), RiskAgent outperforms o1 and GPT-4.5 by 27.27% and 45.46% accuracy, respectively. Finally, we further conduct a generalization evaluation on an external evidence-based diagnosis benchmark and show that our RiskAgent achieves the best results. These encouraging results demonstrate the great potential of our solution for diverse diagnosis domains. To improve the adaptability of our model in different scenarios, we have built and open-sourced a family of models ranging from 1 billion to 70 billion parameters. Our code, data, and models are all available at https://github.com/AI-in-Health/RiskAgent.
中文: RiskAgent系统在医疗风险预测中展现出卓越性能,其准确率达76.33%,显著超越主流商业大语言模型,并通过开源不同参数规模的模型系列提升医疗场景的适应性。
English: The RiskAgent system demonstrates superior performance in medical risk prediction, achieving 76.33% accuracy across diverse clinical scenarios and significantly outperforming leading commercial LLMs, with its open-source models enhancing adaptability for various healthcare applications.
Authors:Cristian Jimenez-Romero, Alper Yegenoglu, Christian Blum
Abstract:
This work examines the integration of large language models (LLMs) into multi-agent simulations by replacing the hard-coded programs of agents with LLM-driven prompts. The proposed approach is showcased in the context of two examples of complex systems from the field of swarm intelligence: ant colony foraging and bird flocking. Central to this study is a toolchain that integrates LLMs with the NetLogo simulation platform, leveraging its Python extension to enable communication with GPT-4o via the OpenAI API. This toolchain facilitates prompt-driven behavior generation, allowing agents to respond adaptively to environmental data. For both example applications mentioned above, we employ both structured, rule-based prompts and autonomous, knowledge-driven prompts. Our work demonstrates how this toolchain enables LLMs to study self-organizing processes and induce emergent behaviors within multi-agent environments, paving the way for new approaches to exploring intelligent systems and modeling swarm intelligence inspired by natural phenomena. We provide the code, including simulation files and data at https://github.com/crjimene/swarm_gpt.
中文: 本研究通过将传统智能体程序替换为LLM驱动提示,将大语言模型集成到多智能体模拟中,以蚁群觅食和鸟群聚集为例,利用NetLogo-GPT-4o工具链生成适应性行为并诱导涌现智能。
English: This study integrates large language models into multi-agent simulations by replacing traditional agent programming with LLM-driven prompts, demonstrated through ant foraging and bird flocking examples using a NetLogo-GPT-4o toolchain to generate adaptive behaviors and emergent intelligence.
Authors:Jianqi Yan, Alex P. Leung, Zhiyuan Pei, David C. Y. Hui, Sangin Kim
Abstract:
This work introduces a novel deep learning-based approach for gravitational wave anomaly detection, aiming to overcome the limitations of traditional matched filtering techniques in identifying unknown waveform gravitational wave signals. We introduce a modified convolutional neural network architecture inspired by ResNet that leverages residual blocks to extract high-dimensional features, effectively capturing subtle differences between background noise and gravitational wave signals. This network architecture learns a high-dimensional projection while preserving discrepancies with the original input, facilitating precise identification of gravitational wave signals. In our experiments, we implement an innovative data augmentation strategy that generates new data by computing the arithmetic mean of multiple signal samples while retaining the key features of the original signals.
In the NSF HDR A3D3: Detecting Anomalous Gravitational Wave Signals competition, we (group name: easonyan123) were honored to finish in first place, with our model achieving a true negative rate (TNR) of 0.9708 during the development/validation phase and 0.9832 on an unseen challenge dataset during the final/testing phase, the highest among all competitors. These results demonstrate that our method not only achieves excellent generalization performance but also maintains robust adaptability in addressing the complex uncertainties inherent in gravitational wave anomaly detection.
中文摘要:本研究提出了一种基于改进残差网络的新型深度学习模型,通过高维特征提取和创新的数据增强策略,在国际引力波异常检测竞赛中以98.32%的准确率获得最高性能,展现了卓越的泛化能力和适应性。
English Summary: This study presents a novel ResNet-inspired deep learning model that effectively identifies gravitational wave anomalies through advanced feature extraction and data augmentation, achieving record-breaking 98.32% detection accuracy in an international competition.
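The averaging-based augmentation described in the abstract can be sketched directly: new training samples are arithmetic means of a few randomly chosen signals of the same class. The sample counts and array shapes below are illustrative only.

```python
import numpy as np

def mean_augment(signals, n_new, k=2, rng=None):
    """Create new samples as arithmetic means of k randomly chosen
    same-class signals. `signals` has shape (n_samples, length)."""
    rng = rng or np.random.default_rng(0)
    idx = rng.integers(0, len(signals), size=(n_new, k))
    return signals[idx].mean(axis=1)

signals = np.random.default_rng(1).standard_normal((100, 4096))
augmented = mean_augment(signals, n_new=32, k=3)
print(augmented.shape)  # (32, 4096)
```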
Authors:Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Yuuki Yamanaka, Tomoya Yamashita
Abstract:
Diffusion models are powerful generative models but often generate sensitive data that are unwanted by users, mainly because the unlabeled training data frequently contain such sensitive data. Since labeling all sensitive data in the large-scale unlabeled training data is impractical, we address this problem by using a small amount of labeled sensitive data. In this paper, we propose positive-unlabeled diffusion models, which prevent the generation of sensitive data using unlabeled and sensitive data. Our approach can approximate the evidence lower bound (ELBO) for normal (negative) data using only unlabeled and sensitive (positive) data. Therefore, even without labeled normal data, we can maximize the ELBO for normal data and minimize it for labeled sensitive data, ensuring the generation of only normal data. Through experiments across various datasets and settings, we demonstrated that our approach can prevent the generation of sensitive images without compromising image quality.
中文: 本文提出的正未标记扩散模型通过少量标记敏感数据和未标记数据,有效防止生成敏感内容,在保持图像质量的同时确保仅生成正常数据,并在多种数据集上验证了其有效性。
English: The proposed positive-unlabeled diffusion model prevents the generation of sensitive data by utilizing minimal labeled sensitive data alongside unlabeled data, effectively maximizing the evidence lower bound for normal data while maintaining image quality across diverse datasets.
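The key identity behind positive-unlabeled estimation is that the loss on unlabeled data mixes the normal and sensitive losses, so the normal-data term can be recovered when the sensitive-class prior π is known. The sketch below applies this standard PU decomposition to per-batch diffusion losses; the clamping and the way the sensitive term enters the final objective are simplifications, not the paper's exact ELBO.

```python
import torch

def pu_normal_loss(loss_unlabeled, loss_positive, prior_pi):
    """Estimate the loss on normal (negative) data from unlabeled and
    positive (sensitive) batches:  L_u ~ (1-pi) L_n + pi L_p  =>
    L_n ~ (L_u - pi L_p) / (1 - pi).  Clamped at 0 to keep the estimate sane."""
    est = (loss_unlabeled - prior_pi * loss_positive) / (1.0 - prior_pi)
    return torch.clamp(est, min=0.0)

def training_objective(loss_unlabeled, loss_positive, prior_pi, lam=1.0):
    """Minimize the estimated normal-data loss while pushing up the
    sensitive-data loss (i.e., minimizing its ELBO). Illustrative only."""
    return pu_normal_loss(loss_unlabeled, loss_positive, prior_pi) - lam * loss_positive

print(training_objective(torch.tensor(1.0), torch.tensor(0.8), prior_pi=0.1))
```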
Authors:Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
Abstract:
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
中文: 本研究揭示了直接偏好优化(DPO)在大语言模型安全对齐中的缺陷,提出通过分离拒绝训练与有害知识遗忘的改进方法,借助词级加权和分布分析显著提升了针对各类越狱攻击的鲁棒性。
English: This study identifies vulnerabilities in Direct Preference Optimization (DPO) for LLM safety alignment and proposes an enhanced method that separates refusal training from harmful knowledge unlearning, significantly boosting robustness against diverse jailbreak attacks through token-level weighting and distribution analysis.
Authors:Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang
Abstract:
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require complex pipeline implementations. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, recognize the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension, supported by theoretical analysis. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a comparable time complexity to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both prefilling and decoding phases compared with dynamic sparse attention and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
中文总结:PowerAttention是一种新颖的稀疏注意力设计,能够在自回归大语言模型中实现指数级感受野扩展,在保持高计算效率的同时,显著提升了长上下文任务的处理性能。
English Summary: PowerAttention is a novel sparse attention design that enables exponential receptive field growth in autoregressive LLMs, achieving superior performance on long-context tasks while maintaining high computational efficiency.
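As a concrete illustration of the exponential-receptive-field idea, the sketch below builds a static causal mask in which each query attends to itself and to keys at power-of-two offsets, so that stacking $d$ such layers lets information reach roughly $2^d$ positions. The exact sparsity pattern, the dense masking, and the helper names are illustrative assumptions, not the released PowerAttention implementation.

```python
import numpy as np

def power_attention_mask(seq_len: int) -> np.ndarray:
    """Boolean causal mask where query i attends to keys at power-of-two
    offsets (i, i-1, i-2, i-4, i-8, ...).

    Assumption: the paper's exact pattern may differ; this only shows how an
    exponentially growing receptive field can be wired with O(log n)
    attended positions per token.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, i] = True           # always attend to self
        offset = 1
        while i - offset >= 0:
            mask[i, i - offset] = True
            offset *= 2             # offsets 1, 2, 4, 8, ...
    return mask

def masked_softmax_attention(q, k, v, mask):
    """Standard scaled dot-product attention restricted to `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)   # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    out = masked_softmax_attention(q, k, v, power_attention_mask(n))
    print(out.shape)  # (16, 8)
```

Each token attends to only O(log n) positions per layer, which is how the design keeps a time complexity comparable to sliding window attention while avoiding gaps in the effective context.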
Authors:Wonjun Kang, Kevin Galim, Yuchen Zeng, Minjae Lee, Hyung Il Koo, Nam Ik Cho
Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to Transformers, mitigating their quadratic computational cost. However, the application of Parameter-Efficient Fine-Tuning (PEFT) methods to SSMs remains largely unexplored. In particular, prompt-based methods like Prompt Tuning and Prefix-Tuning, which are widely used in Transformers, do not perform well on SSMs. To address this, we propose state-based methods as a superior alternative to prompt-based methods. This new family of methods naturally stems from the architectural characteristics of SSMs. State-based methods adjust state-related features directly instead of depending on external prompts. Furthermore, we introduce a novel state-based PEFT method: State-offset Tuning. At every timestep, our method directly affects the state at the current step, leading to more effective adaptation. Through extensive experiments across diverse datasets, we demonstrate the effectiveness of our method. Code is available at https://github.com/furiosa-ai/ssm-state-tuning.
中文: 状态空间模型(SSMs)虽比Transformer计算高效,但现有参数高效微调方法效果不佳,因此我们提出基于状态的优化方法,如状态偏移调优,直接调整状态特征以提升性能。
English: State Space Models (SSMs) offer computational efficiency over Transformers, but current Parameter-Efficient Fine-Tuning methods underperform, prompting the introduction of state-based techniques like State-offset Tuning that directly modify state features for better adaptation.
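A minimal sketch of the state-based PEFT idea on a toy diagonal SSM: the pretrained recurrence parameters stay frozen, and only a small additive offset applied to the hidden state at every timestep is trained. The parameterization, initialization, and the class name FrozenSSMWithStateOffset are assumptions for illustration; the actual State-offset Tuning formulation for Mamba-style SSMs may differ.

```python
import torch
import torch.nn as nn

class FrozenSSMWithStateOffset(nn.Module):
    """Toy diagonal SSM h_t = A*h_{t-1} + B*x_t, y_t = C*h_t, where the
    pretrained A, B, C are frozen and only a learnable per-channel state
    offset is trained (a sketch of the state-based PEFT idea)."""

    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.A = nn.Parameter(torch.rand(d_state) * 0.9, requires_grad=False)
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.02, requires_grad=False)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.02, requires_grad=False)
        # The only trainable parameter: an additive offset applied to the state.
        self.state_offset = nn.Parameter(torch.zeros(d_state))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        h = torch.zeros(batch, self.A.shape[0], device=x.device)
        outputs = []
        for t in range(seq_len):
            h = self.A * h + x[:, t] @ self.B.T
            h = h + self.state_offset          # state-based adaptation
            outputs.append(h @ self.C.T)
        return torch.stack(outputs, dim=1)

model = FrozenSSMWithStateOffset(d_model=32, d_state=16)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['state_offset']
```

Because the adaptation acts on the recurrent state rather than on an external prompt prepended to the input, it intervenes at every timestep, which is the contrast with prompt-based PEFT the abstract draws.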
Authors:Ahmed E. Samy, Zekarias T. Kefato, Sarunas Girdzijauskas
Abstract:
Link prediction is a crucial task in many downstream applications of graph machine learning. To this end, Graph Neural Networks (GNNs) are a widely used technique for link prediction, mainly in transductive settings, where the goal is to predict missing links between existing nodes. However, many real-life applications require an inductive setting that accommodates new nodes joining an existing graph. Thus, inductive link prediction has recently attracted considerable attention, and a multi-layer perceptron (MLP) is the popular choice of most studies to learn node representations. However, these approaches have limited expressivity and do not fully capture the graph's structural signal. Therefore, in this work we propose LEAP, an inductive link prediction method based on LEArnable toPology augmentation. Unlike previous methods, LEAP models the inductive bias from both the structure and node features, and hence is more expressive. To the best of our knowledge, this is the first attempt to provide structural contexts for new nodes via learnable augmentation in inductive settings. Extensive experiments on seven real-world homogeneous and heterogeneous graphs demonstrate that LEAP significantly surpasses SOTA methods. The improvements are up to 22\% and 17\% in terms of AUC and average precision, respectively. The code and datasets are available on GitHub (https://github.com/AhmedESamy/LEAP/).
Chinese: LEAP是一种创新的归纳式链接预测方法,通过结合结构和节点特征的可学习拓扑增强来提高表达能力,在AUC指标上显著超越现有最优方法,提升幅度高达22%。
English: LEAP is a novel inductive link prediction method that enhances expressiveness by incorporating learnable topology augmentation from both structural and feature perspectives, significantly outperforming state-of-the-art techniques with up to 22% improvement in AUC.
Authors:Xi Zhu, Haochen Xue, Ziwei Zhao, Wujiang Xu, Jingyuan Huang, Minghao Guo, Qifan Wang, Kaixiong Zhou, Yongfeng Zhang
Abstract:
Text-Attributed Graphs (TAGs), where each node is associated with text descriptions, are ubiquitous in real-world scenarios. They typically exhibit distinctive structure and domain-specific knowledge, motivating the development of a Graph Foundation Model (GFM) that generalizes across diverse graphs and tasks. Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures with two-stage alignment, limiting their synergistic potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens to graph nodes, leading to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, which hinders cross-graph and cross-task transferability. To address these challenges, we propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning. PromptGFM comprises two key components: (1) Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, facilitating seamless GNN-LLM integration and elegant graph-text alignment; (2) Graph Inference Module, which establishes a language-based graph vocabulary ensuring expressiveness, transferability, and scalability, enabling readable instructions for LLM fine-tuning. Extensive experiments demonstrate our superiority and transferability across diverse graphs and tasks. The code is available at this: https://github.com/agiresearch/PromptGFM.
中文: PromptGFM是一种基于图词汇学习的通用图基础模型,通过图理解模块和图推理模块无缝融合大语言模型与图神经网络,显著提升了跨图和跨任务的迁移性能。
English: PromptGFM is a versatile Graph Foundation Model that integrates Large Language Models and Graph Neural Networks through graph vocabulary learning to enhance cross-graph and cross-task transferability.
Authors:Alexander Kolpakov, Igor Rivin
Abstract:
We propose a new dimensionality reduction toolkit designed to address some of the challenges faced by traditional methods like UMAP and tSNE such as loss of global structure and computational efficiency. Built on the JAX framework, DiRe leverages modern hardware acceleration to provide an efficient, scalable, and interpretable solution for visualizing complex data structures, and for quantitative analysis of lower-dimensional embeddings. The toolkit shows considerable promise in preserving both local and global structures within the data as compared to state-of-the-art UMAP and tSNE implementations. This makes it suitable for a wide range of applications in machine learning, bio-informatics, and data science.
中文: DiRe是一种基于JAX的新型降维工具包,能高效、可扩展且可解释地实现数据可视化和分析,在保留数据局部与全局结构方面优于UMAP和tSNE。
English: DiRe is a new JAX-based dimensionality reduction toolkit that offers efficient, scalable, and interpretable solutions for data visualization and analysis, outperforming UMAP and tSNE in preserving both local and global data structures.
Authors:Xihan Qin, Li Liao
Abstract:
Comorbidity, the co-occurrence of multiple medical conditions in a single patient, profoundly impacts disease management and outcomes. Understanding these complex interconnections is crucial, especially in contexts where comorbidities exacerbate outcomes. Leveraging insights from the human interactome (HI) and advancements in graph-based methodologies, this study introduces Transformer with Subgraph Positional Encoding (TSPE) for disease comorbidity prediction. Inspired by Biologically Supervised Embedding (BSE), TSPE employs Transformer's attention mechanisms and Subgraph Positional Encoding (SPE) to capture interactions between nodes and disease associations. Our proposed SPE proves more effective than LPE, as used in Dwivedi et al.'s Graph Transformer, underscoring the importance of integrating clustering and disease-specific information for improved predictive accuracy. Evaluated on real clinical benchmark datasets (RR0 and RR1), TSPE demonstrates substantial performance enhancements over the state-of-the-art method, achieving up to 28.24% higher ROC AUC and 4.93% higher accuracy. This method shows promise for adaptation to other complex graph-based tasks and applications. The source code is available in the GitHub repository at: https://github.com/xihan-qin/TSPE-GraphTransformer.
中文: 本研究提出TSPE图变换器模型,通过子图位置编码和生物监督嵌入机制,在疾病共病预测中实现了显著性能提升,准确率提高达4.93%,ROC AUC提升28.24%。
English: This study introduces TSPE, a graph-based Transformer model with subgraph positional encoding that significantly improves disease comorbidity prediction accuracy by capturing complex disease interactions through biological insights.
Authors:Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia-Liang Kuo, Yi-Ting Chen, Zackory Erickson, David Held
Abstract:
This paper presents ArticuBot, in which a single learned policy enables a robotics system to open diverse categories of unseen articulated objects in the real world. This task has long been challenging for robotics due to the large variations in the geometry, size, and articulation types of such objects. Our system, ArticuBot, consists of three parts: generating a large number of demonstrations in physics-based simulation, distilling all generated demonstrations into a point cloud-based neural policy via imitation learning, and performing zero-shot sim2real transfer to real robotic systems. Utilizing sampling-based grasping and motion planning, our demonstration generation pipeline is fast and effective, generating a total of 42.3k demonstrations over 322 training articulated objects. For policy learning, we propose a novel hierarchical policy representation, in which the high-level policy learns the sub-goal for the end-effector, and the low-level policy learns how to move the end-effector conditioned on the predicted goal. We demonstrate that this hierarchical approach achieves much better object-level generalization compared to the non-hierarchical version. We further propose a novel weighted displacement model for the high-level policy that grounds the prediction into the existing 3D structure of the scene, outperforming alternative policy representations. We show that our learned policy can zero-shot transfer to three different real robot settings: a fixed table-top Franka arm across two different labs, and an X-Arm on a mobile base, opening multiple unseen articulated objects across two labs, real lounges, and kitchens. Videos and code can be found on our project website: https://articubot.github.io/.
Authors:Yue Meng, Nathalie Majcherczyk, Wenliang Liu, Scott Kiesel, Chuchu Fan, Federico Pecora
Abstract:
Multi-agent coordination is crucial for reliable multi-robot navigation in shared spaces such as automated warehouses. In regions of dense robot traffic, local coordination methods may fail to find a deadlock-free solution. In these scenarios, it is appropriate to let a central unit generate a global schedule that decides the passing order of robots. However, the runtime of such centralized coordination methods increases significantly with the problem scale. In this paper, we propose to leverage Graph Neural Network Variational Autoencoders (GNN-VAE) to solve the multi-agent coordination problem at scale faster than through centralized optimization. We formulate the coordination problem as a graph problem and collect ground truth data using a Mixed-Integer Linear Program (MILP) solver. During training, our learning framework encodes good quality solutions of the graph problem into a latent space. At inference time, solution samples are decoded from the sampled latent variables, and the lowest-cost sample is selected for coordination. Finally, the feasible proposal with the highest performance index is selected for the deployment. By construction, our GNN-VAE framework returns solutions that always respect the constraints of the considered coordination problem. Numerical results show that our approach trained on small-scale problems can achieve high-quality solutions even for large-scale problems with 250 robots, being much faster than other baselines. Project page: https://mengyuest.github.io/gnn-vae-coord
Authors:Yue Meng, Chuchu Fan
Abstract:
Generating realistic simulations is critical for autonomous system applications such as self-driving and human-robot interactions. However, driving simulators nowadays still have difficulty generating controllable, diverse, and rule-compliant behaviors for road participants: rule-based models cannot produce diverse behaviors and require careful tuning, whereas learning-based methods imitate the policy from data but are not designed to follow the rules explicitly. Besides, real-world datasets are by nature "single-outcome", making it hard for learning methods to generate diverse behaviors. In this paper, we leverage Signal Temporal Logic (STL) and Diffusion Models to learn a controllable, diverse, and rule-aware policy. We first calibrate the STL on the real-world data, then generate diverse synthetic data using trajectory optimization, and finally learn the rectified diffusion policy on the augmented dataset. We test on the NuScenes dataset, and our approach achieves the most diverse rule-compliant trajectories compared to other baselines, with a runtime 1/17 that of the second-best approach. In closed-loop testing, our approach reaches the highest diversity, the highest rule satisfaction rate, and the lowest collision rate. Our method can generate varied characteristics conditioned on different STL parameters at test time. A case study on human-robot encounter scenarios shows our approach can generate diverse and close-to-oracle trajectories. The annotation tool, augmented dataset, and code are available at https://github.com/mengyuest/pSTL-diffusion-policy.
中文: 本文提出了一种结合信号时序逻辑和扩散模型的方法,用于生成可控、多样且符合规则的自动驾驶行为,在多样性、规则遵循和防撞性能上均优于现有基准方法。
English: This paper introduces a method combining Signal Temporal Logic and Diffusion Models to generate controllable, diverse, and rule-compliant behaviors for autonomous systems, achieving superior performance in diversity, rule adherence, and collision avoidance compared to existing approaches.
Authors:Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Abstract:
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
Chinese: 本文提出了首个手术视觉世界模型,通过利用未标记视频数据生成可控制动作的手术数据,并在SurgToolLoc-2022数据集上进行了验证。
English: This paper introduces the first surgical vision world model, which generates action-controllable surgical data by leveraging unlabeled video data and has been validated on the SurgToolLoc-2022 dataset.
Authors:Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
Abstract:
In this paper, we present a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing page views and article content to study Wikipedia's recent changes and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models might shift as well. Moreover, the effectiveness of RAG might decrease if the knowledge base becomes polluted by LLM-generated content. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks.
中文: 本研究分析了大语言模型对维基百科的影响,显示某些类别受到约1%-2%的影响,并指出因内容污染可能导致基准测试分数虚高及检索增强生成效果下降等风险。
English: This study analyzes how Large Language Models are influencing Wikipedia, showing a 1%-2% impact in some areas and highlighting risks like inflated benchmark scores and reduced effectiveness of retrieval-augmented generation due to potential content pollution.
Authors:Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
Abstract:
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
中文: 变换器语言模型能够学习可解释的状态跟踪机制,如关联扫描和启发式剪枝,用于排列组合等任务,且这些机制的出现可通过训练进行预测和控制。
English: Transformer language models learn interpretable state tracking mechanisms, such as associative scans and heuristic-based pruning, for tasks like permutation composition, and their emergence can be predicted and controlled through training.
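For readers unfamiliar with the task, the snippet below spells out permutation composition over a sequence of swaps and the prefix "state" after each step; because composition is associative, the same prefixes can also be computed with a logarithmic-depth associative scan, which is the mechanism the paper traces inside trained transformers. The swap encoding and helper names are illustrative assumptions.

```python
def compose(p, q):
    """Composition of permutations given as tuples: (p ∘ q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def swap(n, i, j):
    """Permutation of n elements that swaps positions i and j."""
    perm = list(range(n))
    perm[i], perm[j] = perm[j], perm[i]
    return tuple(perm)

def prefix_states(swaps, n):
    """State after each swap, computed as a left-to-right scan.

    Because composition is associative, these prefixes could also be computed
    in O(log T) parallel steps (the associative-scan construction); here we
    keep the sequential scan for clarity.
    """
    states, state = [], tuple(range(n))
    for s in swaps:
        state = compose(s, state)
        states.append(state)
    return states

sequence = [swap(4, 0, 1), swap(4, 2, 3), swap(4, 1, 2)]
print(prefix_states(sequence, 4))
# Each swap flips parity, so after 3 swaps the permutation is odd -- the cheap
# feature the paper's second mechanism uses to prune the output space.
```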
Authors:Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Abstract:
In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. The ignorance of token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence speed. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model, by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
中文摘要:AlignDistil是一种基于令牌级奖励优化的对齐方法,通过将DPO奖励融入RLHF目标并设计自适应对数外推机制,有效提升大语言模型对齐性能并加快收敛速度。
English Summary: AlignDistil is a token-level reward optimization method that enhances LLM alignment by integrating DPO rewards into RLHF objectives, using adaptive token-level distillation to improve performance and accelerate convergence.
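A minimal sketch of the token-level distillation objective described in the abstract: the teacher distribution is formed by pushing the reference model's logits along the DPO-minus-reference direction, and the student is trained with a per-token KL divergence toward that teacher. The fixed coefficient beta, the omission of the contrastive reverse-DPO reward, and the absence of the token-adaptive extrapolation mechanism are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def aligndistil_teacher_logits(dpo_logits, ref_logits, beta: float = 1.0):
    """Teacher = reference logits pushed along the (DPO - reference) direction.

    Sketch of the 'linear combination of logits' described in the abstract;
    the exact coefficient schedule is an assumption here.
    """
    return ref_logits + beta * (dpo_logits - ref_logits)

def token_level_distill_loss(student_logits, dpo_logits, ref_logits, beta=1.0):
    """Per-token KL(teacher || student), averaged over the sequence."""
    teacher = aligndistil_teacher_logits(dpo_logits, ref_logits, beta)
    teacher_logprobs = F.log_softmax(teacher, dim=-1)
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(student_logprobs, teacher_logprobs,
                  log_target=True, reduction="none").sum(-1)
    return kl.mean()

# toy shapes: (batch=2, seq_len=5, vocab=11)
s, d, r = (torch.randn(2, 5, 11) for _ in range(3))
print(token_level_distill_loss(s, d, r).item())
```

Because the loss is a full distribution match at every position rather than a single sequence-level preference term, each token receives its own dense learning signal, which is the property the abstract credits for the faster convergence.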
Authors:Marta Skreta, Tara Akhound-Sadegh, Viktor Ohanesian, Roberto Bondesan, Alán Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong, Kirill Neklyudov
Abstract:
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.
中文: 本文提出了Feynman-Kac校正器(FKCs),这是一种基于Feynman-Kac公式的高效原理性采样方法,通过序列蒙特卡洛算法改进了预训练得分模型的受控采样,有效解决了现有无分类器引导方法的局限性。
English: This paper introduces Feynman-Kac Correctors (FKCs), an efficient and principled method for controlled sampling from pretrained score-based models, addressing limitations in existing classifier-free guidance by leveraging Sequential Monte Carlo algorithms to improve sampling quality across various applications.
Authors:Jie Wu, Haoling Li, Xin Zhang, Xiao Liu, Yangyu Huang, Jianwen Luo, Yizhen Zhang, Zuchao Li, Ruihang Chu, Yujiu Yang, Scarlett Li
Abstract:
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
Chinese: Target-DPO提出了一种新的代码大模型偏好对齐框架,通过精确定位错误区域并对其相应标记进行对齐来模拟人类调试过程,从而在代码生成任务中实现显著性能提升并减少错误。
English: Target-DPO introduces a novel preference alignment framework for Code LLMs that mimics human debugging by explicitly locating error regions and aligning corresponding tokens, leading to significant performance gains and fewer errors in code generation tasks.
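A rough sketch of the core idea of aligning only the located error region: a DPO-style preference loss is computed over per-token log-probability ratios, but the sums are restricted to a mask marking the edited tokens. The tensor layout, the shared mask for chosen and rejected completions, and the exact loss form are assumptions; Target-DPO's tailored objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def masked_dpo_loss(pos_logps, neg_logps, pos_mask, neg_mask, beta=0.1):
    """DPO-style preference loss computed only over 'target' tokens.

    pos_logps / neg_logps: (batch, seq_len) per-token log-prob differences
    between the policy and the reference model for the corrected (chosen)
    and buggy (rejected) completions. pos_mask / neg_mask mark the located
    error-correction regions; everything outside them is ignored.
    """
    pos = (pos_logps * pos_mask).sum(-1)   # sum log-ratio over target tokens
    neg = (neg_logps * neg_mask).sum(-1)
    return -F.logsigmoid(beta * (pos - neg)).mean()

# toy example: batch of 2, sequences of length 6, only 2 tokens differ
pos_logps = torch.randn(2, 6)
neg_logps = torch.randn(2, 6)
mask = torch.zeros(2, 6)
mask[:, 3:5] = 1.0                          # the located error region
print(masked_dpo_loss(pos_logps, neg_logps, mask, mask).item())
```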
Authors:Zirun Guo, Tao Jin
Abstract:
Test-Time Adaptation (TTA) aims to tackle distribution shifts using unlabeled test data without access to the source data. In the context of multimodal data, there are more complex noise patterns than in unimodal data, such as simultaneous corruptions of multiple modalities and missing modalities. Besides, in real-world applications, corruptions from different distribution shifts are always mixed. Existing TTA methods fail in such multimodal scenarios because the abrupt distribution shifts destroy the prior knowledge from the source model, leading to performance degradation. To this end, we reveal a new challenge named multimodal wild TTA. To address this challenging problem, we propose two novel strategies: sample identification with interquartile range Smoothing and unimodal assistance, and Mutual information sharing (SuMi). SuMi smooths the adaptation process via the interquartile range, which avoids abrupt distribution shifts. Then, SuMi fully utilizes the unimodal features to select low-entropy samples with rich multimodal information for optimization. Furthermore, mutual information sharing is introduced to align the information, reduce the discrepancies, and enhance the information utilization across different modalities. Extensive experiments on two public datasets show the effectiveness and superiority over existing methods under the complex noise patterns in multimodal data. Code is available at https://github.com/zrguo/SuMi.
中文摘要:测试时自适应(TTA)旨在利用未标记测试数据应对分布偏移,但在复杂多模态噪声场景中表现不佳;提出的SuMi方法通过平滑适应过程和增强跨模态信息共享,有效解决了这一难题。
English Summary: Test-Time Adaptation (TTA) addresses distribution shifts using unlabeled test data, but struggles with complex multimodal noise patterns; the proposed SuMi method overcomes this by smoothing adaptation processes and enhancing cross-modal information sharing.
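One plausible reading of the two test-time ingredients, sketched below under stated assumptions: an interquartile-range filter discards samples whose prediction entropy is an outlier (smoothing the adaptation signal), and the surviving samples are ranked by entropy so that only low-entropy ones drive the update. Which statistic SuMi actually smooths, and how the unimodal features enter the selection, are not reproduced here.

```python
import torch

def iqr_filter(values: torch.Tensor, k: float = 1.5) -> torch.Tensor:
    """Boolean mask keeping values inside [Q1 - k*IQR, Q3 + k*IQR].

    Sketch of interquartile-range smoothing: outlier samples are dropped so
    that abrupt shifts do not dominate the adaptation step. The smoothed
    statistic (here, prediction entropy) is an assumption.
    """
    q1, q3 = torch.quantile(values, 0.25), torch.quantile(values, 0.75)
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

def select_low_entropy(logits: torch.Tensor, keep_ratio: float = 0.5):
    """Pick the lowest-entropy samples among those surviving the IQR filter."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    candidates = torch.nonzero(iqr_filter(entropy), as_tuple=False).flatten()
    k = max(1, int(keep_ratio * len(candidates)))
    order = entropy[candidates].argsort()
    return candidates[order[:k]]

logits = torch.randn(32, 10)                 # a test batch of 32 samples
print(select_low_entropy(logits)[:5])        # indices used for adaptation
```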
Authors:Yujiao Yang, Jing Lian, Linhui Li
Abstract:
Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, conventional MoE architectures suffer from suboptimal coordination dynamics, where isolated expert operations expose the model to overfitting risks. Moreover, they have not been effectively extended to attention blocks, which limits further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts and applies a hierarchical routing mechanism to allocate input subspaces to specialized experts. Our approach advances MoE design with four key innovations: (1) constructing expert groups by partitioning non-MoE models into functionally equivalent specialists; (2) developing a hierarchical routing paradigm that integrates patch-wise data selection and expert selection strategies; (3) extending the MoE design to attention blocks; and (4) proposing a hardware-optimized parallelization scheme that exploits batched matrix multiplications for efficient expert computation. Experiments demonstrate that our UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers in several tasks across image and natural language domains. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method with only 76% of its FLOPs. In the Long Range Arena benchmark, it demonstrates an average score at least 0.68% higher than all comparison models, with only 50% of the FLOPs of the best MoE method. In image classification, it yields an average accuracy improvement of 1.75% over the best model while maintaining comparable FLOPs. The source code is available at https://github.com/YujiaoYang-work/UoE.
中文: 提出的专家联盟(UoE)模型通过引入分层路由机制并将专家设计扩展至注意力模块,解决了传统MoE架构的局限性,在语言和图像任务中以更低计算成本实现了更优性能。
English: The proposed Union-of-Experts (UoE) model addresses limitations in conventional MoE architectures by introducing hierarchical routing and extending expert design to attention blocks, achieving superior performance with reduced computational costs across language and image tasks.
Authors:Nikita Kazeev, Wei Nong, Ignat Romanov, Ruiming Zhu, Andrey Ustyuzhanin, Shuya Yamazaki, Kedar Hippalgaonkar
Abstract:
Crystal symmetry plays a fundamental role in determining its physical, chemical, and electronic properties such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. However, this is often inadequately addressed by existing generative models, making the consistent generation of stable and symmetrically valid crystal structures a significant challenge. We introduce WyFormer, a generative model that directly tackles this by formally conditioning on space group symmetry. It achieves this by using Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on the Transformer encoder and an absence of positional encoding. Extensive experimentation demonstrates WyFormer's compelling combination of attributes: it achieves best-in-class symmetry-conditioned generation, incorporates a physics-motivated inductive bias, produces structures with competitive stability, predicts material properties with competitive accuracy even without atomic coordinates, and exhibits unparalleled inference speed.
中文: WyFormer是一种生成模型,通过基于空间群对称性并使用Wyckoff位置进行结构表示,解决了生成稳定且对称有效晶体结构的难题,实现了卓越的对称性遵循、竞争性稳定性和快速推理能力。
English: WyFormer is a generative model that overcomes the challenge of producing stable, symmetry-valid crystal structures by conditioning on space group symmetry and using Wyckoff positions for representation, achieving superior symmetry compliance, competitive stability, and fast inference.
Authors:Yunzhen He, Yusuke Takase, Yoichi Ishibashi, Hidetoshi Shimodaira
Abstract:
Large Language Models (LLMs) are increasingly being used in real-world applications. However, concerns about the reliability of the content they generate persist, as it frequently deviates from factual correctness or exhibits deficiencies in logical reasoning. This paper proposes a novel decoding strategy aimed at enhancing both factual accuracy and inferential reasoning without requiring any modifications to the architecture or pre-trained parameters of LLMs. Our approach adjusts next-token probabilities by analyzing the trajectory of logits from lower to higher layers in Transformers and applying linear regression. We find that this Decoding by Logit Trajectory-based approach (DeLTa) effectively reinforces factuality and reasoning while mitigating incorrect generation. Experiments on TruthfulQA demonstrate that DeLTa attains up to a 4.9% improvement over the baseline. Furthermore, it enhances performance by up to 8.1% on StrategyQA and 7.3% on GSM8K, both of which demand strong reasoning capabilities.
Chinese: 本文提出DeLTa解码策略,通过基于对数轨迹分析调整令牌概率来增强大语言模型的事实准确性和推理能力,在不改变模型架构的情况下,在推理任务上实现了高达8.1%的性能提升。
English: This paper introduces DeLTa, a novel decoding strategy that enhances the factual accuracy and reasoning capabilities of Large Language Models by adjusting token probabilities based on logit trajectory analysis, achieving improvements of up to 8.1% on reasoning tasks without modifying model architecture.
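A minimal sketch of the logit-trajectory idea: apply the output head to every layer's hidden state (logit-lens style), fit a per-token linear regression over depth, and extrapolate the fitted trend slightly beyond the final layer to obtain adjusted next-token scores. The normalized depth axis, the extrapolation point, and using the fitted value directly as the new logit are assumptions; DeLTa's exact combination rule may differ.

```python
import numpy as np

def delta_adjusted_logits(layer_logits: np.ndarray, extrapolate_to: float = 1.5):
    """Fit a per-token linear trend over the depth-wise logit trajectory and
    extrapolate it slightly past the final layer.

    layer_logits: (num_layers, vocab) logits obtained by applying the output
    head to each layer's hidden state. Both the extrapolation target and the
    direct use of the fitted value are illustrative simplifications.
    """
    num_layers, vocab = layer_logits.shape
    depth = np.linspace(0.0, 1.0, num_layers)             # normalized depth
    X = np.stack([np.ones_like(depth), depth], axis=1)    # (L, 2) design matrix
    # least-squares fit per vocabulary entry: logits ≈ a + b * depth
    coeffs, *_ = np.linalg.lstsq(X, layer_logits, rcond=None)  # (2, vocab)
    a, b = coeffs
    return a + b * extrapolate_to                          # (vocab,)

rng = np.random.default_rng(0)
traj = np.cumsum(rng.standard_normal((24, 50)), axis=0)    # fake 24-layer trajectory
adjusted = delta_adjusted_logits(traj)
print(adjusted.shape, int(adjusted.argmax()))
```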
Authors:Xueliang Zhao, Wei Wu, Jian Guan, Lingpeng Kong
Abstract:
The ability of large language models to solve complex mathematical problems has progressed significantly, particularly for tasks requiring advanced reasoning. However, the scarcity of sufficiently challenging problems, particularly at the Olympiad level, hinders further advancements. In this work, we introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction, emulating the thought processes of experienced problem designers. We provide a theoretical analysis demonstrating that an optimal rationale should maximize both the likelihood of rationale generation given the associated concepts and the likelihood of problem generation conditioned on both the rationale and the concepts. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods. Furthermore, we demonstrate that PromptCoT exhibits superior data scalability, consistently maintaining high performance as the dataset size increases, outperforming the baselines. The implementation is available at https://github.com/zhaoxlpku/PromptCoT.
Chinese: PromptCoT提出了一种模拟专家推理自动生成高质量奥林匹克数学题的新方法,在基准测试中优于现有方法,并展现出随数据增长更优的扩展性。
English: PromptCoT introduces a novel method for automatically generating high-quality Olympiad-level math problems by emulating expert reasoning, which outperforms existing approaches on benchmarks and demonstrates superior scalability with increasing data.
Authors:Yusei Ito, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku, Kanta Ono
Abstract:
Crystal structure modeling with graph neural networks is essential for various applications in materials informatics, and capturing SE(3)-invariant geometric features is a fundamental requirement for these networks. A straightforward approach is to model with orientation-standardized structures through structure-aligned coordinate systems, or"frames." However, unlike molecules, determining frames for crystal structures is challenging due to their infinite and highly symmetric nature. In particular, existing methods rely on a statically fixed frame for each structure, determined solely by its structural information, regardless of the task under consideration. Here, we rethink the role of frames, questioning whether such simplistic alignment with the structure is sufficient, and propose the concept of dynamic frames. While accommodating the infinite and symmetric nature of crystals, these frames provide each atom with a dynamic view of its local environment, focusing on actively interacting atoms. We demonstrate this concept by utilizing the attention mechanism in a recent transformer-based crystal encoder, resulting in a new architecture called CrystalFramer. Extensive experiments show that CrystalFramer outperforms conventional frames and existing crystal encoders in various crystal property prediction tasks.
Authors:Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Abstract:
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. Existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.
中文: 提出的DivPrune方法将令牌剪枝构建为最大最小多样性问题,通过最大化所选视觉令牌的多样性,在无需微调的情况下实现了跨多个数据集的最优精度,同时降低了延迟和内存使用。
English: The proposed DivPrune method formulates token pruning as a Max-Min Diversity Problem to maximize diversity among selected visual tokens, achieving state-of-the-art accuracy across multiple datasets while reducing latency and memory usage without requiring fine-tuning.
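The Max-Min Diversity Problem is NP-hard in general, but a greedy farthest-point selection is a standard approximation and is cheap enough to run per forward pass. The sketch below applies it to visual token embeddings with cosine distance; whether DivPrune uses exactly this greedy variant, this distance, or this seeding is an assumption.

```python
import torch
import torch.nn.functional as F

def divprune_select(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedy max-min (farthest-point) selection of `keep` visual tokens.

    tokens: (num_tokens, dim) visual token embeddings.
    Returns indices of a diverse subset: at each step, pick the token whose
    minimum distance to the already-selected set is largest.
    """
    feats = F.normalize(tokens, dim=-1)
    dist = 1.0 - feats @ feats.T                       # cosine distance matrix
    selected = [0]                                     # arbitrary seed token
    min_dist = dist[0].clone()
    for _ in range(keep - 1):
        nxt = int(torch.argmax(min_dist))              # farthest from selected set
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, dist[nxt])  # update distances to set
    return torch.tensor(selected)

visual_tokens = torch.randn(576, 1024)                 # e.g. a ViT patch grid
kept = divprune_select(visual_tokens, keep=64)
print(kept.shape)                                      # torch.Size([64])
```

Because the criterion depends only on pairwise distances among the tokens themselves, no calibration data or fine-tuning is needed, which is the training-free property the abstract emphasizes.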
Authors:Jiacheng Zhang, Benjamin I. P. Rubinstein, Jingfeng Zhang, Feng Liu
Abstract:
Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancies between clean examples (CEs) and AEs. In this paper, we explore the strength of SADD-based methods by theoretically showing that minimizing distributional discrepancy can help reduce the expected loss on AEs. Despite these advantages, SADD-based methods have a potential limitation: they discard inputs that are detected as AEs, leading to the loss of useful information within those inputs. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-discrepancy-based Adversarial Defense (DAD). In the training phase, DAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, which serves two purposes. MMD-OPT first acts as a guiding signal to minimize the distributional discrepancy between CEs and AEs to train a denoiser. Then, it acts as a discriminator to differentiate CEs and AEs during inference. Overall, in the inference stage, DAD consists of a two-pronged process: (1) directly feeding the detected CEs into the classifier, and (2) removing noise from the detected AEs by the distributional-discrepancy-based denoiser. Extensive experiments show that DAD outperforms current state-of-the-art (SOTA) defense methods by simultaneously improving clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks. Codes are publicly available at: https://github.com/tmlr-group/DAD.
中文: 提出的基于分布差异的对抗防御方法(DAD)通过训练去噪器最小化干净样本与对抗样本的分布差异,在推理阶段对检测出的对抗样本进行净化处理,在CIFAR-10和ImageNet-1K数据集上实现了超越现有方法的清洁准确率与鲁棒准确率。
English: The proposed Distributional-discrepancy-based Adversarial Defense (DAD) method enhances security by training a denoiser to minimize distributional gaps between clean and adversarial examples, then using it to purify detected adversarial inputs during inference, achieving superior clean and robust accuracy on benchmark datasets.
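The statistic at the center of the method is the maximum mean discrepancy between a clean reference batch and an incoming batch. Below is a biased RBF-kernel estimator of squared MMD as a reference point; the fixed bandwidth and the omission of the test-power optimization (MMD-OPT) and of the denoiser training are simplifications.

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between batches x and y with an RBF kernel.

    This is the distributional-discrepancy statistic the method is built
    around; DAD additionally optimizes the kernel for test power, which is
    not shown here.
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

clean = torch.randn(128, 512)                 # features of clean examples
suspect = torch.randn(128, 512) + 0.3         # features of a shifted batch
print(float(rbf_mmd2(clean, suspect)))        # larger value => larger discrepancy
```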
Authors:Chia-Wei Hsu, Nien-Ti Tsou, Yu-Cheng Chen, Yang Jeong Park, Ju Li
Abstract:
Gradient-based optimization drives the unprecedented performance of modern deep neural network models across diverse applications. Adaptive algorithms have accelerated neural network training due to their rapid convergence rates; however, they struggle to find ``flat minima" reliably, resulting in suboptimal generalization compared to stochastic gradient descent (SGD). By revisiting various adaptive algorithms' mechanisms, we propose the Frankenstein optimizer, which combines their advantages. The proposed Frankenstein dynamically adjusts first- and second-momentum coefficients according to the optimizer's current state to directly maintain consistent learning dynamics and immediately reflect sudden gradient changes. Extensive experiments across several research domains such as computer vision, natural language processing, few-shot learning, and scientific simulations show that Frankenstein surpasses existing adaptive algorithms and SGD empirically regarding convergence speed and generalization performance. Furthermore, this research deepens our understanding of adaptive algorithms through centered kernel alignment analysis and loss landscape visualization during the learning process. Code is available at https://github.com/acctouhou/Frankenstein_optimizer
中文:弗兰肯斯坦优化器通过动态调整动量系数来保持稳定的学习动态并快速响应梯度变化,在多个领域中其收敛速度和泛化能力均超越了现有自适应算法及随机梯度下降法。
English: The Frankenstein optimizer dynamically adjusts momentum coefficients to maintain consistent learning dynamics and rapidly respond to gradient changes, outperforming both adaptive algorithms and SGD in convergence speed and generalization across multiple domains.
Authors:Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville
Abstract:
An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
Chinese: 遗忘变换器(FoX)通过数据依赖性地降低未归一化注意力分数,将遗忘门自然融入变换器中,在长上下文语言建模和短上下文任务中表现卓越,同时与FlashAttention算法兼容且无需位置嵌入。
English: The Forgetting Transformer (FoX) incorporates a forget gate into Transformers by adaptively down-weighting attention scores, achieving superior performance in long-context language modeling and short-context tasks while maintaining compatibility with FlashAttention and eliminating the need for positional embeddings.
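One natural way to realize the data-dependent down-weighting described in the abstract is to add, to each attention logit, the cumulative log of sigmoid forget gates between the key position and the query position, as in the single-head sketch below. This naive dense version ignores the FlashAttention-compatible formulation and multi-head details, and the gate parameterization is an assumption.

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, forget_logits):
    """Single-head causal attention with a data-dependent forget gate.

    forget_logits: (batch, seq_len) pre-sigmoid gate values. The logit for
    query i and key j is shifted by sum_{l=j+1..i} log f_l, so older keys are
    progressively down-weighted before the softmax.
    """
    b, t, d = q.shape
    log_f = F.logsigmoid(forget_logits)                   # (b, t), all <= 0
    cum = torch.cumsum(log_f, dim=-1)                     # prefix sums of log gates
    bias = cum.unsqueeze(-1) - cum.unsqueeze(-2)          # bias[i, j] = cum[i] - cum[j]
    scores = q @ k.transpose(-1, -2) / d ** 0.5 + bias
    causal = torch.tril(torch.ones(t, t)) > 0
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(2, 8, 16) for _ in range(3))
gates = torch.randn(2, 8)
print(forgetting_attention(q, k, v, gates).shape)         # torch.Size([2, 8, 16])
```

Note that the bias depends only on relative position through the learned gates, which is consistent with the abstract's observation that no explicit positional embeddings are required.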
Authors:Ruth Crasto
Abstract:
Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at test time. The most common approaches to tackling geographic distribution shift treat regions delimited by administrative boundaries such as countries or continents as separate domains and apply standard domain adaptation methods, ignoring geographic coordinates that are often available as metadata. This paper proposes the use of location encoders for modeling continuous, learnable domain assignment. We show how both non-parametric sine-cosine encoders and pre-trained location encoders can be used in conjunction with standard domain adaptation methods for improved robustness to geographic distribution shift. Our proposed methods achieve new state-of-the-art results on two geo-tagged remote sensing datasets from the WILDS benchmark. We have made our code publicly available at: https://github.com/crastoru/wilds-geoshift.
中文摘要:本文提出使用位置编码器来建模连续可学习的域分配,通过将其与标准域适应方法结合,在两个地理标记遥感数据集上实现了最先进的性能,有效应对地理分布偏移问题。
English Summary: This paper introduces location encoders to model continuous domain assignments for addressing geographic distribution shift, achieving state-of-the-art results on geo-tagged datasets by integrating them with standard domain adaptation methods.
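A sketch of the non-parametric sine-cosine location encoder mentioned in the abstract: latitude and longitude are converted to radians and expanded at geometrically spaced frequencies, yielding a continuous feature vector that can replace a discrete country or continent label as the domain descriptor fed to a domain adaptation method. The number of scales and the frequency schedule are illustrative assumptions.

```python
import numpy as np

def sine_cosine_location_encoding(lat_deg, lon_deg, num_scales: int = 4):
    """Encode (lat, lon) pairs into multi-scale sine/cosine features.

    The geometric frequency schedule (1, 2, 4, 8, ...) and four scales are
    assumptions; the resulting vector is a continuous, learnable-free domain
    descriptor for geographic distribution shift.
    """
    lat = np.radians(np.asarray(lat_deg, dtype=np.float64))
    lon = np.radians(np.asarray(lon_deg, dtype=np.float64))
    feats = []
    for s in range(num_scales):
        freq = 2.0 ** s
        feats += [np.sin(freq * lat), np.cos(freq * lat),
                  np.sin(freq * lon), np.cos(freq * lon)]
    return np.stack(feats, axis=-1)

coords = [(47.37, 8.55), (-1.29, 36.82)]      # Zurich, Nairobi
enc = sine_cosine_location_encoding(*zip(*coords))
print(enc.shape)                               # (2, 16)
```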
Authors:Mingjie Wen, Jiahe Han, Wenjuan Li, Xiaoya Chang, Qingzhao Chu, Dongping Chen
Abstract:
The discovery and optimization of high-energy materials (HEMs) are constrained by the prohibitive computational expense and prolonged development cycles inherent in conventional approaches. In this work, we develop a general neural network potential (NNP) that efficiently predicts the structural, mechanical, and decomposition properties of HEMs composed of C, H, N, and O. Our framework leverages pre-trained NNP models, fine-tuned using transfer learning on energy and force data derived from density functional theory (DFT) calculations. This strategy enables rapid adaptation across 20 different HEM systems while maintaining DFT-level accuracy, significantly reducing computational costs. A key aspect of this work is the ability of NNP model to capture the chemical activity space of HEMs, accurately describe the key atomic interactions and reaction mechanisms during thermal decomposition. The general NNP model has been applied in molecular dynamics (MD) simulations and validated with experimental data for various HEM structures. Results show that the NNP model accurately predicts the structural, mechanical, and decomposition properties of HEMs by effectively describing their chemical activity space. Compared to traditional force fields, it offers superior DFT-level accuracy and generalization across both microscopic and macroscopic properties, reducing the computational and experimental costs. This work provides an efficient strategy for the design and development of HEMs and proposes a promising framework for integrating DFT, machine learning, and experimental methods in materials research. (To facilitate further research and practical applications, we open-source our NNP model on GitHub: https://github.com/MingjieWen/General-NNP-model-for-C-H-N-O-Energetic-Materials.)
Chinese: 本研究开发了一种通用神经网络势,能够准确预测高能材料的结构、力学和分解特性,在保持密度泛函理论精度的同时显著降低了计算成本。
English: This study develops a general neural network potential that accurately predicts the structural, mechanical, and decomposition properties of high-energy materials, significantly reducing computational costs while maintaining DFT-level accuracy.
Authors:Wang YuHang, Junkang Guo, Aolei Liu, Kaihao Wang, Zaitong Wu, Zhenyu Liu, Wenfei Yin, Jian Liu
Abstract:
Adversarial robustness is a critical challenge in deploying deep neural networks for real-world applications. While adversarial training is a widely recognized defense strategy, most existing studies focus on balanced datasets, overlooking the prevalence of long-tailed distributions in real-world data, which significantly complicates robustness. This paper provides a comprehensive analysis of adversarial training under long-tailed distributions and identifies limitations in the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which integrates an initial stabilization phase followed by a stratified equalization adversarial training phase. Additionally, prior work on long-tailed robustness has largely ignored the crucial evaluation metric of balanced accuracy. To bridge this gap, we introduce the concept of balanced robustness, a comprehensive metric tailored for assessing robustness under long-tailed distributions. Extensive experiments demonstrate that our method surpasses existing advanced defenses, achieving significant improvements in both memory and computational efficiency. This work represents a substantial advancement in addressing robustness challenges in real-world applications. Our code is available at: https://github.com/BuhuiOK/TAET-Two-Stage-Adversarial-Equalization-Training-on-Long-Tailed-Distributions.
中文: 本文提出TAET这一新型两阶段对抗训练框架,通过稳定化与分层均衡化训练显著提升长尾数据分布下的模型鲁棒性与效率,并创新性地提出平衡鲁棒性指标以更准确评估现实场景中的防御性能。
English: This paper introduces TAET, a novel two-stage adversarial training framework that enhances robustness and efficiency for deep neural networks on long-tailed data distributions, while proposing the balanced robustness metric to better evaluate performance under such real-world conditions.
Authors:Lily Xu, Bryan Wilder, Elias B. Khalil, Milind Tambe
Abstract:
Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods -- which cannot address sequential planning and combinatorial selection simultaneously -- by an average of 24.8\% on these difficult instances.
中文摘要:强化学习在处理大规模组合动作空间时存在瓶颈,而SEQUOIA算法通过将Q网络嵌入混合整数规划直接优化长期奖励,在组合约束的 restless bandit 问题上比现有方法平均提升24.8%的性能。
English Summary: Reinforcement learning faces challenges with large combinatorial action spaces, but SEQUOIA overcomes this by embedding Q-networks into optimization programs to directly maximize long-term rewards, achieving 24.8% better performance on complex restless bandit problems.
Authors:Michal Spiegel, Michal Štefánik, Marek Kadlčík, Josef Kuchař
Abstract:
Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to distinguish genuine algorithmic understanding from memorization. In this paper, we propose AttentionSpan, an algorithmic benchmark comprising five tasks with infinite input domains where we can disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models' ability to extrapolate to unseen types of inputs, including new lengths, value ranges, or input domains, and (ii) the robustness of their learned mechanisms. By analyzing attention maps and performing targeted interventions, we show that the attention mechanism directly causes failures in extrapolation. We make the implementation of all our tasks and interpretability methods publicly available at https://github.com/michalspiegel/AttentionSpan.
Chinese: 本研究提出AttentionSpan基准,通过测试模型对未见输入的泛化能力及机制稳健性,评估Transformer是否真正学习算法而非仅记忆数据,揭示注意力机制是导致泛化失败的直接原因。
English: The study introduces AttentionSpan, a benchmark to evaluate if transformers genuinely learn algorithms or merely memorize data by testing their ability to extrapolate to unseen inputs and assessing the robustness of their mechanisms, revealing that attention causes extrapolation failures.
Authors:Jiawei Zhang, Shuang Yang, Bo Li
Abstract:
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Nevertheless, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, as these agents extensively reason or plan before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model's reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora.
中文: UDora是一个统一红队测试框架,通过针对性扰动劫持大语言模型智能体的推理过程,迫使其执行恶意操作,在多个数据集上表现优于现有方法。
English: UDora is a unified red teaming framework that hijacks LLM agents' reasoning processes through targeted perturbations to induce malicious actions, outperforming existing methods across multiple datasets.
Authors:Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon
Abstract:
Prior parameter-efficient fine-tuning (PEFT) algorithms reduce memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters, rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time; although the computational costs of the adapter layers are much smaller than the pretrained layers, it is well known that those two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training, the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model. PaCA not only enhances training speed by eliminating the time overhead due to the sequential processing of the adapter and pretrained layers but also reduces activation memory since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequence and improves throughput by 16% on both NVIDIA A100 GPU and INTEL Gaudi2 HPU compared to LoRA. The code is available at https://github.com/WooSunghyeon/paca.
中文: PaCA通过微调预训练权重中的部分连接,消除了适配器带来的延迟和内存开销,在保持精度的同时显著提升了训练效率。
English: PaCA enhances training efficiency by fine-tuning partial connections within pretrained weights, eliminating adapter-related latency and memory overhead while maintaining accuracy.
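A minimal sketch of partial-connection fine-tuning on a single linear layer: a random subset of weight columns is made trainable while the rest stay frozen, so the forward pass remains a single split matmul with no sequential adapter layer, and only the activations feeding the trainable columns need to be stored for the backward pass. Column-wise granularity, the split ratio, and the class name PaCALinear are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PaCALinear(nn.Module):
    """Linear layer where only a random subset of input columns is trainable.

    The frozen slice and the small trainable slice come from the same
    pretrained weight, so no extra adapter module is appended after the layer.
    """

    def __init__(self, pretrained: nn.Linear, num_trainable_cols: int):
        super().__init__()
        perm = torch.randperm(pretrained.in_features)
        self.register_buffer("trainable_idx", perm[:num_trainable_cols])
        self.register_buffer("frozen_idx", perm[num_trainable_cols:])
        w = pretrained.weight.data
        self.w_frozen = nn.Parameter(w[:, self.frozen_idx].clone(), requires_grad=False)
        self.w_train = nn.Parameter(w[:, self.trainable_idx].clone())
        self.bias = nn.Parameter(pretrained.bias.data.clone(), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only x[..., trainable_idx] must be kept for gradient computation.
        out = x[..., self.frozen_idx] @ self.w_frozen.T
        out = out + x[..., self.trainable_idx] @ self.w_train.T
        return out + self.bias

layer = PaCALinear(nn.Linear(1024, 1024), num_trainable_cols=64)
n_train = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_train)   # 1024 * 64 = 65536 trainable parameters
```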
Authors:Haoxin Liu, Zhiyuan Zhao, Shiduo Li, B. Aditya Prakash
Abstract:
Reasoning ability is crucial for solving challenging tasks. With the advancement of foundation models, such as the emergence of large language models (LLMs), a wide range of reasoning strategies has been proposed, including test-time enhancements such as Chain-of-Thought and post-training optimizations as used in DeepSeek-R1. While these reasoning strategies have demonstrated effectiveness across various challenging language or vision tasks, their applicability and impact on time-series forecasting (TSF), particularly the challenging zero-shot TSF, remain largely unexplored. In particular, it is unclear whether zero-shot TSF benefits from reasoning and, if so, what types of reasoning strategies are most effective. To bridge this gap, we propose ReC4TS, the first benchmark that systematically evaluates the effectiveness of popular reasoning strategies when applied to zero-shot TSF tasks. ReC4TS conducts comprehensive evaluations across datasets spanning eight domains, covering both unimodal and multimodal settings with short-term and long-term forecasting tasks. More importantly, ReC4TS provides key insights: (1) Self-consistency emerges as the most effective test-time reasoning strategy; (2) Group-relative policy optimization emerges as a more suitable approach for incentivizing reasoning ability during post-training; (3) Multimodal TSF benefits more from reasoning strategies compared to unimodal TSF. Beyond these insights, ReC4TS establishes two pioneering starting blocks to support future zero-shot TSF reasoning research: (1) a novel dataset, TimeThinking, containing forecasting samples annotated with reasoning trajectories from multiple advanced LLMs, and (2) a new and simple test-time scaling law validated on foundational TSF models, enabled by the self-consistency reasoning strategy. All data and code are publicly accessible at: https://github.com/AdityaLab/OpenTimeR
中文摘要:ReC4TS基准首次系统评估了零样本时间序列预测中的推理策略,发现自我一致性是最有效的测试时方法,并为该领域研究提供了开创性数据集与测试框架。
English Summary: The ReC4TS benchmark evaluates reasoning strategies for zero-shot time-series forecasting, revealing self-consistency as the most effective test-time method and providing key resources for future research.
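To illustrate the self-consistency finding above, a minimal test-time recipe for zero-shot forecasting is to draw several stochastic forecasts and aggregate them point-wise. The hypothetical `sample_forecast` callable and the median aggregation are assumptions, not the benchmark's exact protocol.

```python
import numpy as np

def self_consistent_forecast(sample_forecast, horizon: int, n_samples: int = 16) -> np.ndarray:
    """Draw several independent forecasts from a stochastic zero-shot forecaster and
    aggregate them point-wise (a self-consistency-style test-time strategy)."""
    samples = np.stack([sample_forecast(horizon) for _ in range(n_samples)])  # (n_samples, horizon)
    return np.median(samples, axis=0)  # aggregated, "consistent" forecast
```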
Authors:Luise Ge, Michael Lanier, Anindya Sarkar, Bengisu Guresti, Chongjie Zhang, Yevgeniy Vorobeychik
Abstract:
Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embedding as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions. The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin. Our code is available at https://github.com/CERL-WUSTL/PACMAN.
中文摘要:该研究提出了一种学习策略委员会的新方法,确保在执行任务时以高概率获得近似最优性能,既为低维任务提供理论保证又提供实用的梯度解决方案,实验表明其性能显著优于现有方法。
English Summary: The proposed approach learns a policy committee to ensure near-optimal performance with high probability for diverse tasks, offering both theoretical guarantees for low-dimensional tasks and a practical gradient-based solution, outperforming existing methods in experiments.
Authors:Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein
Abstract:
Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves them uncompressed when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
中文:SlimAdam是一种基于逐层信噪比分析选择性压缩二阶矩的内存优化版Adam优化器,在保持与Adam同等性能的同时,最高可减少98%的内存占用。
English: SlimAdam is a memory-efficient variant of Adam that selectively compresses second moments based on layer-wise SNR analysis, matching Adam's performance while reducing memory usage by up to 98%.
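A rough sketch of the layer-wise SNR test described above: if the squared mean of a second-moment tensor dominates its variance along a dimension, averaging over that dimension loses little information. The SNR definition and threshold below are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def snr_along(v: torch.Tensor, dim: int) -> float:
    """Signal-to-noise ratio of Adam's second-moment tensor `v` along `dim`:
    squared mean over variance, averaged over the remaining dimensions."""
    mean = v.mean(dim=dim, keepdim=True)
    var = v.var(dim=dim, keepdim=True)
    return (mean.pow(2) / (var + 1e-12)).mean().item()

def maybe_compress(v: torch.Tensor, dim: int, threshold: float = 1.0) -> torch.Tensor:
    """Keep only the per-slice mean along `dim` when the SNR is high; otherwise
    leave the full tensor untouched (compression would be detrimental)."""
    if snr_along(v, dim) > threshold:
        return v.mean(dim=dim, keepdim=True)  # stored compactly; broadcast back during the update
    return v
```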
Authors:Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su
Abstract:
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
中文:DEMO3框架通过将长时程机器人操作任务分解为子目标,并融合多阶段密集奖励学习、双阶段训练方案与世界模型学习,在稀疏奖励环境中实现了高达70%的数据效率提升。
English: The DEMO3 framework addresses the challenges of long-horizon robotic manipulation tasks by decomposing them into subgoals and integrating multi-stage dense reward learning, bi-phasic training, and world model learning, achieving up to 70% higher data efficiency in sparse-reward environments.
Authors:Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal
Abstract:
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g. which have large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate outliers (those with exceptionally large magnitude), (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
中文: RSQ通过旋转和缩放优先处理重要标记来优化模型量化,在多项任务和模型中表现优于基线方法,具备卓越的长上下文处理能力和广泛的适用性。
English: RSQ enhances model quantization by prioritizing important tokens through rotation and scaling, outperforming baselines across multiple tasks and models with superior long-context performance and broad generalizability.
Authors:Nicholas Carlini, Javier Rando, Edoardo Debenedetti, Milad Nasr, Florian Tramèr
Abstract:
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap in difficulty between attacking "real" code and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses only succeeds on 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
中文:AutoAdvExBench是一个旨在评估大型语言模型能否自主利用对抗性示例防御的基准,揭示了攻击简化CTF类挑战与现实世界防御之间的显著性能差距。
English: AutoAdvExBench is a benchmark designed to assess whether large language models can autonomously exploit adversarial example defenses, revealing a significant performance gap between attacking simplified CTF-like challenges and real-world defenses.
Authors:Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
Abstract:
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed, often by large margins, while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at https://github.com/neilwen987/CSR_Adaptive_Rep
Chinese: 对比稀疏表示(CSR)提出了一种稀疏编码方法,将预训练嵌入转换为高维稀疏特征,相比嵌套表示学习,能以更高精度、更快检索速度和大幅缩短的训练时间实现自适应表示。
English: Contrastive Sparse Representation (CSR) introduces a sparse coding method that transforms pre-trained embeddings into high-dimensional sparse features, enabling adaptive representation with superior accuracy, faster retrieval, and significantly reduced training time compared to Matryoshka Representation Learning.
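A minimal sketch of the sparsification step, assuming a top-k activated autoencoder on top of frozen embeddings; the dimensions, the value of k, and the omission of the task-aware contrastive objective are simplifications for illustration, not the paper's full method.

```python
import torch
import torch.nn as nn

class TopKSparseEncoder(nn.Module):
    """Sparsify a frozen embedding into a high-dimensional, selectively activated code."""

    def __init__(self, d_in: int = 768, d_sparse: int = 8192, k: int = 32):
        super().__init__()
        self.enc = nn.Linear(d_in, d_sparse)
        self.dec = nn.Linear(d_sparse, d_in)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.enc(x))
        # Keep only the k largest activations per embedding (the adjustable sparsity level).
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask
        recon = self.dec(z_sparse)      # reconstruction target for the autoencoding loss
        return z_sparse, recon
```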
Authors:Sam Bowyer, Laurence Aitchison, Desi R. Ivanova
Abstract:
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .
中文: 对大语言模型的严谨统计评估需要精确的不确定性估计,但当前基于中心极限定理的方法在小数据场景下常低估不确定性,因此建议采用替代的频率学派和贝叶斯方法。
English: Rigorous statistical evaluations of large language models require accurate uncertainty estimates, but current methods relying on the Central Limit Theorem often underestimate uncertainty in small-data scenarios, prompting the recommendation of alternative frequentist and Bayesian approaches.
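The small-benchmark failure mode is easy to reproduce numerically. The snippet below contrasts a CLT interval with a simple Beta-Bernoulli credible interval for an accuracy of 9/10; it is a hand-rolled illustration and does not use the authors' bayes_evals API.

```python
import numpy as np
from scipy import stats

def clt_interval(correct: int, n: int, alpha: float = 0.05):
    """Normal-approximation (CLT) error bar for an accuracy estimate."""
    p = correct / n
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def beta_interval(correct: int, n: int, alpha: float = 0.05):
    """Bayesian credible interval under a Beta(1, 1) prior; well-behaved for small n."""
    lo, hi = stats.beta.ppf([alpha / 2, 1 - alpha / 2], correct + 1, n - correct + 1)
    return lo, hi

print(clt_interval(9, 10))   # approx (0.714, 1.086): an error bar spilling above 1.0
print(beta_interval(9, 10))  # approx (0.587, 0.977): wider below, sensible above
```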
Authors:Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Rebecca Willett, Henry Hoffmann
Abstract:
Deep generative models have recently achieved significant success in modeling graph data, including dynamic graphs, where topology and features evolve over time. However, unlike in vision and natural language domains, evaluating generative models for dynamic graphs is challenging due to the difficulty of visualizing their output, making quantitative metrics essential. In this work, we develop a new quality metric for evaluating generative models of dynamic graphs. Current metrics for dynamic graphs typically involve discretizing the continuous evolution of graphs into static snapshots and then applying conventional graph similarity measures. This approach has several limitations: (a) it models temporally related events as i.i.d. samples, failing to capture the non-uniform evolution of dynamic graphs; (b) it lacks a unified measure that is sensitive to both features and topology; (c) it fails to provide a scalar metric, requiring multiple metrics without clear superiority; and (d) it requires explicitly instantiating each static snapshot, leading to impractical runtime demands that hinder evaluation at scale. We propose a novel metric based on the \textit{Johnson-Lindenstrauss} lemma, applying random projections directly to dynamic graph data. This results in an expressive, scalar, and application-agnostic measure of dynamic graph similarity that overcomes the limitations of traditional methods. We also provide a comprehensive empirical evaluation of metrics for continuous-time dynamic graphs, demonstrating the effectiveness of our approach compared to existing methods. Our implementation is available at https://github.com/ryienh/jl-metric.
中文摘要:本文提出了一种基于随机投影的新型质量评估指标,用于评估动态图的生成模型,通过提供统一、标量的度量方法克服了传统方法的局限性,无需离散化即可同时捕捉拓扑和特征的动态演化。
English Summary: This paper introduces a novel quality metric based on random projections to evaluate generative models for dynamic graphs, addressing limitations of current methods by providing a unified, scalar measure sensitive to both topology and features without requiring discretization.
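To make the random-projection idea concrete, the sketch below projects a stream of timestamped edge events into a fixed-dimensional signature and compares signatures by cosine similarity. The event encoding, mean pooling, and cosine comparison are illustrative assumptions, not the paper's metric (see the jl-metric repository for that).

```python
import numpy as np

def jl_signature(events: np.ndarray, dim_out: int = 128, seed: int = 0) -> np.ndarray:
    """Project a continuous-time dynamic graph, encoded as an (n_events, d) array of rows
    such as (src, dst, timestamp, features...), into a fixed-length signature."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(0.0, 1.0 / np.sqrt(dim_out), size=(events.shape[1], dim_out))
    return (events @ proj).mean(axis=0)  # pool projected events into one vector

def similarity(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Cosine similarity between two dynamic-graph signatures."""
    return float(np.dot(sig_a, sig_b) / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b) + 1e-12))
```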
Authors:Youngbin Choi, Seunghyuk Cho, Minjong Lee, MoonJeong Park, Yesong Ko, Jungseul Ok, Dongwoo Kim
Abstract:
Personalizing large language models (LLMs) is important for aligning outputs with diverse user preferences, yet existing methods struggle with flexibility and generalization. We propose CoPL (Collaborative Preference Learning), a graph-based collaborative filtering framework that models user-response relationships to enhance preference estimation, particularly in sparse annotation settings. By integrating a mixture of LoRA experts, CoPL efficiently fine-tunes LLMs while dynamically balancing shared and user-specific preferences. Additionally, an optimization-free adaptation strategy enables generalization to unseen users without fine-tuning. Experiments on UltraFeedback-P demonstrate that CoPL outperforms existing personalized reward models, effectively capturing both common and controversial preferences, making it a scalable solution for personalized LLM alignment. The code is available at https://github.com/ml-postech/CoPL.
中文: CoPL提出了一种基于图的协同过滤框架,通过建模用户-响应关系并利用LoRA专家动态平衡共享与个体偏好,在稀疏标注场景下显著提升了大型语言模型的个性化性能。
English: CoPL introduces a graph-based collaborative filtering framework that enhances LLM personalization by modeling user-response relationships and dynamically balancing shared and individual preferences through LoRA experts, achieving superior performance in sparse data scenarios.
Authors:Yingxue Xu, Fengtao Zhou, Chenyu Zhao, Yihui Wang, Can Yang, Hao Chen
Abstract:
The integration of multimodal data including pathology images and gene profiles is widely applied in precise survival prediction. Despite recent advances in multimodal survival models, collecting complete modalities for multimodal fusion still poses a significant challenge, hindering their application in clinical settings. Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) to utilize the strong robustness of Large Language Models (LLMs) to missing modalities, which employs two-stage prompting for compensation of comprehensive information for missing modalities. In the first stage, Unimodal Prompting (UniPro) distills the knowledge distribution of each modality, preparing for supplementing modality-specific knowledge of the missing modality in the subsequent stage. In the second stage, Multimodal Prompting (MultiPro) leverages available modalities as prompts for LLMs to infer the missing modality, which provides modality-common information. Simultaneously, the unimodal knowledge acquired in the first stage is injected into multimodal inference to compensate for the modality-specific knowledge of the missing modality. Extensive experiments covering various missing scenarios demonstrated the superiority of the proposed method. The code is available at https://github.com/Innse/DisPro.
中文摘要:提出的蒸馏提示学习框架(DisPro)通过两阶段提示方法,利用大型语言模型的强大鲁棒性,有效补偿缺失模态的特定信息和通用信息,解决了生存预测中多模态数据不完整的难题。
English Summary: The proposed Distilled Prompt Learning framework (DisPro) effectively addresses incomplete multimodal data in survival prediction by leveraging Large Language Models through two-stage prompting to compensate for both modality-specific and modality-common information of missing modalities.
Authors:Zhanghao Hu, Hanqi Yan, Qinglin Zhu, Zhenyi Shen, Yulan He, Lin Gui
Abstract:
Large language models have recently pushed open domain question answering (ODQA) to new frontiers. However, prevailing retriever-reader pipelines often depend on multiple rounds of prompt-level instructions, leading to high computational overhead, instability, and suboptimal retrieval coverage. In this paper, we propose EmbQA, an embedding-level framework that alleviates these shortcomings by enhancing both the retriever and the reader. Specifically, we refine query representations via lightweight linear layers under an unsupervised contrastive learning objective, thereby reordering retrieved passages to highlight those most likely to contain correct answers. Additionally, we introduce an exploratory embedding that broadens the model's latent semantic space to diversify candidate generation and employ an entropy-based selection mechanism to choose the most confident answer automatically. Extensive experiments across three open-source LLMs, three retrieval methods, and four ODQA benchmarks demonstrate that EmbQA substantially outperforms recent baselines in both accuracy and efficiency.
中文:EmbQA框架通过优化查询表示和扩展候选答案多样性,显著提升了开放域问答的准确性和效率,超越了现有方法。
English: The EmbQA framework enhances open domain question answering by improving query representations and diversifying candidate generation, significantly boosting both accuracy and efficiency over existing methods.
Authors:Ramkrishna Acharya
Abstract:
This study reviews popular stochastic gradient-based schemes on large least-squares problems. These schemes, often called optimizers in machine learning, play a crucial role in finding better model parameters. Hence, this study examines such optimizers under different hyper-parameters and analyzes their behavior on least-squares problems. The code that produced the results in this work is available at https://github.com/q-viper/gradients-based-methods-on-large-least-square.
中文: 本研究针对大规模最小二乘问题,分析了不同超参数下的随机梯度优化器,并在GitHub上提供了相关代码。
English: This study analyzes various stochastic gradient-based optimizers with different hyperparameters for large least-square problems, providing code for the results on GitHub.
Authors:Ramkrishna Acharya
Abstract:
In this paper, we propose a novel approach for air drawing that uses image processing techniques to draw on the screen by moving fingers in the air. This approach benefits a wide range of applications such as sign language, in-air drawing, and 'writing' in the air as a new way of input. The approach starts by preparing an ROI (Region of Interest) background image as a running average over the initial camera frames and later subtracting it from live camera frames to obtain a binary mask image. We calculate the pointer's position as the topmost point of the contour in the binary image; drawing a circle on the canvas at that position simulates the pen stroke. Furthermore, we incorporate the pre-trained Tesseract model for OCR. To suppress false contours, we perform Haar cascade-based hand detection before the background subtraction. In an experimental setup, we achieved a latency of only 100ms in air drawing. The code used in this research is available on GitHub at https://github.com/q-viper/Contour-Based-Writing
中文: 本文提出一种基于图像处理和背景减除的空中绘图方法,通过追踪手指运动实现手语识别和空中书写等功能,实验延迟仅100毫秒且代码已开源。
English: This paper introduces an air drawing method using image processing and background subtraction to track finger movements for applications like sign language and in-air writing, achieving 100ms latency with code available on GitHub.
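The described pipeline (running-average background, subtraction, thresholding, topmost contour point as the pointer) maps naturally onto OpenCV primitives. The sketch below follows that outline but simplifies it: the ROI coordinates, thresholds, and number of calibration frames are assumed values, and the Haar cascade hand detection and Tesseract OCR steps are omitted.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)
bg = None                      # running-average background of the ROI
canvas = None
frame_count = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_count += 1
    roi = frame[50:350, 300:600]                        # region of interest (assumed coordinates)
    gray = cv2.GaussianBlur(cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY), (7, 7), 0)
    if canvas is None:
        canvas = np.zeros_like(roi)
    if bg is None:
        bg = gray.astype("float")
        continue
    if frame_count < 30:                                # calibration: accumulate the background
        cv2.accumulateWeighted(gray, bg, 0.5)
        continue
    diff = cv2.absdiff(gray, cv2.convertScaleAbs(bg))   # background subtraction
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = max(contours, key=cv2.contourArea)
        top = hand[hand[:, :, 1].argmin()][0]           # topmost contour point = pointer tip
        cv2.circle(canvas, (int(top[0]), int(top[1])), 4, (255, 255, 255), -1)  # "draw" here
    cv2.imshow("air drawing", cv2.add(roi, canvas))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```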
Authors:Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
Abstract:
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93\% of the Transformer-based LLM's performance using only 0.02\% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
中文摘要:Liger是一种创新方法,可将预训练大语言模型转化为无需添加参数的线性门控循环结构,通过轻量微调保持模型性能,并在多项基准测试中取得优异表现。
English Summary: Liger is a novel method that converts pretrained LLMs into efficient gated linear recurrent models without adding parameters, using lightweight fine-tuning to maintain performance while enabling competitive results across benchmarks.
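To give a flavor of what repurposing the key projection as a gate can look like, here is a rough gated linear-recurrence sketch in PyTorch. The sigmoid gate over the key projection, the outer-product state, and the sequential loop are simplifying assumptions for exposition; this is neither Liger's actual construction nor an efficient implementation.

```python
import torch
import torch.nn as nn

class KeyGatedRecurrence(nn.Module):
    """Gated linear recurrence whose forget gate reuses a (frozen) pretrained key projection."""

    def __init__(self, k_proj: nn.Linear, v_proj: nn.Linear, q_proj: nn.Linear):
        super().__init__()
        # Projections are assumed to come frozen from the pretrained attention block.
        self.k_proj, self.v_proj, self.q_proj = k_proj, v_proj, q_proj

    def forward(self, x):                                  # x: (batch, seq, d)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        gate = torch.sigmoid(k)                            # gate built from the pretrained key path
        state = torch.zeros(x.size(0), x.size(-1), x.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):                         # linear recurrence: constant state size
            state = gate[:, t].unsqueeze(-1) * state + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
            outs.append(torch.einsum("bd,bde->be", q[:, t], state))
        return torch.stack(outs, dim=1)
```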
Authors:Huifeng Yin, Yu Zhao, Minghao Wu, Xuanfan Ni, Bo Zeng, Hao Wang, Tianqi Shi, Liangying Shao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Abstract:
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation--post-training on LRM-generated data--is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we found that distilled long CoT data poses learning difficulties for small models and leads to the inheritance of biases (i.e., over-thinking) when using Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the constructed data. We conduct evaluations on various benchmarks such as math (GSM8K, MATH, AIME), instruction-following (Multi-IF), and planning (Blocksworld); the results demonstrate that our approaches substantially improve the reasoning performance of distilled models compared to standard distilled models by reducing hallucinations in long-time thinking. The project homepage is https://github.com/AIDC-AI/Marco-o1.
中文: 从大型推理模型蒸馏长思维链数据会导致小模型学习困难和偏见继承,通过蒙特卡洛树搜索构建树状推理数据并采用思维链感知训练方法,可显著提升蒸馏模型的推理性能。
English: Distilling long chain-of-thought data from large reasoning models causes learning difficulties and bias inheritance in smaller models, which is mitigated by constructing tree-based reasoning data via Monte Carlo Tree Search and implementing CoT-aware training techniques to significantly enhance reasoning performance.
Authors:Max Eissler, Tim Korjakow, Stefan Ganscha, Oliver T. Unke, Klaus-Robert Müller, Stefan Gugler
Abstract:
Most current neural networks for molecular dynamics (MD) include physical inductive biases, resulting in specialized and complex architectures. This is in contrast to most other machine learning domains, where specialist approaches are increasingly replaced by general-purpose architectures trained on vast datasets. In line with this trend, several recent studies have questioned the necessity of architectural features commonly found in MD models, such as built-in rotational equivariance or energy conservation. In this work, we contribute to the ongoing discussion by evaluating the performance of an MD model with as few specialized architectural features as possible. We present a recipe for MD using an Edge Transformer, an "off-the-shelf" transformer architecture that has been minimally modified for the MD domain, termed MD-ET. Our model implements neither built-in equivariance nor energy conservation. We use a simple supervised pre-training scheme on $\sim$30 million molecular structures from the QCML database. Using this "off-the-shelf" approach, we show state-of-the-art results on several benchmarks after fine-tuning for a small number of steps. Additionally, we examine the effects of being only approximately equivariant and energy conserving for MD simulations, proposing a novel method for distinguishing the errors resulting from non-equivariance from other sources of inaccuracies like numerical rounding errors. While our model exhibits runaway energy increases on larger structures, we show approximately energy-conserving NVE simulations for a range of small structures.
中文摘要:本研究提出MD-ET模型,这是一种仅对通用Transformer进行最小修改的分子动力学方法,在不内置旋转等变性或能量守恒等物理约束的情况下实现了最先进的性能,同时分析了近似等变性对模拟效果的影响。
English Summary: This study introduces MD-ET, a minimally modified transformer model for molecular dynamics that achieves state-of-the-art results without built-in physical constraints like rotational equivariance or energy conservation, while analyzing the effects of approximate equivariance in simulations.
Authors:Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li
Abstract:
Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. With an empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments prove that the per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to TP, offering up to a 19\% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
中文: 本文提出一种用于流水线并行的内存卸载策略,能以可忽略的开销大幅减少激活内存,相比张量并行实现了更好的可扩展性和高达19%的加速效果。
English: This paper introduces a memory offload strategy for pipeline parallelism that significantly reduces activation memory with minimal overhead, enabling better scalability and up to 19% acceleration compared to tensor parallelism.
Authors:Yogesh Verma, Ayush Bharti, Vikas Garg
Abstract:
Simulation-based inference (SBI) methods typically require fully observed data to infer parameters of models with intractable likelihood functions. However, datasets often contain missing values due to incomplete observations, data corruptions (common in astrophysics), or instrument limitations (e.g., in high-energy physics applications). In such scenarios, missing data must be imputed before applying any SBI method. We formalize the problem of missing data in SBI and demonstrate that naive imputation methods can introduce bias in the estimation of SBI posterior. We also introduce a novel amortized method that addresses this issue by jointly learning the imputation model and the inference network within a neural posterior estimation (NPE) framework. Extensive empirical results on SBI benchmarks show that our approach provides robust inference outcomes compared to standard baselines for varying levels of missing data. Moreover, we demonstrate the merits of our imputation model on two real-world bioactivity datasets (Adrenergic and Kinase assays). Code is available at https://github.com/Aalto-QuML/RISE.
中文: 模拟推理方法通常需要完整数据,但缺失值若简单填补会引入偏差,为此我们提出了一种新型摊销方法,在神经框架内联合学习填补与推理,确保参数估计的稳健性。
English: Simulation-based inference methods often require complete data, but missing values can introduce bias when naively imputed, prompting the development of a novel amortized approach that jointly learns imputation and inference within a neural framework to ensure robust parameter estimation.
Authors:Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa
Abstract:
Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM must implicitly select and apply the appropriate strategies, because prompt design strategies can also have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS.
中文摘要:OPTS通过引入显式策略选择机制优化大语言模型的提示设计,其中基于汤普森采样的方法在提升EvoPrompt性能方面表现最佳。
English Summary: OPTS introduces explicit strategy selection mechanisms to optimize prompts for large language models, with a Thompson sampling-based approach showing the best performance in enhancing EvoPrompt's effectiveness.
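The Thompson sampling-based selection can be sketched as a Beta-Bernoulli bandit over candidate design strategies. The strategy names and the binary "did the mutated prompt improve" reward below are illustrative assumptions, not the OPTS implementation.

```python
import random

class ThompsonStrategySelector:
    """Thompson sampling over prompt design strategies with a Beta-Bernoulli reward model."""

    def __init__(self, strategies):
        self.strategies = list(strategies)
        self.alpha = {s: 1.0 for s in self.strategies}  # Beta prior: successes + 1
        self.beta = {s: 1.0 for s in self.strategies}   # Beta prior: failures + 1

    def select(self):
        # Sample a plausible success rate for each strategy and pick the best draw.
        draws = {s: random.betavariate(self.alpha[s], self.beta[s]) for s in self.strategies}
        return max(draws, key=draws.get)

    def update(self, strategy, improved: bool):
        if improved:
            self.alpha[strategy] += 1.0
        else:
            self.beta[strategy] += 1.0

selector = ThompsonStrategySelector(["add_persona", "step_by_step", "use_examples"])
s = selector.select()                 # choose a design strategy for this prompt mutation
selector.update(s, improved=True)     # reward it if the mutated prompt scored higher
```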
Authors:Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Abstract:
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512x512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256x256.
Chinese: 本文提出直接判别优化(DDO)框架,将基于似然的生成训练与GAN式判别相结合,突破最大似然估计的限制,在无需引导机制的情况下于多个数据集上实现了最先进的生成性能。
English: This paper introduces Direct Discriminative Optimization (DDO), a unified framework that combines likelihood-based generative training with GAN-type discrimination to overcome the limitations of maximum likelihood estimation, achieving state-of-the-art performance across multiple datasets without guidance mechanisms.
Authors:Jacob Beck
Abstract:
While internet-scale image and textual data have enabled strong generalization in Vision-Language Models (VLMs), the absence of internet-scale control data has impeded the development of similar generalization in standard reinforcement learning (RL) agents. Although VLMs are fundamentally limited in their ability to solve control tasks due to their lack of action-conditioned training data, their capacity for image understanding allows them to provide valuable feedback in RL tasks by recognizing successful outcomes. A key challenge in Reinforcement Learning from AI Feedback (RLAIF) is determining how best to integrate VLM-derived signals into the learning process. We explore this question in the context of offline RL and introduce a class of methods called Sub-Trajectory Filtered Optimization (SFO). We identify three key insights. First, trajectory length plays a crucial role in offline RL, as full-trajectory preference learning exacerbates the stitching problem, necessitating the use of sub-trajectories. Second, even in Markovian environments, a non-Markovian reward signal from a sequence of images is required to assess trajectory improvement, as VLMs do not interpret control actions and must rely on visual cues over time. Third, a simple yet effective approach--filtered and weighted behavior cloning--consistently outperforms more complex RLHF-based methods. We propose Sub-Trajectory Filtered Behavior Cloning (SFBC), a method that leverages VLM feedback on sub-trajectories while incorporating a retrospective filtering mechanism that removes sub-trajectories preceding failures to improve robustness and prevent turbulence. Please enjoy our airport puns.
中文: 视觉语言模型通过为子轨迹提供非马尔可夫反馈来增强强化学习,由此提出的子轨迹过滤行为克隆方法通过过滤行为克隆,优于更复杂的方法。
English: Vision-Language Models enhance reinforcement learning by providing non-Markovian feedback on sub-trajectories, leading to Sub-Trajectory Filtered Behavior Cloning, a simple filtered and weighted behavior cloning approach that outperforms more complex RLHF-based methods.
Authors:Xingzhuo Guo, Yu Zhang, Baixu Chen, Haoran Xu, Jianmin Wang, Mingsheng Long
Abstract:
Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies. Code is available at this repository: https://github.com/thuml/dynamical-diffusion.
中文: Dynamical Diffusion (DyDiff) 提出了一种时间感知框架,通过显式建模时间动态增强扩散模型,从而在多种预测任务中提升性能。
English: Dynamical Diffusion (DyDiff) introduces a temporally aware framework that enhances diffusion models by explicitly modeling temporal dynamics, leading to improved performance in various predictive tasks.
Authors:Jing Peng, Meiqi Yang, Qiong Zhang, Xiaoxiao Li
Abstract:
Multivariate time series data play a pivotal role in a wide range of real-world applications. However, the presence of block missing data introduces significant challenges, often compromising the performance of predictive models. Traditional two-step approaches, which first impute missing values and then perform forecasting, are prone to error accumulation, particularly in complex multivariate settings characterized by high missing ratios and intricate dependency structures. In this work, we introduce S4M, an end-to-end time series forecasting framework that seamlessly integrates missing data handling into the Structured State Space Sequence (S4) model architecture. Unlike conventional methods that treat imputation as a separate preprocessing step, S4M leverages the latent space of S4 models to directly recognize and represent missing data patterns, thereby more effectively capturing the underlying temporal and multivariate dependencies. Our framework comprises two key components: the Adaptive Temporal Prototype Mapper (ATPM) and the Missing-Aware Dual Stream S4 (MDS-S4). The ATPM employs a prototype bank to derive robust and informative representations from historical data patterns, while the MDS-S4 processes these representations alongside missingness masks as dual input streams to enable accurate forecasting. Through extensive empirical evaluations on diverse real-world datasets, we demonstrate that S4M consistently achieves state-of-the-art performance. These results underscore the efficacy of our integrated approach in handling missing data, showcasing its robustness and superiority over traditional imputation-based methods. Our findings highlight the potential of S4M to advance reliable time series forecasting in practical applications, offering a promising direction for future research and deployment. Code is available at https://github.com/WINTERWEEL/S4M.git.
中文: S4M框架将缺失数据处理直接整合到结构化状态空间序列模型中,通过自适应时间映射和双流处理,无需独立插补步骤即可实现最先进的预测性能。
English: The S4M framework integrates missing data handling directly into the Structured State Space Sequence model, using adaptive temporal mapping and dual-stream processing to achieve state-of-the-art forecasting performance without separate imputation steps.
Authors:Qia Hu, Bo Jiao
Abstract:
Graph sampling-based Graph Convolutional Networks (GCNs) decouple sampling from forward and backward propagation during minibatch training, enhancing scalability with respect to layer depth and graph size. We propose HIS_GCNs, a hierarchical importance sampling-based learning method. By constructing minibatches using sampled subgraphs, HIS_GCNs focuses on the importance of both the core and periphery in a scale-free training graph. Specifically, it preserves the centrum of the core in most minibatches, which maintains connectivity between periphery nodes, and samples periphery edges without core node interference, which allows longer chains composed entirely of low-degree nodes to remain within the same minibatch. HIS_GCNs can maximize the discrete Ricci curvature (i.e., Ollivier-Ricci curvatures) of the edges in a subgraph, enabling preservation of important chains for information propagation. This approach can achieve a low node embedding variance and a high convergence speed. Diverse experiments on Graph Neural Networks (GNNs) with node classification tasks confirmed the superior performance of HIS_GCNs in terms of both accuracy and training time. Open-source code is available at https://github.com/HuQiaCHN/HIS-GCN.
中文摘要:HIS_GCNs提出了一种基于层次重要性采样的图卷积网络方法,通过保持无标度图中核心-边缘结构来构建小批量训练样本,从而优化信息传播路径,实现了更高精度和更快收敛速度。
English Summary: HIS_GCNs introduces a hierarchical importance sampling method that constructs minibatches by preserving core-periphery structures in scale-free graphs, achieving higher accuracy and faster convergence through optimized information propagation paths.
Authors:Yalun Dai, Lingao Xiao, Ivor W. Tsang, Yang He
Abstract:
Existing dataset pruning techniques primarily focus on classification tasks, limiting their applicability to more complex and practical tasks like instance segmentation. Instance segmentation presents three key challenges: pixel-level annotations, instance area variations, and class imbalances, which significantly complicate dataset pruning efforts. Directly adapting existing classification-based pruning methods proves ineffective due to their reliance on a time-consuming model training process. To address this, we propose a novel Training-Free Dataset Pruning (TFDP) method for instance segmentation. Specifically, we leverage shape and class information from image annotations to design a Shape Complexity Score (SCS), refining it into Scale-Invariant (SI-SCS) and Class-Balanced (CB-SCS) versions to address instance area variations and class imbalances, all without requiring model training. We achieve state-of-the-art results on VOC 2012, Cityscapes, and COCO datasets, generalizing well across CNN and Transformer architectures. Remarkably, our approach accelerates the pruning process by an average of 1349$\times$ on COCO compared to the adapted baselines. Source code is available at: https://github.com/he-y/dataset-pruning-for-instance-segmentation
中文: 本文提出了一种无需训练的实例分割数据集剪枝方法(TFDP),利用标注中的形状和类别信息设计复杂度评分,无需模型训练即可在多个数据集上取得最优性能,并将剪枝速度提升了1300倍以上。
English: This paper introduces a Training-Free Dataset Pruning (TFDP) method for instance segmentation that uses shape and class information from annotations to create complexity scores, achieving state-of-the-art results and accelerating pruning by over 1300 times without model training.
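A training-free score in this spirit can be computed directly from polygon annotations. The perimeter-over-sqrt(area) proxy and the keep-ratio selection below are illustrative assumptions that only roughly mimic a scale-invariant complexity score; they are not the paper's SCS, SI-SCS, or CB-SCS definitions.

```python
import numpy as np

def shape_complexity(polygon: np.ndarray) -> float:
    """Scale-invariant complexity proxy for one instance polygon of shape (N, 2):
    boundary length normalized by the square root of the enclosed area."""
    closed = np.vstack([polygon, polygon[:1]])
    perimeter = np.sqrt((np.diff(closed, axis=0) ** 2).sum(axis=1)).sum()
    x, y = polygon[:, 0], polygon[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))  # shoelace formula
    return perimeter / (np.sqrt(area) + 1e-6)

def prune_dataset(images, keep_ratio: float = 0.5):
    """Rank images by mean instance complexity and keep the top fraction.
    (Per-class balancing, as in CB-SCS, is omitted for brevity.)"""
    scored = []
    for img in images:  # img: {"polygons": [np.ndarray, ...], ...}
        polys = img["polygons"]
        score = float(np.mean([shape_complexity(p) for p in polys])) if polys else 0.0
        scored.append((score, img))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [img for _, img in scored[: int(len(scored) * keep_ratio)]]
```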
Authors:Jayden Teoh, Pradeep Varakantham, Peter Vamplew
Abstract:
Real-world sequential decision-making tasks often require balancing trade-offs between multiple conflicting objectives, making Multi-Objective Reinforcement Learning (MORL) an increasingly prominent field of research. Despite recent advances, existing MORL literature has narrowly focused on performance within static environments, neglecting the importance of generalizing across diverse settings. Conversely, existing research on generalization in RL has always assumed scalar rewards, overlooking the inherent multi-objectivity of real-world problems. Generalization in the multi-objective context is fundamentally more challenging, as it requires learning a Pareto set of policies addressing varying preferences across multiple objectives. In this paper, we formalize the concept of generalization in MORL and how it can be evaluated. We then contribute a novel benchmark featuring diverse multi-objective domains with parameterized environment configurations to facilitate future studies in this area. Our baseline evaluations of state-of-the-art MORL algorithms on this benchmark reveal limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyzed the algorithmic complexities within current MORL approaches that could impede the transfer in performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments. We make our code available at https://github.com/JaydenTeoh/MORL-Generalization.
Chinese: 本文提出了多目标强化学习的新基准,旨在解决跨环境泛化的关键问题,揭示了现有算法的局限性,并强调了多目标规范对于实现有效泛化的重要性。
English: This paper introduces a novel benchmark for Multi-Objective Reinforcement Learning (MORL) to address the critical gap in generalization across diverse environments, revealing limited capabilities of current algorithms and emphasizing the need for multi-objective specifications.
Authors:Elahe Delavari, Aws Khalil, Jaerock Kwon
Abstract:
End-to-end vision-based imitation learning has demonstrated promising results in autonomous driving by learning control commands directly from expert demonstrations. However, traditional approaches rely on either regression-based models, which provide precise control but lack confidence estimation, or classification-based models, which offer confidence scores but suffer from reduced precision due to discretization. This limitation makes it challenging to quantify the reliability of predicted actions and apply corrections when necessary. In this work, we introduce a dual-head neural network architecture that integrates both regression and classification heads to improve decision reliability in imitation learning. The regression head predicts continuous driving actions, while the classification head estimates confidence, enabling a correction mechanism that adjusts actions in low-confidence scenarios, enhancing driving stability. We evaluate our approach in a closed-loop setting within the CARLA simulator, demonstrating its ability to detect uncertain actions, estimate confidence, and apply real-time corrections. Experimental results show that our method reduces lane deviation and improves trajectory accuracy by up to 50%, outperforming conventional regression-only models. These findings highlight the potential of classification-guided confidence estimation in enhancing the robustness of vision-based imitation learning for autonomous driving. The source code is available at https://github.com/ElaheDlv/Confidence_Aware_IL.
中文摘要:本研究提出一种结合回归与分类的双头神经网络,通过实时置信度估计和动作校正机制提升自动驾驶可靠性,在CARLA模拟器中显著减少车道偏离并提高轨迹精度达50%。
English Summary: This study introduces a dual-head neural network combining regression and classification to enhance autonomous driving reliability by enabling real-time confidence estimation and action correction, significantly reducing lane deviation and improving trajectory accuracy in CARLA simulations.
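The dual-head idea can be sketched as a shared backbone with a continuous regression head and a discretized classification head whose softmax confidence gates a correction. The backbone, the number of bins, and the 0.5 confidence threshold are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class DualHeadPolicy(nn.Module):
    """Dual-head imitation policy: regression for continuous steering,
    classification over steering bins for a confidence score and a fallback action."""

    def __init__(self, feat_dim: int = 512, n_bins: int = 15):
        super().__init__()
        # Crude image backbone for illustration; a CNN would be used in practice.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.reg_head = nn.Linear(feat_dim, 1)          # continuous steering command
        self.cls_head = nn.Linear(feat_dim, n_bins)     # discretized steering bins
        self.bins = torch.linspace(-1.0, 1.0, n_bins)

    def forward(self, img):
        h = self.backbone(img)
        steer = torch.tanh(self.reg_head(h)).squeeze(-1)
        probs = torch.softmax(self.cls_head(h), dim=-1)
        conf, best_bin = probs.max(dim=-1)
        # Low-confidence correction: fall back to the classification head's bin center.
        corrected = torch.where(conf < 0.5, self.bins.to(img.device)[best_bin], steer)
        return corrected, conf
```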
Authors:Ziwei Huang, Jianan Zhou, Zhiguang Cao, Yixin Xu
Abstract:
Light decoder-based solvers have gained popularity for solving vehicle routing problems (VRPs) due to their efficiency and ease of integration with reinforcement learning algorithms. However, they often struggle with generalization to larger problem instances or different VRP variants. This paper revisits light decoder-based approaches, analyzing the implications of their reliance on static embeddings and the inherent challenges that arise. Specifically, we demonstrate that in the light decoder paradigm, the encoder is implicitly tasked with capturing information for all potential decision scenarios during solution construction within a single set of embeddings, resulting in high information density. Furthermore, our empirical analysis reveals that the overly simplistic decoder struggles to effectively utilize this dense information, particularly as task complexity increases, which limits generalization to out-of-distribution (OOD) settings. Building on these insights, we show that enhancing the decoder capacity, with a simple addition of identity mapping and a feed-forward layer, can considerably alleviate the generalization issue. Experimentally, our method significantly enhances the OOD generalization of light decoder-based approaches on large-scale instances and complex VRP variants, narrowing the gap with the heavy decoder paradigm. Our code is available at: https://github.com/ziweileonhuang/reld-nco.
Chinese: 轻量解码器求解器在车辆路径问题中因静态嵌入信息密度过高和解码器设计过于简化而面临泛化挑战,但通过增加恒等映射和前馈层提升解码器能力,可显著改善分布外泛化性能。
English: Light decoder-based solvers for vehicle routing problems face generalization issues due to high information density in static embeddings and simplistic decoder design, but enhancing decoder capacity with identity mapping and a feed-forward layer significantly improves out-of-distribution generalization.
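The enhancement described above, an identity mapping plus a feed-forward layer in the light decoder, can be sketched as follows; the pointer-style compatibility scoring and the dimensions are generic assumptions rather than the exact model.

```python
import torch
import torch.nn as nn

class EnhancedLightDecoder(nn.Module):
    """Light VRP decoder with an identity (residual) mapping and a feed-forward block
    applied to the query/context embedding before compatibility scoring over nodes."""

    def __init__(self, d_model: int = 128, d_ff: int = 512):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, query, node_embeddings, mask):
        # query: (batch, d_model), node_embeddings: (batch, n_nodes, d_model), mask: (batch, n_nodes) bool
        q = query + self.ff(query)                                  # identity mapping + feed-forward refinement
        logits = torch.bmm(node_embeddings, q.unsqueeze(-1)).squeeze(-1) / (q.size(-1) ** 0.5)
        logits = logits.masked_fill(mask, float("-inf"))            # mask already-visited nodes
        return torch.log_softmax(logits, dim=-1)                    # log-probabilities over the next node
```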
Authors:Xingbo Fu, Yinhan He, Jundong Li
Abstract:
Pre-training powerful Graph Neural Networks (GNNs) with unlabeled graph data in a self-supervised manner has emerged as a prominent technique in recent years. However, inevitable objective gaps often exist between pre-training and downstream tasks. To bridge this gap, graph prompt tuning techniques design and learn graph prompts by manipulating input graphs or reframing downstream tasks as pre-training tasks without fine-tuning the pre-trained GNN models. While recent graph prompt tuning methods have proven effective in adapting pre-trained GNN models for downstream tasks, they overlook the crucial role of edges in graph prompt design, which can significantly affect the quality of graph representations for downstream tasks. In this study, we propose EdgePrompt, a simple yet effective graph prompt tuning method from the perspective of edges. Unlike previous studies that design prompt vectors on node features, EdgePrompt manipulates input graphs by learning additional prompt vectors for edges and incorporates the edge prompts through message passing in the pre-trained GNN models to better embed graph structural information for downstream tasks. Our method is compatible with prevalent GNN architectures pre-trained under various pre-training strategies and is universal for different downstream tasks. We provide comprehensive theoretical analyses of our method regarding its capability of handling node classification and graph classification as downstream tasks. Extensive experiments on ten graph datasets under four pre-training strategies demonstrate the superiority of our proposed method against six baselines. Our code is available at https://github.com/xbfu/EdgePrompt.
中文: 本文提出EdgePrompt方法,通过为边学习提示向量并借助消息传递机制融入预训练图神经网络,有效提升了图结构信息在下游任务中的表示质量,在多种预训练策略和数据集上展现出卓越性能。
English: This paper introduces EdgePrompt, an innovative graph prompt tuning method that enhances downstream task performance by learning edge-specific prompt vectors and integrating them through message passing in pre-trained GNN models, demonstrating superior results across diverse datasets and pre-training strategies.
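A minimal sketch of edge-level prompting, assuming a single shared trainable prompt vector added to every message while the pretrained layer stays frozen; the paper's actual design is more elaborate, but the sketch shows where edge prompts enter the message-passing step.

```python
import torch
import torch.nn as nn

class EdgePromptLayer(nn.Module):
    """Frozen GNN-style layer whose only trainable parameters are an edge prompt vector."""

    def __init__(self, frozen_linear: nn.Linear, dim: int):
        super().__init__()
        self.lin = frozen_linear
        for p in self.lin.parameters():
            p.requires_grad = False                          # pretrained weights stay frozen
        self.edge_prompt = nn.Parameter(torch.zeros(dim))    # the only trainable parameters

    def forward(self, x, edge_index):
        # x: (num_nodes, dim), edge_index: (2, num_edges) with rows (source, destination)
        src, dst = edge_index
        messages = x[src] + self.edge_prompt                 # inject the edge prompt into every message
        out = torch.zeros_like(x).index_add_(0, dst, messages)
        deg = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, torch.ones(dst.numel(), device=x.device))
        out = out / deg.clamp(min=1).unsqueeze(-1)            # mean aggregation over incoming messages
        return torch.relu(self.lin(out))
```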
Authors:Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla
Abstract:
Mixture-of-Experts (MoE) models have become a key approach for scaling large language models efficiently by activating only a subset of experts during training and inference. Typically, the number of activated experts presents a trade-off: fewer experts reduce computational costs, while more experts improve performance. Recent studies reveal that not all activated experts contribute equally to model performance, with some providing minimal utility, particularly when finetuning pretrained MoE models for specialized downstream tasks. The co-existence of significant and redundant parameters in experts provides us an opportunity to reduce the number of activated experts while maintaining model performance. In this work, we propose the concept of compressed experts, lightweight modules that serve as compact representations of full experts. Our approach preserves the most important experts while replacing other auxiliary activated experts with compressed experts. The reduction of active parameters significantly lowers inference costs while achieving comparable performance. Extensive experiments on models including Phi-MoE and OLMoE demonstrate that compressed experts recover over 90% of full expert performance across various tasks while reducing more than 30% active parameters and saving 20% in inference costs. This approach enables efficient deployment of MoE models in resource-constrained settings and facilitates scaling to larger models with manageable overhead. Our code is available at https://github.com/yifei-he/Compressed-Experts.
中文: 本研究提出压缩专家概念,通过轻量级模块替换混合专家模型中的辅助专家,在保持90%以上任务性能的同时,显著减少激活参数并降低20%推理成本。
English: This study introduces compressed experts, lightweight modules that replace auxiliary experts in Mixture-of-Experts models to significantly reduce active parameters and inference costs while maintaining over 90% performance across tasks.
Authors:Nicky Kriplani, Minh Pham, Gowthami Somepalli, Chinmay Hegde, Niv Cohen
Abstract:
Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not.
This paper begins with a comprehensive exploration of issues surrounding memorization metrics in diffusion models. Then, to mitigate these issues, we introduce SolidMark, a novel evaluation method that provides a per-image memorization score. We then re-evaluate existing memorization mitigation techniques. We also show that SolidMark is capable of evaluating fine-grained pixel-level memorization. Finally, we release a variety of models based on SolidMark to facilitate further research for understanding memorization phenomena in generative models. All of our code is available at https://github.com/NickyDCFP/SolidMark.
Chinese: 本文提出SolidMark这一新型评估方法,通过提供逐图像及像素级记忆分数解决扩散模型现有记忆度量偏差问题,重新评估缓解技术并发布模型以推动生成模型记忆机制研究。
English: This paper introduces SolidMark, a novel evaluation method that addresses biases in existing memorization metrics for diffusion models by providing per-image and pixel-level memorization scores, and re-evaluates mitigation techniques while releasing models to advance research.
Authors:Tuğrul Hasan Karabulut, İnci M. Baytaş
Abstract:
Graph Neural Networks (GNNs) set the state-of-the-art in representation learning for graph-structured data. They are used in many domains, from online social networks to complex molecules. Most GNNs leverage the message-passing paradigm and achieve strong performance on various tasks. However, the message-passing mechanism used in most models suffers from over-smoothing as a GNN's depth increases. Over-smoothing degrades a GNN's performance due to the increased similarity between the representations of unrelated nodes. This study proposes an adaptive channel-wise message-passing approach to alleviate the over-smoothing. The proposed model, Channel-Attentive GNN, learns how to attend to neighboring nodes and their feature channels. Thus, more diverse information can be transferred between nodes during message-passing. Experiments with widely used benchmark datasets show that the proposed model is more resistant to over-smoothing than baselines and achieves state-of-the-art performance on various graphs with strong heterophily. Our code is at https://github.com/ALLab-Boun/CHAT-GNN.
Chinese: 本研究提出了一种通道注意力图神经网络,通过在消息传递过程中自适应关注相邻节点和特征通道,有效缓解了过平滑问题,并在异质图数据上实现了最优性能。
English: This study introduces a Channel-Attentive Graph Neural Network that adaptively attends to neighboring nodes and feature channels during message-passing, effectively mitigating over-smoothing and achieving state-of-the-art performance on heterophilic graphs.
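As a rough illustration of channel-wise attentive message passing (a simplified stand-in, not the CHAT-GNN implementation), the sketch below gates each neighbor message per feature channel rather than with a single scalar attention weight; the layer names and sizes are made up.

```python
import torch
import torch.nn as nn

class ChannelAttentiveLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)     # one gate value per feature channel
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x, edge_index):
        # x: (num_nodes, dim); edge_index: (2, num_edges) with rows (src, dst)
        src, dst = edge_index
        # Channel-wise attention in [0, 1], computed from both endpoints of each edge.
        alpha = torch.sigmoid(self.gate(torch.cat([x[dst], x[src]], dim=-1)))
        messages = alpha * x[src]                              # gated per channel
        agg = torch.zeros_like(x).index_add_(0, dst, messages)
        return torch.relu(self.update(torch.cat([x, agg], dim=-1)))

x = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
print(ChannelAttentiveLayer(16)(x, edge_index).shape)  # torch.Size([5, 16])
```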
Authors:Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, Ling Liu
Abstract:
Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLMs, there is still a large research gap for Large Reasoning Models (LRMs), which are equipped with improved reasoning capability. In this paper, we systematically examine a simplified pipeline for producing safety-aligned LRMs. From our evaluation of various LRMs, we deliver two main findings: i) safety alignment can be performed on an LRM to restore its safety capability; ii) safety alignment degrades the reasoning capability of LRMs. Together, these findings show that there is a trade-off between reasoning and safety capability under the sequential LRM production pipeline. The discovered trade-off, which we name the Safety Tax, should shed light on future safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which may serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.
中文: 大型推理模型的安全对齐能恢复其安全能力,但会降低推理性能,这种权衡被称为“安全税”,同时提出的DirectRefusal数据集可作为对齐的替代资源。
English: Safety alignment for Large Reasoning Models (LRMs) restores safety but reduces reasoning ability, revealing a trade-off termed "Safety Tax," and introduces the DirectRefusal dataset as a resource for alignment.
Authors:Haofei Lu, Dongqi Han, Yifei Shen, Dongsheng Li
Abstract:
Diffusion models have recently shown significant potential in solving decision-making problems, particularly in generating behavior plans -- also known as diffusion planning. While numerous studies have demonstrated the impressive performance of diffusion planning, the mechanisms behind the key components of a good diffusion planner remain unclear, and the design choices are highly inconsistent across existing studies. In this work, we address this issue through systematic empirical experiments on diffusion planning in an offline reinforcement learning (RL) setting, providing practical insights into the essential components of diffusion planning. We train and evaluate over 6,000 diffusion models, identifying critical components such as guided sampling, network architecture, action generation, and planning strategy. We reveal that some design choices that run counter to common practice in prior diffusion-planning work actually lead to better performance; for example, unconditional sampling with selection can be better than guided sampling, and a Transformer outperforms a U-Net as the denoising network. Based on these insights, we suggest a simple yet strong diffusion planning baseline that achieves state-of-the-art results on standard offline RL benchmarks.
Chinese: 本研究通过评估6000多个模型,系统分析了离线强化学习中的扩散规划,发现无条件采样加选择和Transformer网络等非常规设计优于传统方法,从而提出了新的最先进基准。
English: This study systematically analyzes diffusion planning in offline reinforcement learning by evaluating over 6,000 models, revealing that unconventional design choices like unconditional sampling with selection and Transformer networks outperform common practices, leading to a new state-of-the-art baseline.
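The "unconditional sampling with selection" strategy highlighted above can be summarized in a few lines; the sketch below assumes generic `planner_sample` and `value_fn` interfaces (both hypothetical) and is not tied to the paper's codebase.

```python
import torch

def plan_with_selection(planner_sample, value_fn, obs, n_candidates=64):
    """Draw several unconditional plans and keep the highest-value one."""
    plans = planner_sample(obs, n_candidates)   # (n_candidates, horizon, act_dim)
    scores = value_fn(obs, plans)               # (n_candidates,)
    return plans[scores.argmax()]

# Dummy stand-ins so the sketch runs end to end.
planner_sample = lambda obs, n: torch.randn(n, 16, 4)
value_fn = lambda obs, plans: plans.sum(dim=(1, 2))
best_plan = plan_with_selection(planner_sample, value_fn, obs=torch.zeros(8))
print(best_plan.shape)  # torch.Size([16, 4])
```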
Authors:Haofei Lu, Zhe Wu, Junliang Xing, Jianshu Li, Ruoyu Li, Zhe Li, Yuanchun Shi
Abstract:
Embodiment co-design aims to optimize a robot's morphology and control policy simultaneously. While prior work has demonstrated its potential for generating environment-adaptive robots, this field still faces persistent challenges in optimization efficiency due to (i) the combinatorial nature of morphological search spaces and (ii) the intricate dependencies between morphology and control. We prove that ineffective morphology representation and unbalanced reward signals between the design and control stages are key obstacles to efficiency. To advance towards efficient embodiment co-design, we propose BodyGen, which utilizes (1) topology-aware self-attention for both design and control, enabling efficient morphology representation with lightweight model sizes; and (2) a temporal credit assignment mechanism that ensures balanced reward signals for optimization. Building on these findings, BodyGen achieves an average 60.03% performance improvement over state-of-the-art baselines. We provide code and more results on the website: https://genesisorigin.github.io.
Chinese: BodyGen采用拓扑感知自注意力机制和时间信用分配,显著提升了具身协同设计的效率,相比现有方法性能平均提高60.03%。
English: BodyGen introduces a topology-aware self-attention mechanism and temporal credit assignment to enhance embodiment co-design efficiency, achieving a 60.03% performance gain over existing methods.
Authors:Wanli Hong, Yuliang Shi, Jonathan Niles-Weed
Abstract:
Motivated by applications in trajectory inference and particle tracking, we introduce Smooth Schrödinger Bridges. Our proposal generalizes prior work by allowing the reference process in the Schrödinger Bridge problem to be a smooth Gaussian process, leading to more regular and interpretable trajectories in applications. Though naïvely smoothing the reference process leads to a computationally intractable problem, we identify a class of processes (including the Matérn processes) for which the resulting Smooth Schrödinger Bridge problem can be lifted to a simpler problem on phase space, which can be solved in polynomial time. We develop a practical approximation of this algorithm that outperforms existing methods on numerous simulated and real single-cell RNAseq datasets. The code can be found at https://github.com/WanliHongC/Smooth_SB
Chinese: 本文提出平滑薛定谔桥方法,通过采用平滑高斯过程生成更规则的轨迹,在单细胞RNAseq数据上表现优异,并提供了计算高效的实现方案。
English: This paper introduces Smooth Schrödinger Bridges, a method that enhances trajectory inference by using smooth Gaussian processes for more regular paths and demonstrates superior performance on single-cell RNAseq data with a computationally efficient solution.
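For reference, the generic Schrödinger Bridge problem that this work builds on can be written as follows; the smooth variant additionally restricts the reference measure $R$ to a smooth Gaussian (e.g., Matérn) process, and the exact formulation in the paper may differ in detail.

```latex
\min_{P}\; \mathrm{KL}\!\left(P \,\|\, R\right)
\quad \text{s.t.} \quad P_{0} = \mu_{0}, \qquad P_{T} = \mu_{T},
```

where $P$ ranges over path measures with the prescribed initial and terminal marginals $\mu_0$ and $\mu_T$.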
Authors:Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang
Abstract:
Recently, contrastive learning has risen to be a promising paradigm for extracting meaningful data representations. Among various special designs, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite its empirical success, the underlying mechanism of the projection head remains under-explored. In this paper, we develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective. By establishing the theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck, filtering out the information irrelevant to the contrastive objective. Based on theoretical insights, we introduce modifications to projectors with training and structural regularizations. Empirically, our methods exhibit consistent improvement in the downstream performance across various real-world datasets, including CIFAR-10, CIFAR-100, and ImageNet-100. We believe our theoretical understanding on the role of the projection head will inspire more principled and advanced designs in this field. Code is available at https://github.com/PKU-ML/Projector_Theory.
Chinese: 本文从信息论角度理论分析了对比学习中投影头的作用,揭示其作为信息瓶颈过滤无关信息的机制,并提出正则化方法在多个数据集上实现了性能提升。
English: This paper provides a theoretical analysis of the projection head in contrastive learning, revealing its role as an information bottleneck that filters irrelevant data, and proposes regularization methods that improve performance across multiple datasets.
Authors:Lixu Wang, Bingqi Shang, Yi Li, Payal Mohapatra, Wei Dong, Xiao Wang, Qi Zhu
Abstract:
Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property protection and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. But unlike regular SL, SA replaces frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA allows the client to add bi-level noise to the frontend and the extracted data representations, ensuring data protection. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate noise injection's impact on adaptation performance. Our SA focuses on the challenging few-shot adaptation and adopts patch retrieval augmentation for overfitting alleviation. Extensive experiments on multiple datasets validate SA's superiority over state-of-the-art methods and demonstrate its defense against advanced data reconstruction attacks while preventing model leakage with minimal computation cost on the client side. The source codes can be found at https://github.com/conditionWang/Split_Adaptation.
中文: 提出的分割适配方法将视觉Transformer分割为前端和后端组件,通过量化和噪声注入在少样本下游任务中同时保护数据隐私和模型知识产权,同时保持高性能并降低客户端计算成本。
English: The proposed split adaptation method segments Vision Transformers into frontend and backend components, using quantization and noise injection to protect both data privacy and model intellectual property during few-shot downstream tasks, while maintaining high performance and low client computation costs.
Authors:Magnus Cunow, Gerrit Großmann
Abstract:
Autoencoders are effective deep learning models that can function as generative models and learn latent representations for downstream tasks. The use of graph autoencoders - with both encoder and decoder implemented as message passing networks - is intriguing due to their ability to generate permutation-invariant graph representations. However, this approach faces difficulties because decoding a graph structure from a single vector is challenging, and comparing input and output graphs requires an effective permutation-invariant similarity measure. As a result, many studies rely on approximate methods.
In this work, we explore the effect of graph matching precision on the training behavior and generation capabilities of a Variational Autoencoder (VAE). Our contribution is two-fold: (1) we propose a transformer-based message passing graph decoder as an alternative to a graph neural network decoder, which is more robust and expressive thanks to its global attention mechanism. (2) We show that the precision of graph matching has a significant impact on training behavior and is essential for effective de novo (molecular) graph generation.
Code is available at https://github.com/mcunow/graph-matching
中文摘要:本研究揭示了图匹配精度对变分自编码器训练的关键影响,并提出了基于Transformer的解码器,显著提升了分子图生成的鲁棒性与表达能力。
English summary: This study demonstrates that precise graph matching is crucial for training variational autoencoders and introduces a transformer-based decoder that enhances robustness and expressiveness in molecular graph generation.
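To make the role of "graph matching precision" concrete, the sketch below aligns predicted and ground-truth nodes exactly with the Hungarian algorithm before a reconstruction loss would be computed. It illustrates precise matching in general and is not the authors' pipeline; all shapes are arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_nodes(x_true, x_pred):
    """Return a permutation of predicted nodes that best matches the true nodes."""
    # Cost: squared Euclidean distance between every (true, predicted) node pair.
    cost = ((x_true[:, None, :] - x_pred[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)
    return col  # x_pred[col] is aligned with x_true

x_true = np.random.randn(6, 8)
perm = np.random.permutation(6)
x_pred = x_true[perm] + 0.01 * np.random.randn(6, 8)
col = match_nodes(x_true, x_pred)
print(np.allclose(x_true, x_pred[col], atol=0.1))  # True: matching recovers the permutation
```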
Authors:Song Xia, Yi Yu, Wenhan Yang, Meiwen Ding, Zhuo Chen, Ling-Yu Duan, Alex C. Kot, Xudong Jiang
Abstract:
By locally encoding raw data into intermediate features, collaborative inference enables end users to leverage powerful deep learning models without exposure of sensitive raw data to cloud servers. However, recent studies have revealed that these intermediate features may not sufficiently preserve privacy, as information can be leaked and raw data can be reconstructed via model inversion attacks (MIAs). Obfuscation-based methods, such as noise corruption, adversarial representation learning, and information filters, enhance the inversion robustness by obfuscating the task-irrelevant redundancy empirically. However, methods for quantifying such redundancy remain elusive, and the explicit mathematical relation between this redundancy minimization and inversion robustness enhancement has not yet been established. To address that, this work first theoretically proves that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA. Then, we derive a differentiable and solvable measure for bounding this conditional entropy based on the Gaussian mixture estimation and propose a conditional entropy maximization (CEM) algorithm to enhance the inversion robustness. Experimental results on four datasets demonstrate the effectiveness and adaptability of our proposed CEM; without compromising feature utility and computing efficiency, plugging the proposed CEM into obfuscation-based defense mechanisms consistently boosts their inversion robustness, achieving average gains ranging from 12.9\% to 48.2\%. Code is available at \href{https://github.com/xiasong0501/CEM}{https://github.com/xiasong0501/CEM}.
Chinese: 协作推理通过将原始数据编码为中间特征来保护隐私,但这些特征仍可能遭受模型逆向攻击;本研究提出条件熵最大化算法,在不影响功能与效率的前提下,有效增强隐私保护能力。
English: Collaborative inference protects raw data by encoding it into intermediate features, but these can still be vulnerable to model inversion attacks, which this study addresses by proposing a conditional entropy maximization algorithm to enhance privacy without sacrificing utility or efficiency.
Authors:Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
Abstract:
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks require an embedding model whose outputs can be controlled by a natural language instruction that shapes the visual representation. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control. Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/
Authors:Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, Bo Li
Abstract:
Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light $\implies$ stop") and embeds them into a probabilistic graphical model (e.g., Markov Logic Network) to verify predicted actions using recognized environmental attributes. Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.
中文总结:SafeAuto是一种创新框架,通过结合位置相关损失函数、概率逻辑推理与多模态检索增强生成技术,将多模态大语言模型与结构化安全知识相融合,从而显著提升自动驾驶系统的安全性与可靠性。
English Summary: SafeAuto is a novel framework that enhances autonomous driving safety by integrating multimodal large language models with structured safety knowledge through position-dependent loss functions, probabilistic reasoning, and multimodal retrieval-augmented generation.
Authors:Naveen Mysore
Abstract:
Reinforcement learning (RL) methods frequently assume that each new observation completely reflects the environment's state, thereby guaranteeing Markovian (one-step) transitions. In practice, partial observability or sensor/actuator noise often invalidates this assumption. This paper proposes a systematic methodology for detecting such violations, combining a partial correlation-based causal discovery process (PCMCI) with a novel Markov Violation score (MVS). The MVS measures multi-step dependencies that emerge when noise or incomplete state information disrupts the Markov property.
Classic control tasks (CartPole, Pendulum, Acrobot) serve as examples to illustrate how targeted noise and dimension omissions affect both RL performance and measured Markov consistency. Surprisingly, even substantial observation noise sometimes fails to induce strong multi-lag dependencies in certain domains (e.g., Acrobot). In contrast, dimension-dropping investigations show that excluding some state variables (e.g., angular velocities in CartPole and Pendulum) significantly reduces returns and increases MVS, while removing other dimensions has minimal impact.
These findings emphasize the importance of locating and safeguarding the most causally essential dimensions in order to preserve effective single-step learning. By integrating partial correlation tests with RL performance outcomes, the proposed approach precisely identifies when and where the Markov assumption is violated. This framework offers a principled mechanism for developing robust policies, informing representation learning, and addressing partial observability in real-world RL scenarios. All code and experimental logs are accessible for reproducibility (https://github.com/ucsb/markovianess).
中文: 本文提出一种结合因果发现与新型马尔可夫违背评分的方法,用于检测噪声或部分可观测性对强化学习中马尔可夫假设的破坏,并通过控制任务证明保护关键因果状态维度对维持有效学习至关重要。
English: This paper introduces a method combining causal discovery with a novel Markov Violation score to detect when noise or partial observability disrupts the Markov assumption in reinforcement learning, demonstrating through control tasks that protecting causally critical state dimensions is key to maintaining effective learning.
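A bare-bones version of the underlying statistical idea, stripped of PCMCI and the full MVS aggregation, is a lag-2 partial-correlation check: if the observation two steps back still predicts the current one after conditioning on the previous observation, the one-step Markov assumption is suspect. The function below is a simplified, hypothetical stand-in for that test.

```python
import numpy as np

def lag2_partial_corr(obs):
    """Partial correlation of o_{t-2} and o_t given o_{t-1} (1-D observations)."""
    o_t, o_t1, o_t2 = obs[2:], obs[1:-1], obs[:-2]
    # Regress out o_{t-1} from both o_t and o_{t-2}, then correlate the residuals.
    def residual(y, x):
        beta = np.polyfit(x, y, deg=1)
        return y - np.polyval(beta, x)
    r1, r2 = residual(o_t, o_t1), residual(o_t2, o_t1)
    return np.corrcoef(r1, r2)[0, 1]

rng = np.random.default_rng(0)
markov = rng.standard_normal(2000).cumsum()    # random walk: only one-step memory
print(abs(lag2_partial_corr(markov)) < 0.1)    # True (no extra lag-2 dependence)
```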
Authors:Jian Gao, Weidong Cao, Junyi Yang, Xuan Zhang
Abstract:
The massive and large-scale design of foundational semiconductor integrated circuits (ICs) is crucial to sustaining the advancement of many emerging and future technologies, such as generative AI, 5G/6G, and quantum computing. Excitingly, recent studies have shown the great capabilities of foundational models in expediting the design of digital ICs. Yet, applying generative AI techniques to accelerate the design of analog ICs remains a significant challenge due to critical domain-specific issues, such as the lack of a comprehensive dataset and effective representation methods for analog circuits. This paper proposes, $\textbf{AnalogGenie}$, a $\underline{\textbf{Gen}}$erat$\underline{\textbf{i}}$ve $\underline{\textbf{e}}$ngine for automatic design/discovery of $\underline{\textbf{Analog}}$ circuit topologies--the most challenging and creative task in the conventional manual design flow of analog ICs. AnalogGenie addresses two key gaps in the field: building a foundational comprehensive dataset of analog circuit topology and developing a scalable sequence-based graph representation universal to analog circuits. Experimental results show the remarkable generation performance of AnalogGenie in broadening the variety of analog ICs, increasing the number of devices within a single design, and discovering unseen circuit topologies far beyond any prior arts. Our work paves the way to transform the longstanding time-consuming manual design flow of analog ICs to an automatic and massive manner powered by generative AI. Our source code is available at https://github.com/xz-group/AnalogGenie.
中文: 本文提出AnalogGenie生成式引擎,通过构建全面数据集和可扩展的电路表示方法,实现了模拟集成电路的自动化设计,大幅提升了电路多样性并突破了现有拓扑发现能力的局限。
English: The paper introduces AnalogGenie, a generative AI engine that automates the design of analog integrated circuits by creating a comprehensive dataset and scalable representation, significantly expanding circuit variety and enabling topology discovery beyond previous methods.
Authors:Federico Pizarro Bejarano, Bryson Jones, Daniel Pastor Moreno, Joseph Bowkett, Paul G. Backes, Angela P. Schoellig
Abstract:
Diffusion models have revolutionized imitation learning, allowing robots to replicate complex behaviours. However, diffusion often relies on cameras and other exteroceptive sensors to observe the environment and lacks long-term memory. In space, military, and underwater applications, robots must be highly robust to failures in exteroceptive sensors, operating using only proprioceptive information. In this paper, we propose ProDapt, a method of incorporating long-term memory of previous contacts between the robot and the environment in the diffusion process, allowing it to complete tasks using only proprioceptive data. This is achieved by identifying "keypoints", essential past observations maintained as inputs to the policy. We test our approach using a UR10e robotic arm in both simulation and real experiments and demonstrate the necessity of this long-term memory for task completion.
中文: ProDapt方法通过引入以往接触环境的关键点作为长期记忆,改进了模仿学习中的扩散模型,使机器人仅凭本体感知数据即可完成任务,并在UR10e机械臂的仿真和实际实验中验证了其必要性。
English: ProDapt enhances diffusion models in imitation learning by incorporating long-term memory of environmental contacts through keypoints, enabling robots to complete tasks using only proprioceptive data, as validated with a UR10e arm in simulations and real experiments.
Authors:Hanjiang Hu, Alexander Robey, Changliu Liu
Abstract:
Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness and over-refusal. Check out the website here https://sites.google.com/view/llm-nbf/home . Our code is available on https://github.com/HanjiangHu/NBF-LLM .
中文摘要:该研究提出了一种基于神经屏障函数的安全引导框架,通过主动检测多轮对话中的有害查询并维持持续安全状态,有效防御大语言模型的多轮越狱攻击。
English Summary: The proposed safety steering framework using a neural barrier function effectively prevents multi-turn jailbreaking attacks in large language models by proactively detecting harmful queries and ensuring invariant safety throughout dialogues.
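For context, a standard discrete-time barrier-function condition that keeps the safe set $S=\{x : h(x) \ge 0\}$ forward invariant is shown below; the paper learns a neural $h$ over dialogue states, and its exact formulation may differ.

```latex
h(x_{t+1}) - h(x_t) \;\ge\; -\alpha\, h(x_t), \qquad 0 < \alpha \le 1,
```

so that $h(x_{t+1}) \ge (1-\alpha)\,h(x_t) \ge 0$ whenever the current state is safe.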
Authors:Xinyu Yuan, Zichen Wang, Marcus Collins, Huzefa Rangwala
Abstract:
Recent years have witnessed a surge in the development of protein structural tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques like language modeling for protein structures, and large multimodal models to integrate structures with protein sequences and functional texts. Despite the progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than global structures, as typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. Observations of codebook under-utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively. Source code and model weights are available at https://github.com/KatarinaYuan/StructTokenBench
Chinese: 本研究提出了StructTokenBench框架以评估蛋白质结构标记化方法,并开发了AminoAseed策略,通过优化码本利用显著提升了标记器性能,在监督任务中相比现有模型实现了明显改进。
English: The study introduces StructTokenBench, a framework for evaluating protein structure tokenization methods, and proposes AminoAseed, a strategy that improves tokenizer performance by enhancing codebook utilization, achieving significant gains over existing models in supervised tasks.
Authors:Zhenxing Cui, Lu Chen, Yunhai Wang, Daniel Haehn, Yong Wang, Hanspeter Pfister
Abstract:
This paper presents a systematic study of the generalization of convolutional neural networks (CNNs) and humans on relational reasoning tasks with bar charts. We first revisit previous experiments on graphical perception and update the benchmark performance of CNNs. We then test the generalization performance of CNNs on a classic relational reasoning task: estimating bar length ratios in a bar chart, by progressively perturbing the standard visualizations. We further conduct a user study to compare the performance of CNNs and humans. Our results show that CNNs outperform humans only when the training and test data have the same visual encodings. Otherwise, they may perform worse. We also find that CNNs are sensitive to perturbations in various visual encodings, regardless of their relevance to the target bars. Yet, humans are mainly influenced by bar lengths. Our study suggests that robust relational reasoning with visualizations is challenging for CNNs. Improving CNNs' generalization performance may require training them to better recognize task-related visual properties.
中文: 本研究表明,尽管卷积神经网络在视觉编码一致的条形图关系推理任务中表现优于人类,但在视觉干扰下其性能显著下降,而人类主要关注条形长度,这凸显了卷积神经网络在此类任务中实现稳健泛化的挑战。
English: This study demonstrates that while CNNs can outperform humans in relational reasoning tasks with bar charts when visual encodings remain consistent, their performance significantly deteriorates under visual perturbations, unlike humans who focus primarily on bar lengths, highlighting the challenge of achieving robust generalization in CNNs for such tasks.
Authors:Qiusi Zhan, Richard Fang, Henil Shalin Panchal, Daniel Kang
Abstract:
Large Language Model (LLM) agents exhibit remarkable performance across diverse applications by using external tools to interact with environments. However, integrating external tools introduces security risks, such as indirect prompt injection (IPI) attacks. Despite defenses designed for IPI attacks, their robustness remains questionable due to insufficient testing against adaptive attacks. In this paper, we evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%. This reveals critical vulnerabilities in current defenses. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability. The code is available at https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.
中文: 大语言模型代理面临间接提示注入攻击的重大安全风险,自适应攻击成功绕过了所有八种评估防御措施,成功率超过50%,揭示了当前防御的关键漏洞,并强调了设计防御时进行自适应攻击评估的必要性。
English: Large Language Model agents face significant security risks from indirect prompt injection attacks, as demonstrated by adaptive attacks that successfully bypass all eight evaluated defenses with over 50% success rates, highlighting critical vulnerabilities and the need for robust defense evaluations.
Authors:Hongyi Cai, Yuqian Fu, Hongming Fu, Bo Zhao
Abstract:
Instruction tuning is crucial for optimizing Large Language Models (LLMs), yet mainstream data selection methods heavily rely on LLMs as instruction quality scorers, leading to high computational costs and reduced data diversity. To address these limitations, we propose MergeIT, a novel LLM-based Merging strategy for better Instruction Tuning that shifts the focus from selection to synthesis. MergeIT operates in two stages: first, topic-aware filtering clusters and refines the dataset, preserving diversity while eliminating redundancy without relying on LLM-based scoring. Second, LLM-based merging synthesizes semantically similar instructions into more informative and compact training data, enhancing data richness while further reducing dataset size. Experimental results demonstrate that MergeIT enables efficient, diverse, and scalable instruction selection and synthesis, establishing LLM-based merging as a promising alternative to conventional scoring-based selection methods for instruction tuning. Our source code and datasets are now available at https://github.com/XcloudFance/MergeIT
中文摘要:MergeIT提出了一种基于大型语言模型的两阶段合并策略,通过主题感知过滤保留多样性并消除冗余,再合成语义相似的指令为更丰富紧凑的训练数据,实验证明其比传统评分选择方法更高效、可扩展。
English Summary: MergeIT introduces a two-stage LLM-based merging strategy for instruction tuning, which first filters and clusters data by topic to maintain diversity without LLM scoring, then synthesizes similar instructions into richer, more compact training sets, proving more efficient and scalable than traditional selection methods.
Authors:Menghua Wu, Russell Littman, Jacob Levine, Lin Qiu, Tommaso Biancalani, David Richmond, Jan-Christian Huetter
Abstract:
High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and change of direction for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at https://github.com/genentech/PerturbQA.
中文: PerturbQA是一个专为提升扰动实验中结构化推理能力而设计的新基准,通过利用大型语言模型更好地捕捉生物学语义,解决了现有方法在预测未知扰动方面的不足,并显著提高了分析准确性。
English: PerturbQA is a new benchmark designed to enhance machine learning models' structured reasoning in perturbation experiments, addressing current limitations by leveraging large language models to better capture biological semantics and improve predictive accuracy for unseen perturbations.
Authors:Li Yang, Mirna El Rajab, Abdallah Shami, Sami Muhaidat
Abstract:
Zero-Touch Networks (ZTNs) represent a state-of-the-art paradigm shift towards fully automated and intelligent network management, enabling the automation and intelligence required to manage the complexity, scale, and dynamic nature of next-generation (6G) networks. ZTNs leverage Artificial Intelligence (AI) and Machine Learning (ML) to enhance operational efficiency, support intelligent decision-making, and ensure effective resource allocation. However, the implementation of ZTNs is subject to security challenges that need to be resolved to achieve their full potential. In particular, two critical challenges arise: the need for human expertise in developing AI/ML-based security mechanisms, and the threat of adversarial attacks targeting AI/ML models. In this survey paper, we provide a comprehensive review of current security issues in ZTNs, emphasizing the need for advanced AI/ML-based security mechanisms that require minimal human intervention and protect AI/ML models themselves. Furthermore, we explore the potential of Automated ML (AutoML) technologies in developing robust security solutions for ZTNs. Through case studies, we illustrate practical approaches to securing ZTNs against both conventional and AI/ML-specific threats, including the development of autonomous intrusion detection systems and strategies to combat Adversarial ML (AML) attacks. The paper concludes with a discussion of the future research directions for the development of ZTN security approaches.
中文: 零接触网络(ZTNs)利用人工智能和机器学习实现网络管理自动化,但面临安全挑战,包括开发AI/ML安全机制需人力参与及对抗性攻击威胁,本综述通过回顾现有问题并探索AutoML技术来寻求解决方案。
English: Zero-Touch Networks (ZTNs) utilize AI and ML to automate network management but face security challenges, including the need for human expertise in developing AI/ML security mechanisms and threats from adversarial attacks, which this survey addresses by reviewing current issues and exploring AutoML for robust solutions.
Authors:Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, Qixiang Ye
Abstract:
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
中文: 本文提出自适应关键帧采样算法(AKS),通过优化关键帧与提示的相关性及视频覆盖范围,有效提升多模态大语言模型在长视频理解中的问答准确率。
English: This paper introduces Adaptive Keyframe Sampling (AKS), a plug-and-play module that optimizes video token selection by balancing relevance to prompts and video coverage, thereby enhancing long video understanding accuracy in multimodal large language models.
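The relevance-plus-coverage objective described above can be approximated greedily; the sketch below assumes precomputed frame and prompt embeddings and is an illustration of the objective rather than the AKS algorithm itself.

```python
import numpy as np

def select_keyframes(frame_emb, prompt_emb, k=8, coverage_weight=0.5):
    # Relevance: cosine similarity between each frame embedding and the prompt embedding.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    relevance = f @ p
    times = np.linspace(0.0, 1.0, len(frame_emb))
    selected = []
    for _ in range(k):
        # Coverage bonus: distance (in normalized time) to the nearest already-chosen frame.
        if selected:
            coverage = np.min(np.abs(times[:, None] - times[selected][None, :]), axis=1)
        else:
            coverage = np.ones_like(times)
        score = relevance + coverage_weight * coverage
        score[selected] = -np.inf
        selected.append(int(score.argmax()))
    return sorted(selected)

print(select_keyframes(np.random.randn(120, 64), np.random.randn(64), k=8))
```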
Authors:Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
Abstract:
Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a (separate) learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on-par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach in planning in complex and stochastic environments with high-dimensional action spaces.
中文: 潜在宏动作规划器(L-MAP)通过将连续动作空间离散化为潜在宏动作,并利用蒙特卡洛树搜索处理环境随机性,在离线强化学习的随机控制任务中显著优于现有方法且保持低决策延迟。
English: The Latent Macro Action Planner (L-MAP) addresses computational challenges in stochastic, high-dimensional continuous action spaces by learning temporally extended macro-actions and employing Monte Carlo tree search for efficient planning, significantly outperforming existing methods in offline reinforcement learning tasks.
Authors:Yujie Li, Xiangkun Wang, Xin Yang, Marcello Bonsangue, Junbo Zhang, Tianrui Li
Abstract:
Open-world continual learning (OWCL) adapts to sequential tasks with open samples, learning knowledge incrementally while preventing forgetting. However, existing OWCL still requires a large amount of labeled data for training, which is often impractical in real-world applications. Given that new categories/entities typically come with limited annotations and are in small quantities, a more realistic situation is OWCL with scarce labeled data, i.e., few-shot training samples. Hence, this paper investigates the problem of open-world few-shot continual learning (OFCL), which is challenging in (i) learning unbounded tasks without forgetting previous knowledge and avoiding overfitting, (ii) constructing compact decision boundaries for open detection with limited labeled data, and (iii) transferring knowledge about knowns and unknowns, and even updating unknowns to knowns once the labels of open samples are learned. In response, we propose a novel OFCL framework that integrates three key components: (1) an instance-wise token augmentation (ITA) that represents and enriches sample representations with additional knowledge, (2) a margin-based open boundary (MOB) that supports open detection as new tasks emerge over time, and (3) an adaptive knowledge space (AKS) that endows unknowns with knowledge for the updating from unknowns to knowns. Finally, extensive experiments show that the proposed OFCL framework remarkably outperforms all baselines, with practical importance and reproducibility. The source code is released at https://github.com/liyj1201/OFCL.
中文: 本文提出了一种新颖的开放世界小样本持续学习框架,通过实例化标记增强、基于边界的开放检测和自适应知识空间,解决了有限标注数据下的持续学习与知识迁移难题,实验证明其性能显著优于现有方法。
English: This paper introduces a novel framework for open-world few-shot continual learning (OFCL), addressing the challenges of learning from limited labeled data while preventing forgetting and enabling knowledge transfer through instance-wise token augmentation, margin-based open boundary, and adaptive knowledge space, with experimental results showing superior performance over existing methods.
Authors:Jonathan Drechsel, Anja Reusch, Steffen Herbold
Abstract:
Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets exhibit new SoTA performance on mathematical retrieval tasks. We publish our code, generated datasets, and pretrained mathematical models: https://github.com/aieng-lab/math-mutator.
中文摘要:本研究提出Math Mutator (MAMUT)框架,通过生成多样化数学公式变体构建专业训练数据集,在数学检索任务中实现了最先进的性能表现。
English Summary: This study introduces Math Mutator (MAMUT), a framework that generates diverse mathematical formula variations to create specialized training datasets, achieving state-of-the-art performance in mathematical retrieval tasks.
Authors:Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
Abstract:
Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and the reasons have not been studied systematically.
In this work, we propose a novel training method TetraJet for a more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than $50\%$ compared to the baseline, and can even achieve competitive performance compared to full precision training. The codes are available at https://github.com/thu-ml/TetraJet-MXFP4Training
中文摘要:本研究提出TetraJet新型训练方法,通过识别并利用Q-EMA和Q-Ramping技术解决MXFP4训练中的权重振荡问题,相比基线方法将精度损失降低超过50%,在视觉Transformer上实现了与全精度训练相媲美的性能。
English Summary: The study introduces TetraJet, a novel training method that addresses accuracy degradation in FP4 pre-training by identifying and mitigating weight oscillation through Q-EMA and Q-Ramping techniques, achieving over 50% reduction in accuracy loss compared to baseline methods.
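To illustrate the oscillation-damping idea behind Q-EMA (in a simplified form that ignores MXFP4 group scaling and may differ from the paper's quantizer), the sketch below quantizes an exponential moving average of the weights and passes gradients straight through.

```python
import torch

class EMAQuantizer:
    """Quantize an EMA of the weights so round-to-nearest decisions flip less often."""
    def __init__(self, weight, decay=0.99, step=2 ** -2):
        self.ema = weight.detach().clone()
        self.decay, self.step = decay, step

    def __call__(self, weight):
        # Update the EMA with the latest weights, then quantize the EMA.
        self.ema.mul_(self.decay).add_(weight.detach(), alpha=1 - self.decay)
        w_q = torch.round(self.ema / self.step) * self.step
        # Straight-through estimator: forward uses w_q, gradients flow to weight.
        return weight + (w_q - weight).detach()

w = torch.randn(16, 16, requires_grad=True)
quant = EMAQuantizer(w)
loss = (quant(w) ** 2).sum()
loss.backward()
print(w.grad.shape)  # torch.Size([16, 16])
```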
Authors:Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Kazuhiro Nakadai
Abstract:
Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show longer contexts improve classification (F1: 0.8-0.9) while medium instances ensure localization precision (0.65-0.70). This suggests MIL can enhance scalable marine monitoring. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
中文:DSMIL-LocNet框架通过多示例学习仅需包级标签即可检测和定位鲸鱼叫声,在南极鲸鱼数据上展现出更好的分类和定位能力,有助于可扩展的海洋监测。
English: The DSMIL-LocNet framework uses multiple instance learning to detect and locate whale calls with only bag-level labels, demonstrating improved classification and localization on Antarctic whale data for scalable marine monitoring.
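The bag-level training setup maps naturally onto attention-based multiple instance learning (MIL) pooling; the sketch below is the classic attention-pooling formulation (simplified relative to the dual-stream DSMIL-LocNet), with all sizes chosen arbitrarily.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, 1)

    def forward(self, instances):                              # (n_instances, dim)
        weights = torch.softmax(self.attn(instances), dim=0)   # (n_instances, 1)
        bag = (weights * instances).sum(dim=0)                 # pooled bag embedding
        return torch.sigmoid(self.classifier(bag)), weights    # bag score + localization cue

bag_score, attn = AttentionMIL()(torch.randn(30, 128))
print(bag_score.shape, attn.shape)  # torch.Size([1]) torch.Size([30, 1])
```

The attention weights double as a coarse localization signal over instances, which is why bag-level labels can still yield per-segment detections.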
Authors:Long Chen, Xianchao Xiu
Abstract:
Sparse principal component analysis (PCA) is a well-established dimensionality reduction technique that is often used for unsupervised feature selection (UFS). However, determining the regularization parameters is rather challenging, and conventional approaches, including grid search and Bayesian optimization, not only bring great computational costs but also exhibit high sensitivity. To address these limitations, we first establish a structured sparse PCA formulation by integrating $\ell_1$-norm and $\ell_{2,1}$-norm to capture the local and global structures, respectively. Building upon the off-the-shelf alternating direction method of multipliers (ADMM) optimization framework, we then design an interpretable deep unfolding network that translates iterative optimization steps into trainable neural architectures. This innovation enables automatic learning of the regularization parameters, effectively bypassing the empirical tuning requirements of conventional methods. Numerical experiments on benchmark datasets validate the advantages of our proposed method over the existing state-of-the-art methods. Our code will be accessible at https://github.com/xianchaoxiu/SPCA-Net.
Chinese: 本文提出了一种深度展开网络,能够自动学习稀疏主成分分析的正则化参数,有效避免了传统方法的高计算成本和敏感性,并在基准数据集上验证了其优于现有方法的性能。
English: This paper introduces a deep unfolding network that automatically learns regularization parameters for sparse PCA, overcoming the computational cost and sensitivity of traditional methods while outperforming existing approaches on benchmark datasets.
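One plausible way to write the structured sparse PCA objective sketched in the abstract, with data matrix $X$ and projection $W$, is shown below; the paper's exact formulation and its ADMM splitting may differ.

```latex
\min_{W}\; -\operatorname{tr}\!\left(W^{\top} X^{\top} X\, W\right)
  + \lambda_{1}\,\lVert W\rVert_{1}
  + \lambda_{2}\,\lVert W\rVert_{2,1}
\quad \text{s.t.} \quad W^{\top} W = I,
```

where $\lVert W\rVert_{2,1} = \sum_i \lVert w_i \rVert_2$ drives entire rows (features) to zero, which is what makes the solution usable for unsupervised feature selection; deep unfolding turns each ADMM iteration into a trainable layer, so that $\lambda_1$ and $\lambda_2$ become learnable network parameters rather than hand-tuned constants.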
Authors:Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao
Abstract:
The advancement of remote sensing, including satellite systems, facilitates the continuous acquisition of remote sensing imagery globally, introducing novel challenges for achieving open-world tasks. Deployed models need to continuously adjust to a constant influx of new data, which frequently exhibits diverse shifts from the data encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update their parameters without forgetting learned knowledge, as has been considered in works on a variety of open-world tasks. However, existing studies are typically conducted within a single dataset to simulate realistic conditions, with a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce \textbf{OpenEarthSensing (OES)}, a large-scale fine-grained benchmark for open-world remote sensing. OES includes 189 scene and object categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, to provide a more comprehensive testbed for evaluating the generalization performance, OES encompasses five data domains with significant covariate shifts, including two RGB satellite domains, one RGB aerial domain, one multispectral RGB domain, and one infrared domain. We evaluate the baselines and existing methods for diverse tasks on OES, demonstrating that it serves as a meaningful and challenging benchmark for open-world remote sensing. The proposed dataset OES is available at https://haiv-lab.github.io/OES.
Authors:Li Yang, Shimaa Naser, Abdallah Shami, Sami Muhaidat, Lyndon Ong, Mérouane Debbah
Abstract:
The transition from 5G to 6G mobile networks necessitates network automation to meet the escalating demands for high data rates, ultra-low latency, and integrated technology. Recently, Zero-Touch Networks (ZTNs), driven by Artificial Intelligence (AI) and Machine Learning (ML), are designed to automate the entire lifecycle of network operations with minimal human intervention, presenting a promising solution for enhancing automation in 5G/6G networks. However, the implementation of ZTNs brings forth the need for autonomous and robust cybersecurity solutions, as ZTNs rely heavily on automation. AI/ML algorithms are widely used to develop cybersecurity mechanisms, but require substantial specialized expertise and encounter model drift issues, posing significant challenges in developing autonomous cybersecurity measures. Therefore, this paper proposes an automated security framework targeting Physical Layer Authentication (PLA) and Cross-Layer Intrusion Detection Systems (CLIDS) to address security concerns at multiple Internet protocol layers. The proposed framework employs drift-adaptive online learning techniques and a novel enhanced Successive Halving (SH)-based Automated ML (AutoML) method to automatically generate optimized ML models for dynamic networking environments. Experimental results illustrate that the proposed framework achieves high performance on the public Radio Frequency (RF) fingerprinting dataset and the Canadian Institute for Cybersecurity's CICIDS2017 dataset, showcasing its effectiveness in addressing PLA and CLIDS tasks within dynamic and complex networking environments. Furthermore, the paper explores open challenges and research directions in the 5G/6G cybersecurity domain. This framework represents a significant advancement towards fully autonomous and secure 6G networks, paving the way for future innovations in network automation and cybersecurity.
中文: 本文提出一种采用漂移自适应在线学习和增强型AutoML的自动化安全框架,以解决5G/6G网络中的网络安全挑战,在物理层认证和跨层入侵检测方面实现高性能,并探讨了未来研究方向。
English: This paper proposes an automated security framework using drift-adaptive online learning and enhanced AutoML to address cybersecurity challenges in 5G/6G networks, achieving high performance in physical layer authentication and cross-layer intrusion detection while exploring future research directions.
Authors:Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang
Abstract:
In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as "false negatives", leading to their embeddings being falsely pushed apart. To address this issue, we introduce GloFND, an optimization-based approach that automatically learns, on the fly, a per-anchor threshold for identifying its false negatives during training. In contrast to previous methods for false negative discovery, our approach globally detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method. Our implementation is available at https://github.com/vibalcam/GloFND.
中文: 本文提出GloFND方法,通过自适应学习每个锚点的阈值来全局识别自监督对比学习中的假阴性样本,有效避免语义相似样本被错误分离,且计算成本不随数据集规模增加。
English: This paper introduces GloFND, an optimization-based method that globally identifies false negatives in self-supervised contrastive learning by adaptively learning per-anchor thresholds, effectively preventing semantically similar pairs from being incorrectly separated while maintaining computational efficiency independent of dataset size.
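A stripped-down version of the filtering step (with the per-anchor thresholds simply given as a tensor rather than learned globally, as GloFND does) might look like the following; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_fn_filter(anchor, positive, negatives, thresholds, tau=0.1):
    # anchor, positive: (B, d); negatives: (B, N, d); thresholds: (B,)
    a = F.normalize(anchor, dim=-1)
    pos_sim = (a * F.normalize(positive, dim=-1)).sum(-1) / tau            # (B,)
    neg_sim = torch.einsum("bd,bnd->bn", a, F.normalize(negatives, dim=-1)) / tau
    # Mask out suspected false negatives (too similar to the anchor).
    keep = neg_sim < (thresholds / tau).unsqueeze(1)
    neg_exp = (neg_sim.exp() * keep).sum(-1)
    return (-pos_sim + torch.log(pos_sim.exp() + neg_exp)).mean()

loss = contrastive_loss_with_fn_filter(
    torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 32, 64),
    thresholds=torch.full((8,), 0.8))
print(loss.item())
```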
Authors:Mingyuan Wu, Jize Jiang, Haozhen Zheng, Meitang Li, Zhaoheng Li, Beitong Tian, Bo Chen, Yongjoo Park, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Abstract:
Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scale, yet choosing the right VLM size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master-apprentice framework for collaborative inference between large and small VLMs. CoT manages high-quality query results from large VLMs (master) in a cache, which are then selected via novel multi-modal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely recognized and challenging general reasoning benchmarks, and show that CoT increases overall reasoning performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%. Our code is available at https://github.com/UIUC-MONET/Cache-of-Thoughts
Chinese: 本文提出Cache of Thought (CoT)框架,通过大型视觉语言模型的缓存响应来增强小型模型的推理能力,在相同预算下整体性能提升最高达7.7%。
English: The paper introduces Cache of Thought (CoT), a master-apprentice framework that uses cached responses from large VLMs to enhance the reasoning performance of smaller VLMs, achieving up to a 7.7% overall improvement under the same budget.
Authors:Keisuke Kamahori, Jungo Kasai, Noriyuki Kojima, Baris Kasikci
Abstract:
Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
Chinese Summary: LiteASR提出了一种针对ASR编码器的低秩压缩方案,通过主成分分析和优化自注意力机制,在保持转录精度的同时将编码器尺寸压缩超过50%,实现了准确性与效率的新帕累托前沿。
English Summary: LiteASR introduces a low-rank compression technique for ASR encoders that reduces model size by over 50% while maintaining transcription accuracy, establishing a new Pareto frontier for speech recognition efficiency.
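The core low-rank trick can be sketched as activation-guided factorization of a linear layer: PCA on calibration activations yields a basis $V_k$, and $y = xW^\top$ is approximated with two smaller matrix multiplications. The code below is illustrative only; LiteASR additionally reworks self-attention in the reduced dimensionality.

```python
import torch

def low_rank_factorize(W, calib_x, k):
    # W: (out_dim, in_dim); calib_x: (n_samples, in_dim) calibration activations
    x = calib_x - calib_x.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(x, full_matrices=False)
    Vk = Vh[:k].T                      # (in_dim, k): top-k principal directions
    return Vk, W @ Vk                  # store A = Vk and B = W Vk

W = torch.randn(256, 512)
calib = torch.randn(1000, 512)
Vk, WVk = low_rank_factorize(W, calib, k=64)
x = calib[:4]
y_approx = (x @ Vk) @ WVk.T            # two small matmuls instead of one large one
print(y_approx.shape)                  # torch.Size([4, 256])
```

The approximation is accurate to the extent that the activations concentrate in the top-k principal subspace, which is the low-rank property the abstract relies on.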
Authors:Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek
Abstract:
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited by optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
Chinese: Matryoshka稀疏自编码器(MSAE)通过分层架构同时优化重构质量与稀疏性,在CLIP等视觉语言模型中实现了最先进的性能,并能通过语义概念提取有效进行特征解释与控制。
English: The Matryoshka Sparse Autoencoder (MSAE) introduces a hierarchical architecture that simultaneously optimizes reconstruction quality and sparsity for vision-language models like CLIP, achieving state-of-the-art performance while enabling effective feature interpretation and control through semantic concept extraction.
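Code sketch: one way to realize the "hierarchical representations at multiple granularities" idea, training a sparse autoencoder whose loss sums reconstructions from nested prefixes of the latent code. The ReLU/L1 sparsity choice, prefix sizes, and loss weighting are assumptions for illustration and may differ from MSAE's objective.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MatryoshkaStyleSAE(nn.Module):
        """Sparse autoencoder trained at several nested dictionary sizes at once."""

        def __init__(self, d_model=768, d_dict=4096, prefixes=(512, 1024, 2048, 4096)):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_dict)
            self.decoder = nn.Linear(d_dict, d_model, bias=True)
            self.prefixes = prefixes

        def forward(self, x, l1_coef=1e-3):
            z = F.relu(self.encoder(x))                 # non-negative sparse code
            loss = 0.0
            for k in self.prefixes:                     # reconstruct from each nested prefix
                z_k = torch.zeros_like(z)
                z_k[:, :k] = z[:, :k]
                x_hat = self.decoder(z_k)
                loss = loss + F.mse_loss(x_hat, x) + l1_coef * z_k.abs().mean()
            return loss / len(self.prefixes)

    # Toy usage on random CLIP-sized activations.
    sae = MatryoshkaStyleSAE()
    acts = torch.randn(32, 768)
    print(sae(acts))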
Authors:Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
Abstract:
State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at https://github.com/Joana-Cabral/LISArD.
Chinese: 本文提出LISArD防御机制,通过在分类学习中将近似的扰动与干净图像嵌入的互相关矩阵对角化,无需对抗训练即可有效防御灰盒和白盒攻击,并保持计算效率。
English: This paper introduces LISArD, a defense mechanism that effectively protects models against gray-box and white-box attacks without adversarial training by approximating a cross-correlation matrix of embeddings from perturbed and clean images to a diagonal matrix during classification learning.
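Code sketch: the decorrelation objective described above, pushing the cross-correlation matrix of clean and perturbed embeddings toward the identity while training the classifier. This Barlow Twins-style instantiation and the loss weights are assumptions; the authors' exact target matrix and weighting may differ.

    import torch
    import torch.nn.functional as F

    def lisard_style_loss(clean_emb, pert_emb, logits, labels, off_diag_weight=5e-3):
        """Classification loss plus a term that diagonalizes the clean/perturbed
        embedding cross-correlation matrix (illustrative formulation)."""
        n, _ = clean_emb.shape
        a = (clean_emb - clean_emb.mean(0)) / (clean_emb.std(0) + 1e-6)
        b = (pert_emb - pert_emb.mean(0)) / (pert_emb.std(0) + 1e-6)
        c = (a.T @ b) / n                                   # (d, d) cross-correlation
        on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
        off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
        return F.cross_entropy(logits, labels) + on_diag + off_diag_weight * off_diag

    # Toy usage with random embeddings of a clean batch and its perturbed copy.
    clean, pert = torch.randn(64, 128), torch.randn(64, 128)
    logits, labels = torch.randn(64, 10), torch.randint(0, 10, (64,))
    print(lisard_style_loss(clean, pert, logits, labels))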
Authors:Jin Peng Zhou, Kaiwen Wang, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kilian Q. Weinberger, Kianté Brantley, Wen Sun
Abstract:
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
中文: 提出的$Q\sharp$算法采用基于价值的分布强化学习方法优化KL正则化强化学习,在数学推理基准上优于现有方法,同时保持理论保证和更小的KL散度。
English: The proposed $Q\sharp$ algorithm introduces a value-based approach using distributional reinforcement learning to optimize KL-regularized RL, outperforming existing methods in math reasoning while maintaining theoretical guarantees and smaller KL divergence.
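Code sketch: the guided-decoding step implied by the abstract. For KL-regularized RL, the optimal policy samples actions with probability proportional to the reference policy times exp(Q/eta), so at decoding time one can add Q-values (scaled by the regularization strength) to the reference log-probabilities. How the Q-values are produced (here random stand-ins) and the exact parameterization are assumptions.

    import torch

    def q_sharp_guided_logits(ref_logits, q_values, eta=1.0):
        """Optimal KL-regularized policy (sketch): pi*(a|s) proportional to
        pi_ref(a|s) * exp(Q(s, a) / eta)."""
        return torch.log_softmax(ref_logits, dim=-1) + q_values / eta

    # Toy usage: reweight a reference model's next-token distribution with Q estimates.
    vocab = 8
    ref_logits = torch.randn(vocab)
    q_values = torch.randn(vocab)           # would come from a learned distributional Q model
    guided = torch.softmax(q_sharp_guided_logits(ref_logits, q_values, eta=0.5), dim=-1)
    print(guided)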
Authors:Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, Trong Nghia Hoang
Abstract:
Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Due to its widespread use, being able to estimate and calibrate its modeling uncertainty is important to understand and design robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches have to confine the transformers to the space of symmetric attention to ensure the necessary symmetric requirement of their GP's kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as cross-covariance between two correlated GPs (CGPs). This allows asymmetries in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. The code for our experiments is available at https://github.com/MinhLong210/CGP-Transformers.
Chinese: 提出的相关高斯过程变换器(CGPT)通过使用相关高斯过程建模自注意力机制,克服了以往基于高斯过程的变换器必须保持对称性的限制,实现了非对称注意力并提升了模型表达能力,同时通过稀疏近似保持了良好的可扩展性。
English: The proposed Correlated Gaussian Process Transformer (CGPT) overcomes the symmetry limitations of previous GP-based transformers by using correlated Gaussian processes for self-attention, enabling asymmetric attention and improved representation capacity while maintaining scalability through sparse approximation.
Authors:Tianyi Lorena Yan, Robin Jia
Abstract:
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets, models, and prompt templates, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.
中文: 语言模型采用“先促进后抑制”机制,通过注意力与多层感知机先召回全部事实答案再抑制已生成内容,这一机制经Token Lens和敲除分析等实验方法得到验证。
English: Language models employ a promote-then-suppress mechanism, using attention and MLPs to first recall all factual answers and then suppress previously generated ones, as validated through experimental techniques like Token Lens and knockout analysis.
Authors:Yiheng Liu, Xiaohui Gao, Haiyang Sun, Bao Ge, Tianming Liu, Junwei Han, Xintao Hu
Abstract:
In recent years, the rapid advancement of large language models (LLMs) in natural language processing has sparked significant interest among researchers to understand their mechanisms and functional characteristics. Although existing studies have attempted to explain LLM functionalities by identifying and interpreting specific neurons, these efforts mostly focus on individual neuron contributions, neglecting the fact that human brain functions are realized through intricate interaction networks. Inspired by cognitive neuroscience research on functional brain networks (FBNs), this study introduces a novel approach to investigate whether similar functional networks exist within LLMs. We use methods similar to those in the field of functional neuroimaging analysis to locate and identify functional networks in LLM. Experimental results show that, similar to the human brain, LLMs contain functional networks that frequently recur during operation. Further analysis shows that these functional networks are crucial for LLM performance. Masking key functional networks significantly impairs the model's performance, while retaining just a subset of these networks is adequate to maintain effective operation. This research provides novel insights into the interpretation of LLMs and the lightweighting of LLMs for certain downstream tasks. Code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
Chinese: 本研究受脑功能网络启发,提出了一种识别大语言模型中重复出现的功能网络的新方法,揭示了这些网络对模型性能的关键作用及其在模型轻量化方面的潜力。
English: This study introduces a novel approach inspired by functional brain networks to identify recurring functional networks in large language models, demonstrating their critical role in model performance and potential for model lightweighting.
Authors:Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu
Abstract:
Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors -- highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.
中文: 本文提出了一种实用的仿真到现实强化学习方法,使人形机器人能够执行复杂的基于视觉的双臂操作任务,在面对未见物体时表现出高成功率和强大的适应能力。
English: This paper presents a practical sim-to-real reinforcement learning method that enables humanoid robots to perform complex vision-based bimanual manipulation tasks with high success rates and robust adaptability to unseen objects.
Authors:Lujie Yang, H. J. Terry Suh, Tong Zhao, Bernhard Paus Graesdal, Tarik Kelestemur, Jiuguang Wang, Tao Pang, Russ Tedrake
Abstract:
We present a low-cost data generation pipeline that integrates physics-based simulation, human demonstrations, and model-based planning to efficiently generate large-scale, high-quality datasets for contact-rich robotic manipulation tasks. Starting with a small number of embodiment-flexible human demonstrations collected in a virtual reality simulation environment, the pipeline refines these demonstrations using optimization-based kinematic retargeting and trajectory optimization to adapt them across various robot embodiments and physical parameters. This process yields a diverse, physically consistent dataset that enables cross-embodiment data transfer, and offers the potential to reuse legacy datasets collected under different hardware configurations or physical parameters. We validate the pipeline's effectiveness by training diffusion policies from the generated datasets for challenging contact-rich manipulation tasks across multiple robot embodiments, including a floating Allegro hand and bimanual robot arms. The trained policies are deployed zero-shot on hardware for bimanual iiwa arms, achieving high success rates with minimal human input. Project website: https://lujieyang.github.io/physicsgen/.
Authors:Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $μ$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $μ$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $μ$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
Chinese: 提出的$μ$Code方法通过使用单步奖励,训练生成器和验证器基于执行反馈迭代改进代码解决方案,为多轮代码生成提供了一种简单且可扩展的途径,相比现有方法取得了显著性能提升。
English: The proposed $μ$Code method introduces a simple and scalable approach to multi-turn code generation by using single-step rewards, training both a generator and a verifier to iteratively improve code solutions based on execution feedback, achieving significant performance gains over existing methods.
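Code sketch: one iteration pattern for the generator/verifier loop described above, using best-of-N candidates scored against execution feedback. The generator, verifier, and test harness below are toy stand-ins for the learned components, not the $μ$Code implementation.

    import random

    def run_tests(code):
        """Toy execution feedback: a hidden reference check on a single input."""
        try:
            namespace = {}
            exec(code, namespace)
            ok = namespace["square"](7) == 49
            return ok, "" if ok else "wrong output for input 7"
        except Exception as exc:                     # toy sandbox, catch everything
            return False, str(exc)

    def generator(prompt, feedback, n=4):
        """Stand-in for the learned generator: proposes n candidate programs."""
        candidates = ["def square(x):\n    return x * x",
                      "def square(x):\n    return x + x",
                      "def square(x):\n    return x ** 2",
                      "def square(x)\n    return x * x"]   # one has a syntax error
        return random.sample(candidates, n)

    def verifier(code, feedback):
        """Stand-in for the learned verifier: scores a candidate (higher is better)."""
        return 1.0 if feedback == "" else 0.0

    prompt, feedback = "Write square(x).", ""
    for turn in range(3):                            # multi-turn refinement loop
        scored = []
        for cand in generator(prompt, feedback):
            ok, cand_feedback = run_tests(cand)
            scored.append((verifier(cand, cand_feedback), ok, cand_feedback, cand))
        score, ok, feedback, best = max(scored)
        if ok:
            print(f"solved at turn {turn}:\n{best}")
            break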
Authors:Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P. Gomes, Kilian Q. Weinberger
Abstract:
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.
中文: PhantomWiki是一种创新的流程,可按需生成独特的文档库和问答对,用于评估大型语言模型,有效解决数据泄露问题,并能够分别评估推理和检索能力。
English: PhantomWiki is a novel pipeline that generates unique, on-demand document corpora and question-answer pairs to evaluate large language models, effectively addressing data leakage and enabling disentangled assessment of reasoning and retrieval capabilities.
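Code sketch: a tiny illustration of the on-demand generation idea, fabricating a fresh universe of facts and emitting both documents and one-hop questions. The entities, templates, and question types are invented for illustration and are far simpler than PhantomWiki's pipeline.

    import random

    def generate_phantom_corpus(num_people=5, seed=0):
        """Fabricate facts, then emit documents plus question-answer pairs."""
        rng = random.Random(seed)
        names = [f"Person{i}" for i in range(num_people)]
        jobs = ["baker", "pilot", "botanist", "welder", "cartographer"]
        facts = {n: {"job": rng.choice(jobs),
                     "friend": rng.choice([m for m in names if m != n])}
                 for n in names}
        docs = [f"{n} works as a {f['job']} and is a friend of {f['friend']}."
                for n, f in facts.items()]
        qa = [(f"What is the job of {f['friend']}, the friend of {n}?",  # one-hop question
               facts[f["friend"]]["job"])
              for n, f in facts.items()]
        return docs, qa

    docs, qa = generate_phantom_corpus()
    print(docs[0])
    print(qa[0])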
Authors:Yongjia Lei, Haoyu Han, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka, Mahantesh M Halappanavar, Jiliang Tang, Yu Wang
Abstract:
Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. However, current retrieval methods often retrieve these two types of knowledge in isolation without considering their mutual reinforcement and some hybrid methods even bypass structural retrieval entirely after neighboring aggregation. To fill in this gap, we propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework. In the Planning stage, MoR generates textual planning graphs delineating the logic for answering queries. Following planning graphs, in the Reasoning stage, MoR interweaves structural traversal and textual matching to obtain candidates from TG-KBs. In the Organizing stage, MoR further reranks fetched candidates based on their structural trajectory. Extensive experiments demonstrate the superiority of MoR in harmonizing structural and textual retrieval with insights, including uneven retrieving performance across different query logics and the benefits of integrating structural trajectories for candidate reranking. Our code is available at https://github.com/Yoega/MoR.
Chinese: 提出的结构-文本混合检索框架通过规划、推理和组织三个阶段,将图结构遍历与文本匹配相结合,有效提升了富文本图知识库的检索效果,展现出在协调结构性和文本性知识方面的优越性能。
English: The proposed Mixture of Structural-and-Textual Retrieval (MoR) framework integrates structural traversal and textual matching through planning, reasoning, and organizing stages to enhance retrieval from Text-rich Graph Knowledge Bases, demonstrating superior performance in harmonizing both knowledge types.
Authors:Qingsen Yan, Yixu Feng, Cheng Zhang, Guansong Pang, Kangbiao Shi, Peng Wu, Wei Dong, Jinqiu Sun, Yanning Zhang
Abstract:
Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods are based on standard RGB (sRGB) space, which often produce color bias and brightness artifacts due to inherent high color sensitivity in sRGB. While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.
中文摘要:提出的HVI色彩空间与CIDNet相结合,能有效消除低光图像增强中的红黑伪影,在多个数据集上实现卓越性能。
English Summary: The proposed HVI color space combined with CIDNet effectively eliminates red and black artifacts in low-light image enhancement, achieving superior performance across multiple datasets.
Authors:Mattéo Clémot, Julie Digne, Julien Tierny
Abstract:
This paper presents a novel topology-aware dimensionality reduction approach aiming at accurately visualizing the cyclic patterns present in high dimensional data. To that end, we build on the Topological Autoencoders (TopoAE) formulation. First, we provide a novel theoretical analysis of its associated loss and show that a zero loss indeed induces identical persistence pairs (in high and low dimensions) for the $0$-dimensional persistent homology (PH$^0$) of the Rips filtration. We also provide a counter example showing that this property no longer holds for a naive extension of TopoAE to PH$^d$ for $d\ge 1$. Based on this observation, we introduce a novel generalization of TopoAE to $1$-dimensional persistent homology (PH$^1$), called TopoAE++, for the accurate generation of cycle-aware planar embeddings, addressing the above failure case. This generalization is based on the notion of cascade distortion, a new penalty term favoring an isometric embedding of the $2$-chains filling persistent $1$-cycles, hence resulting in more faithful geometrical reconstructions of the $1$-cycles in the plane. We further introduce a novel, fast algorithm for the exact computation of PH for Rips filtrations in the plane, yielding improved runtimes over previously documented topology-aware methods. Our method also achieves a better balance between the topological accuracy, as measured by the Wasserstein distance, and the visual preservation of the cycles in low dimensions. Our C++ implementation is available at https://github.com/MClemot/TopologicalAutoencodersPlusPlus.
中文: 本文提出了TopoAE++,一种改进的拓扑感知降维方法,通过将拓扑自编码器推广至一维持续性同调并引入级联失真惩罚项,能精确可视化高维数据中的循环模式,在提升计算效率的同时更好地平衡了拓扑精度与视觉保持。
English: This paper introduces TopoAE++, an enhanced topology-aware dimensionality reduction method that accurately visualizes cyclic patterns in high-dimensional data by generalizing Topological Autoencoders to 1-dimensional persistent homology with a novel cascade distortion penalty, achieving improved runtime and better balance between topological accuracy and visual preservation.
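Code sketch: the baseline TopoAE PH^0 term that the paper analyzes (not the new PH^1 cascade-distortion penalty). For a Rips filtration, the 0-dimensional persistence pairs correspond to minimum-spanning-tree edges, and the loss matches the lengths of those critical edges between the input space and the embedding, in both directions. This NumPy version is non-differentiable and purely illustrative.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def ph0_pairs(dist):
        """PH^0 persistence pairs of a Rips filtration = edges of the MST."""
        mst = minimum_spanning_tree(dist).toarray()
        return np.argwhere(mst > 0)                       # (n-1, 2) edge index pairs

    def topoae_ph0_loss(x_high, x_low):
        """Distances of persistence-critical edges should agree between the
        input space and the latent space, in both directions."""
        d_high = squareform(pdist(x_high))
        d_low = squareform(pdist(x_low))
        loss = 0.0
        for pairs, d_src, d_tgt in [(ph0_pairs(d_high), d_high, d_low),
                                    (ph0_pairs(d_low), d_low, d_high)]:
            i, j = pairs[:, 0], pairs[:, 1]
            loss += np.mean((d_src[i, j] - d_tgt[i, j]) ** 2)
        return loss

    # Toy usage: a noisy 3D circle and a 2D embedding of it.
    theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
    x_high = np.c_[np.cos(theta), np.sin(theta), 0.1 * np.random.randn(50)]
    x_low = x_high[:, :2] + 0.05 * np.random.randn(50, 2)
    print(topoae_ph0_loss(x_high, x_low))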
Authors:Zhouyu He, Peng Qiao, Rongchun Li, Yong Dou, Yusong Tan
Abstract:
As the demands for superior agents grow, the training complexity of Deep Reinforcement Learning (DRL) becomes higher. Thus, accelerating training of DRL has become a major research focus. Dividing the DRL training process into subtasks and using parallel computation can effectively reduce training costs. However, current DRL training systems lack sufficient parallelization due to data assignment between subtask components. This assignment issue has been ignored, but addressing it can further boost training efficiency. Therefore, we propose a high-throughput distributed RL training system called TianJi. It relaxes assignment dependencies between subtask components and enables event-driven asynchronous communication. Meanwhile, TianJi maintains clear boundaries between subtask components. To address convergence uncertainty from relaxed assignment dependencies, TianJi proposes a distributed strategy based on the balance of sample production and consumption. The strategy controls the staleness of samples to correct their quality, ensuring convergence. We conducted extensive experiments. TianJi achieves a convergence time acceleration ratio of up to 4.37 compared to related comparison systems. When scaled to eight computational nodes, TianJi shows a convergence time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, demonstrating its capability to accelerate training and scalability. In data transmission efficiency experiments, TianJi significantly outperforms other systems, approaching hardware limits. TianJi also shows effectiveness in on-policy algorithms, achieving convergence time acceleration ratios of 4.36 and 2.95 compared to RLlib and XingTian. TianJi is accessible at https://github.com/HiPRL/TianJi.git.
中文: 提出的TianJi系统通过解除子任务间的分配依赖并采用事件驱动异步通信,将深度强化学习训练加速高达4.37倍,同时保持系统可扩展性和传输效率。
English: The proposed TianJi system accelerates Deep Reinforcement Learning training by relaxing assignment dependencies between subtasks and implementing event-driven asynchronous communication, achieving up to 4.37 times faster convergence while maintaining scalability and transmission efficiency.
Authors:Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin
Abstract:
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. Our key contributions are: (1) We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit (a subset of model components, responsible for tracking the world state), indicating that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three challenging settings: skipping intermediate steps, introducing data noises, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSAs), highlighting its resilience in challenging scenarios. Our code is available at https://github.com/IvanChangPKU/FSA.
中文: 思维链(CoT)通过激活后层MLP神经元形成隐式有限状态自动机,显著增强Transformer模型的性能,并在复杂场景中展现出强大鲁棒性。
English: Chain-of-thought (CoT) boosts Transformer model performance by enabling implicit finite state automata through late-layer MLP neurons, demonstrating robustness in challenging scenarios.
Authors:Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun
Abstract:
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning
中文: 思维链推理常产生冗余标记,但通过基于自生成简洁路径的微调,模型能在保持准确性的同时平均减少30%的输出长度。
English: Chain-of-thought reasoning in LLMs often produces redundant tokens, but by fine-tuning with self-generated concise paths, models can reduce output length by 30% while maintaining accuracy.
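Code sketch: the best-of-N data construction described above, keeping the shortest correct trace per problem for fine-tuning. The sampler and grader are toy stand-ins, and the few-shot conditioning variant is not shown.

    import random

    def build_concise_sft_set(problems, sample_fn, is_correct_fn, n=8):
        """For each problem, sample N reasoning traces, keep only correct ones,
        and fine-tune on the shortest."""
        sft_examples = []
        for problem in problems:
            candidates = [sample_fn(problem) for _ in range(n)]
            correct = [c for c in candidates if is_correct_fn(problem, c)]
            if correct:
                sft_examples.append((problem, min(correct, key=len)))
        return sft_examples

    # Toy usage with a fake sampler that varies the trace length.
    def fake_sample(problem):
        return "step. " * random.randint(2, 12) + "Answer: 42"

    def fake_check(problem, trace):
        return trace.endswith("42")

    print(build_concise_sft_set(["What is 6*7?"], fake_sample, fake_check)[0][1])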
Authors:Guannan Lai, Yujie Li, Xiangkun Wang, Junbo Zhang, Tianrui Li, Xin Yang
Abstract:
Class Incremental Learning (CIL) aims to enable models to learn new classes sequentially while retaining knowledge of previous ones. Although current methods have alleviated catastrophic forgetting (CF), recent studies highlight that the performance of CIL models is highly sensitive to the order of class arrival, particularly when sequentially introduced classes exhibit high inter-class similarity. To address this critical yet understudied challenge of class order sensitivity, we first extend existing CIL frameworks through theoretical analysis, proving that grouping classes with lower pairwise similarity during incremental phases significantly improves model robustness to order variations. Building on this insight, we propose Graph-Driven Dynamic Similarity Grouping (GDDSG), a novel method that employs graph coloring algorithms to dynamically partition classes into similarity-constrained groups. Each group trains an isolated CIL sub-model and constructs meta-features for class group identification. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is available at https://github.com/AIGNLAI/GDDSG.
中文: 本文提出图驱动的动态相似性分组方法(GDDSG),通过图着色算法在增量学习中动态划分相似性约束的类别组,有效解决了类别顺序敏感性问题,同时保持了高精度和抗遗忘能力。
English: This paper introduces Graph-Driven Dynamic Similarity Grouping (GDDSG), a novel method that uses graph coloring to dynamically group classes by similarity during incremental learning, effectively mitigating class order sensitivity while maintaining high accuracy and anti-forgetting performance.
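Code sketch: the grouping step implied by the abstract, connecting classes whose pairwise similarity exceeds a threshold and using greedy graph coloring so that each color class contains only mutually dissimilar classes. The threshold, coloring strategy, and random similarity matrix are assumptions for illustration.

    import networkx as nx
    import numpy as np

    def group_classes_by_dissimilarity(similarity, threshold=0.5):
        """Color classes so that no two 'too similar' classes share a group."""
        num_classes = similarity.shape[0]
        g = nx.Graph()
        g.add_nodes_from(range(num_classes))
        for i in range(num_classes):
            for j in range(i + 1, num_classes):
                if similarity[i, j] > threshold:
                    g.add_edge(i, j)                 # too similar to share a group
        coloring = nx.coloring.greedy_color(g, strategy="largest_first")
        groups = {}
        for cls, color in coloring.items():
            groups.setdefault(color, []).append(cls)
        return list(groups.values())

    # Toy usage: a random symmetric similarity matrix over 8 classes.
    rng = np.random.default_rng(0)
    sim = rng.uniform(size=(8, 8))
    sim = (sim + sim.T) / 2
    print(group_classes_by_dissimilarity(sim))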
Authors:Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao
Abstract:
In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation--ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, one-hop reasoned, and relation-reversed data. To rigorously evaluate generalisation, we introduce UGBench, the first comprehensive benchmark specifically designed to assess the unlearning of in-scope implicit knowledge covering 13 state-of-the-art methods across three datasets. UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found in https://github.com/MaybeLizzy/UGBench.
中文摘要:本文提出PerMU这一基于概率扰动的新型遗忘方法,通过模拟对抗性遗忘样本来消除显性目标数据及相关隐性知识,从而显著提升大语言模型的广义知识遗忘能力。
English Summary: This paper introduces PerMU, a novel probability perturbation-based unlearning method designed to enhance generalized knowledge forgetting in large language models by eliminating both explicit target data and related implicit knowledge through adversarial sample simulation.
Authors:Xinghao Wang, Feng Liu, Rui Su, Zhihui Wang, Lihua Fang, Lianqing Zhou, Lei Bai, Wanli Ouyang
Abstract:
Recent advances in deep learning have revolutionized seismic monitoring, yet developing a foundation model that performs well across multiple complex tasks remains challenging, particularly when dealing with degraded signals or data scarcity. This work presents SeisMoLLM, the first foundation model that utilizes cross-modal transfer for seismic monitoring, to unleash the power of large-scale pre-training from a large language model without requiring direct pre-training on seismic datasets. Through elaborate waveform tokenization and fine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art performance on the DiTing and STEAD datasets across five critical tasks: back-azimuth estimation, epicentral distance estimation, magnitude estimation, phase picking, and first-motion polarity classification. It attains 36 best results out of 43 task metrics and 12 top scores out of 16 few-shot generalization metrics, with many relative improvements ranging from 10% to 50%. In addition to its superior performance, SeisMoLLM maintains efficiency comparable to or even better than lightweight models in both training and inference. These findings establish SeisMoLLM as a promising foundation model for practical seismic monitoring and highlight cross-modal transfer as an exciting new direction for earthquake studies, showcasing the potential of advanced deep learning techniques to propel seismology research forward.
Chinese: SeisMoLLM作为首个利用预训练GPT-2进行跨模态迁移的地震监测基础模型,在五项关键任务中实现最优性能,无需直接地震数据预训练即展现出显著的效果提升和运行效率。
English: SeisMoLLM is a pioneering foundation model that leverages cross-modal transfer from a pre-trained GPT-2 to achieve state-of-the-art performance across five key seismic monitoring tasks, demonstrating significant improvements in accuracy and efficiency without requiring direct seismic data pre-training.
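Code sketch: the cross-modal transfer recipe at its simplest, patching a waveform, projecting patches into the language model's embedding space, running a GPT-2 trunk, and regressing a scalar target such as magnitude. The patch length, pooling, and head are illustrative assumptions, and SeisMoLLM's actual tokenization is more elaborate; a randomly initialized GPT2Config is used here only to keep the example self-contained.

    import torch
    import torch.nn as nn
    from transformers import GPT2Config, GPT2Model

    class WaveformToGPT2(nn.Module):
        """Patch a seismic waveform, feed it to a GPT-2 trunk, regress a scalar."""

        def __init__(self, patch_len=100, d_model=768):
            super().__init__()
            self.patch_len = patch_len
            self.tokenizer = nn.Linear(patch_len, d_model)        # waveform "tokenizer"
            self.backbone = GPT2Model(GPT2Config(n_embd=d_model)) # pretrained weights in practice
            self.head = nn.Linear(d_model, 1)

        def forward(self, wave):                                  # wave: (B, T)
            b, t = wave.shape
            patches = wave[:, : t - t % self.patch_len].reshape(b, -1, self.patch_len)
            tokens = self.tokenizer(patches)                      # (B, num_patches, d_model)
            hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
            return self.head(hidden.mean(dim=1)).squeeze(-1)

    model = WaveformToGPT2()
    print(model(torch.randn(2, 3000)).shape)                      # torch.Size([2])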
Authors:Yuan-Chih Yang, Hung-Hsuan Chen
Abstract:
Dropout and DropConnect are well-known techniques that apply a consistent drop rate to randomly deactivate neurons or edges in a neural network layer during training. This paper introduces a novel methodology that assigns dynamic drop rates to each edge within a layer, uniquely tailoring the dropping process without incorporating additional learning parameters. We perform experiments on synthetic and openly available datasets to validate the effectiveness of our approach. The results demonstrate that our method outperforms Dropout, DropConnect, and Standout, a classic mechanism known for its adaptive dropout capabilities. Furthermore, our approach improves the robustness and generalization of neural network training without increasing computational complexity. The complete implementation of our methodology is publicly accessible for research and replication purposes at https://github.com/ericabd888/Adjusting-the-drop-probability-in-DropConnect-based-on-the-magnitude-of-the-gradient/.
中文摘要:本文提出了一种新方法,可在神经网络层中为每条边动态调整丢弃率,相比Dropout和DropConnect等现有技术,该方法在不增加计算复杂度的前提下显著提升了模型的鲁棒性和泛化能力。
English Summary: This paper introduces a novel method that dynamically adjusts drop rates for each edge in neural network layers, outperforming existing techniques like Dropout and DropConnect in robustness and generalization without added computational cost.
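Code sketch: a DropConnect layer with per-edge drop rates driven by running gradient magnitudes, in the spirit of the repository name above. The direction and form of the mapping from gradient magnitude to drop probability are illustrative assumptions, not the authors' exact rule.

    import torch
    import torch.nn as nn

    class GradientAwareDropConnect(nn.Module):
        """DropConnect with a per-edge drop rate derived from recent gradient magnitudes."""

        def __init__(self, in_features, out_features, base_rate=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
            self.bias = nn.Parameter(torch.zeros(out_features))
            self.base_rate = base_rate
            self.register_buffer("grad_mag", torch.zeros(out_features, in_features))

        def update_drop_stats(self):
            if self.weight.grad is not None:                       # call after backward()
                self.grad_mag = 0.9 * self.grad_mag + 0.1 * self.weight.grad.abs()

        def forward(self, x):
            if not self.training:
                return x @ self.weight.T + self.bias
            # Normalize gradient magnitudes to [0, 1]; larger gradient -> smaller drop rate.
            norm = self.grad_mag / (self.grad_mag.max() + 1e-8)
            drop_prob = self.base_rate * (1.0 - norm)
            mask = torch.bernoulli(1.0 - drop_prob)
            return x @ (self.weight * mask).T + self.bias

    layer = GradientAwareDropConnect(32, 16)
    layer(torch.randn(4, 32)).sum().backward()
    layer.update_drop_stats()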
Authors:Yung-Peng Hsu, Hung-Hsuan Chen
Abstract:
Clustering is essential in data analysis and machine learning, but traditional algorithms like $k$-means and Gaussian Mixture Models (GMM) often fail with nonconvex clusters. To address the challenge, we introduce the Flexible Bivariate Beta Mixture Model (FBBMM), which utilizes the flexibility of the bivariate beta distribution to handle diverse and irregular cluster shapes. Using the Expectation Maximization (EM) algorithm and Sequential Least Squares Programming (SLSQP) optimizer for parameter estimation, we validate FBBMM on synthetic and real-world datasets, demonstrating its superior performance in clustering complex data structures, offering a robust solution for big data analytics across various domains. We release the experimental code at https://github.com/yung-peng/MBMM-and-FBBMM.
中文摘要:灵活双变量贝塔混合模型(FBBMM)通过利用双变量贝塔分布的灵活性,有效处理非凸聚类问题,在合成和真实数据集上均展现出优于传统聚类算法的性能。
English Summary: The Flexible Bivariate Beta Mixture Model (FBBMM) overcomes limitations of traditional clustering methods by handling irregular cluster shapes through bivariate beta distributions and demonstrates superior performance on complex datasets.
Authors:Berken Utku Demirel, Christian Holz
Abstract:
Deep learning models lack shift invariance, making them sensitive to input shifts that cause changes in output. While recent techniques seek to address this for images, our findings show that these approaches fail to provide shift-invariance in time series, where the data generation mechanism is more challenging due to the interaction of low and high frequencies. Worse, they also decrease performance across several tasks. In this paper, we propose a novel differentiable bijective function that maps samples from their high-dimensional data manifold to another manifold of the same dimension, without any dimensional reduction. Our approach guarantees that samples -- when subjected to random shifts -- are mapped to a unique point in the manifold while preserving all task-relevant information without loss. We theoretically and empirically demonstrate that the proposed transformation guarantees shift-invariance in deep learning models without imposing any limits to the shift. Our experiments on six time series tasks with state-of-the-art methods show that our approach consistently improves the performance while enabling models to achieve complete shift-invariance without modifying or imposing restrictions on the model's topology. The source code is available on \href{https://github.com/eth-siplab/Shifting-the-Paradigm}{GitHub}.
Chinese Summary: 本文提出一种新颖的可微分双射函数,通过将样本映射到同维流形,使深度学习模型在不损失性能或改变架构的前提下,实现对时间序列数据的完全平移不变性。
English Summary: This paper introduces a novel differentiable bijective function that ensures deep learning models achieve complete shift-invariance for time series data without performance loss or architectural constraints.
Authors:Marco Pleines, Daniel Addis, David Rubinstein, Frank Zimmer, Mike Preuss, Peter Whidden
Abstract:
Pokémon Red, a classic Game Boy JRPG, presents significant challenges as a testbed for agents, including multi-tasking, long horizons of tens of thousands of steps, hard exploration, and a vast array of potential policies. We introduce a simplistic environment and a Deep Reinforcement Learning (DRL) training methodology, demonstrating a baseline agent that completes an initial segment of the game up to completing Cerulean City. Our experiments include various ablations that reveal vulnerabilities in reward shaping, where agents exploit specific reward signals. We also discuss limitations and argue that games like Pokémon hold strong potential for future research on Large Language Model agents, hierarchical training algorithms, and advanced exploration methods. Source Code: https://github.com/MarcoMeter/neroRL/tree/poke_red
中文: 该研究为《宝可梦红》设计了一个简化环境与深度强化学习方法,展示了智能体在游戏初期的进展,揭示了奖励塑造的脆弱性,并强调了此类游戏在未来人工智能研究中的潜力。
English: The study presents a simplistic environment and Deep Reinforcement Learning method for Pokémon Red, demonstrating a baseline agent's initial game progress while revealing reward shaping vulnerabilities and highlighting the game's potential for future AI research.
Authors:Nikolay Blagoev, Lydia Yiyu Chen, Oğuzhan Ersoy
Abstract:
Data and pipeline parallelism are ubiquitous for training of Large Language Models (LLM) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangement for end to end training. Motivated by LLM's resistance to layer skipping and layer reordering, in this paper, we explore stage (several consecutive layers) skipping in pipeline training, and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipping and swapping pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework to reduce the end-to-end training time for LLMs while preserving the convergence. The core of SkipPipe is a path scheduling algorithm that optimizes the paths for individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, complying with the given stage skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to $55\%$ compared to full pipeline. Our partial pipeline training also improves resistance to layer omission during inference, experiencing a drop in perplexity of only $7\%$ when running only half the model. Our code is available at https://github.com/gensyn-ai/skippipe.
Chinese: SkipPipe提出了一种部分流水线框架,通过启用阶段跳过和路径调度来优化大型语言模型的训练效率,在保持收敛性的同时将迭代时间减少高达55%。
English: SkipPipe introduces a partial pipeline framework that optimizes training efficiency for Large Language Models by enabling stage skipping and path scheduling, reducing iteration time by up to 55% while maintaining convergence.
Authors:Vidhi Lalchand, Anna-Christina Eilers
Abstract:
This work proposes a scalable probabilistic latent variable model based on Gaussian processes (Lawrence, 2004) in the context of multiple observation spaces. We focus on an application in astrophysics where data sets typically contain both observed spectral features and scientific properties of astrophysical objects such as galaxies or exoplanets. In our application, we study the spectra of very luminous galaxies known as quasars, along with their properties, such as the mass of their central supermassive black hole, accretion rate, and luminosity-resulting in multiple observation spaces. A single data point is then characterized by different classes of observations, each with different likelihoods. Our proposed model extends the baseline stochastic variational Gaussian process latent variable model (GPLVM) introduced by Lalchand et al. (2022) to this setting, proposing a seamless generative model where the quasar spectra and scientific labels can be generated simultaneously using a shared latent space as input to different sets of Gaussian process decoders, one for each observation space. Additionally, this framework enables training in a missing data setting where a large number of dimensions per data point may be unknown or unobserved. We demonstrate high-fidelity reconstructions of the spectra and scientific labels during test-time inference and briefly discuss the scientific interpretations of the results, along with the significance of such a generative model.
中文: 本研究提出了一种基于高斯过程的可扩展概率潜变量模型,通过共享潜空间处理多个观测空间,能够同时生成类星体光谱及其科学属性,并在训练中有效应对数据缺失问题。
English: This study introduces a scalable probabilistic latent variable model using Gaussian processes to handle multiple observation spaces, enabling simultaneous generation of quasar spectra and scientific properties through shared latent representations and accommodating missing data during training.
Authors:Jiacheng Ye, Zhenyu Wu, Jiahui Gao, Zhiyong Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong
Abstract:
In the post-AlphaGo era, there has been a renewed interest in search techniques such as Monte Carlo Tree Search (MCTS), particularly in their application to Large Language Models (LLMs). This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch , a model that does \textit{implicit search} by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search-based policies, along with a significant 540 Elo increase in game-playing strength assessment. These results indicate that implicit search via discrete diffusion is a viable alternative to explicit search over a one-step policy. All codes are publicly available at \href{https://github.com/HKUNLP/DiffuSearch}{https://github.com/HKUNLP/DiffuSearch}.
中文: 提出的DiffuSearch模型通过离散扩散实现隐式搜索,在象棋实验中显著提升了大型语言模型的规划能力,其表现优于无搜索和显式搜索方法。
English: The proposed DiffuSearch model employs discrete diffusion to perform implicit search, significantly enhancing planning capabilities in LLMs and outperforming both searchless and explicit search methods in chess experiments.
Authors:Hugo Lyons Keenan, Sarah Erfani, Christopher Leckie
Abstract:
Effective out-of-distribution (OOD) detection is crucial for the safe deployment of machine learning models in real-world scenarios. However, recent work has shown that OOD detection methods are vulnerable to adversarial attacks, potentially leading to critical failures in high-stakes applications. This discovery has motivated work on robust OOD detection methods that are capable of maintaining performance under various attack settings. Prior approaches have made progress on this problem but face a number of limitations: often only exhibiting robustness to attacks on OOD data or failing to maintain strong clean performance. In this work, we adapt an existing robust classification framework, TRADES, extending it to the problem of robust OOD detection and discovering a novel objective function. Recognising the critical importance of a strong clean/robust trade-off for OOD detection, we introduce an additional loss term which boosts classification and detection performance. Our approach, called HALO (Helper-based AdversariaL OOD detection), surpasses existing methods and achieves state-of-the-art performance across a number of datasets and attack settings. Extensive experiments demonstrate an average AUROC improvement of 3.15 in clean settings and 7.07 under adversarial attacks when compared to the next best method. Furthermore, HALO exhibits resistance to transferred attacks, offers tuneable performance through hyperparameter selection, and is compatible with existing OOD detection frameworks out-of-the-box, leaving open the possibility of future performance gains. Code is available at: https://github.com/hugo0076/HALO
中文: HALO方法通过引入新型损失函数扩展了TRADES框架,在多个数据集上实现了最先进的鲁棒分布外检测性能,在正常和对抗攻击场景下均展现出显著提升。
English: The HALO method extends the TRADES framework to robust out-of-distribution detection by introducing a novel loss term, achieving state-of-the-art performance with significant improvements in both clean and adversarial settings across multiple datasets.
Authors:Xinran Li, Xiaolu Wang, Chenjia Bai, Jun Zhang
Abstract:
In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links-a task that becomes increasingly complex as the number of agents grows-we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.
中文摘要:本研究提出ExpoComm协议,通过采用指数图拓扑结构实现多智能体间高效信息传播,结合记忆消息处理器提升决策能力,在大规模协作任务中展现出优越性能和零样本迁移能力。
English Summary: The study introduces ExpoComm, a scalable communication protocol for multi-agent reinforcement learning that uses exponential graph topology to enhance information sharing and decision-making among agents, demonstrating superior performance in large-scale cooperative tasks.
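Code sketch: the one-peer exponential topology underlying the protocol. At round t, agent i sends to agent (i + 2^(t mod log2 N)) mod N, so if every agent forwards what it has each round, information spreads to all N agents in O(log N) rounds with O(1) links per agent per round. ExpoComm's actual schedule and message processing may differ.

    import math

    def exponential_neighbors(agent_id, num_agents, step):
        """Return the peer(s) agent `agent_id` sends to at communication round `step`."""
        k = step % max(1, math.ceil(math.log2(num_agents)))
        return [(agent_id + 2 ** k) % num_agents]

    # With 16 agents, a message originating at agent 0 reaches everyone within 4 rounds,
    # assuming every agent forwards what it has received at each round.
    for step in range(4):
        print(step, [exponential_neighbors(i, 16, step) for i in range(4)])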
Authors:Qijie Xu, Defang Chen, Jiawei Chen, Siwei Lyu, Can Wang
Abstract:
The rise of diffusion models has significantly improved the fidelity and diversity of generated images. With numerous benefits, these advancements also introduce new risks. Diffusion models can be exploited to create high-quality Deepfake images, which poses challenges for image authenticity verification. In recent years, research on generalizable diffusion-generated image detection has grown rapidly. However, a comprehensive review of this topic is still lacking. To bridge this gap, we present a systematic survey of recent advances and classify them into two main categories: (1) data-driven detection and (2) feature-driven detection. Existing detection methods are further classified into six fine-grained categories based on their underlying principles. Finally, we identify several open challenges and envision some future directions, with the hope of inspiring more research work on this important topic. Reviewed works in this survey can be found at https://github.com/zju-pi/Awesome-Diffusion-generated-Image-Detection.
中文摘要:本综述系统梳理了扩散生成图像检测的最新进展,将其分为数据驱动和特征驱动两大类方法,旨在应对深度伪造对图像真实性带来的挑战,并指出了该领域未来研究的关键方向。
English Summary: This survey systematically reviews and classifies recent advances in detecting diffusion-generated deepfake images into data-driven and feature-driven methods, addressing the growing risks they pose to image authenticity while highlighting open challenges and future research directions.
Authors:Moo Jin Kim, Chelsea Finn, Percy Liang
Abstract:
Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($\pi_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.
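Code sketch: the OFT ingredients named above in their simplest form, a continuous action head that decodes a whole chunk of future actions in parallel from pooled VLM features and is trained with a plain L1 regression loss. The feature dimension, chunk length, and MLP head are illustrative assumptions, not OpenVLA-OFT's exact architecture.

    import torch
    import torch.nn as nn

    class ChunkedActionHead(nn.Module):
        """Predict a chunk of continuous actions in one parallel pass."""

        def __init__(self, d_model=4096, action_dim=7, chunk_len=8):
            super().__init__()
            self.chunk_len, self.action_dim = chunk_len, action_dim
            self.mlp = nn.Sequential(nn.Linear(d_model, 1024), nn.GELU(),
                                     nn.Linear(1024, chunk_len * action_dim))

        def forward(self, pooled_features):                       # (B, d_model)
            out = self.mlp(pooled_features)
            return out.view(-1, self.chunk_len, self.action_dim)  # (B, chunk, action_dim)

    head = ChunkedActionHead()
    features = torch.randn(2, 4096)                               # stand-in for VLM features
    pred_actions = head(features)
    target_actions = torch.rand(2, 8, 7)                          # normalized ground-truth chunk
    loss = torch.nn.functional.l1_loss(pred_actions, target_actions)
    loss.backward()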
Authors:Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan
Abstract:
High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This limitation hinders the ability to transfer models or knowledge learned from one sensor to another. To address this, we introduce a novel method for extracting Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer across optical tactile sensors. Our approach utilizes a transformer-based architecture trained on a diverse dataset of simulated sensor designs, allowing it to generalize to new sensors in the real world with minimal calibration. Experimental results demonstrate the method's effectiveness across various tactile sensing applications, facilitating data and model transferability for future advancements in the field.
Authors:Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis
Abstract:
Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.
Chinese Summary: 本文提出的ProtoFM方法将视觉基础模型与原型架构及专门训练目标相结合,仅需在冻结模型上训练轻量级头部即可实现高效可解释的分类,在保持竞争力的分类性能同时显著超越了现有模型的可解释性指标。
English Summary: This paper introduces ProtoFM, an efficient and interpretable method that combines visual foundation models with a prototypical architecture and specialized training objectives, achieving competitive classification performance while surpassing existing models in interpretability metrics.
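Code sketch: a lightweight prototypical head on top of frozen backbone features, where each logit decomposes into a weighted sum of similarities to learned concept vectors. The sizes, cosine similarity, and linear class weights are illustrative assumptions rather than ProtoFM's exact architecture or training objectives.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PrototypeHead(nn.Module):
        """Self-explainable head: logits are weighted sums of concept similarities."""

        def __init__(self, d_backbone=768, num_prototypes=64, num_classes=10):
            super().__init__()
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, d_backbone))
            self.class_weights = nn.Linear(num_prototypes, num_classes, bias=False)

        def forward(self, features):                               # frozen-VFM features (B, d)
            sims = F.cosine_similarity(features.unsqueeze(1),      # (B, P) concept activations
                                       self.prototypes.unsqueeze(0), dim=-1)
            logits = self.class_weights(sims)
            return logits, sims                                    # sims explain each prediction

    head = PrototypeHead()
    feats = torch.randn(4, 768)                                    # stand-in for frozen VFM output
    logits, concept_scores = head(feats)
    print(logits.shape, concept_scores.shape)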
Authors:Achille Nazaret, David Blei
Abstract:
The goal of causal discovery is to learn a directed acyclic graph from data. One of the most well-known methods for this problem is Greedy Equivalence Search (GES). GES searches for the graph by incrementally and greedily adding or removing edges to maximize a model selection criterion. It has strong theoretical guarantees on infinite data but can fail in practice on finite data. In this paper, we first identify some of the causes of GES's failure, finding that it can get blocked in local optima, especially in denser graphs. We then propose eXtremely Greedy Equivalent Search (XGES), which involves a new heuristic to improve the search strategy of GES while retaining its theoretical guarantees. In particular, XGES favors deleting edges early in the search over inserting edges, which reduces the possibility of the search ending in local optima. A further contribution of this work is an efficient algorithmic formulation of XGES (and GES). We benchmark XGES on simulated datasets with known ground truth. We find that XGES consistently outperforms GES in recovering the correct graphs, and it is 10 times faster. XGES implementations in Python and C++ are available at https://github.com/ANazaret/XGES.
Chinese: 本文提出XGES方法,通过优先删除边的启发式策略改进GES的搜索过程,在保持理论保证的同时有效避免局部最优,显著提升了因果图发现的准确性和计算效率。
English: This paper introduces XGES, an enhanced version of GES that uses a heuristic favoring edge deletion to avoid local optima, improving both accuracy and speed in causal discovery on finite data.
Authors:Yucheng Zhang, Beatrice Bevilacqua, Mikhail Galkin, Bruno Ribeiro
Abstract:
Fully inductive knowledge graph models can be trained on multiple domains and subsequently perform zero-shot knowledge graph completion (KGC) in new unseen domains. This is an important capability towards the goal of having foundation models for knowledge graphs. In this work, we introduce a more expressive and capable fully inductive model, dubbed TRIX, which not only yields strictly more expressive triplet embeddings (head entity, relation, tail entity) compared to state-of-the-art methods, but also introduces a new capability: directly handling both entity and relation prediction tasks in inductive settings. Empirically, we show that TRIX outperforms the state-of-the-art fully inductive models in zero-shot entity and relation predictions in new domains, and outperforms large-context LLMs in out-of-domain predictions. The source code is available at https://github.com/yuchengz99/TRIX.
Chinese: TRIX是一种完全归纳的知识图谱模型,在表达能力和性能上超越现有最优方法,在新领域的零样本实体和关系预测中表现卓越,并在跨领域预测任务中优于大型上下文语言模型。
English: TRIX is a fully inductive knowledge graph model that surpasses state-of-the-art methods in expressiveness and capability, excelling in zero-shot entity and relation predictions across new domains and outperforming large-context LLMs in out-of-domain tasks.
Authors:Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, Julian McAuley
Abstract:
In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLM's performance in both areas.
中文摘要:代码与推理在大语言模型中相互促进,代码提供结构化逻辑框架,推理通过规划和调试实现高级代码智能,未来研究将致力于强化这种协同作用以提升模型性能。
English Summary: Code and reasoning mutually enhance each other in large language models, with code providing structured logical frameworks and reasoning enabling complex code intelligence through planning and debugging, while future research aims to strengthen this synergy.
Authors:Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Abstract:
Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
中文: ImageChain通过将图像序列建模为多轮对话,增强了多模态大语言模型的顺序推理能力,在下一场景描述等任务中显著提升性能并实现强大的跨领域泛化。
English: ImageChain enhances multimodal large language models by modeling image sequences as multi-turn conversations, significantly improving sequential reasoning and achieving robust performance in tasks like next-scene description.
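A minimal sketch of how an image sequence can be laid out as a multi-turn dialogue for next-scene description, in the spirit of ImageChain. The message schema (an OpenAI-style chat format with image content parts) and the function name are assumptions for illustration, not the authors' data format.

```python
def build_image_chain_dialogue(frames, descriptions):
    """frames: image references; descriptions: texts for all frames except the last."""
    messages = [{"role": "system",
                 "content": "Describe the next scene given the visual story so far."}]
    for img, desc in zip(frames[:-1], descriptions):
        messages.append({"role": "user", "content": [{"type": "image", "image": img}]})
        messages.append({"role": "assistant", "content": desc})
    # final turn: the model must produce a context-aware description of the new scene
    messages.append({"role": "user", "content": [{"type": "image", "image": frames[-1]}]})
    return messages

dialogue = build_image_chain_dialogue(
    frames=["frame_0.png", "frame_1.png", "frame_2.png"],
    descriptions=["A robot picks up a red cube.", "It moves toward the blue bin."],
)
```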
Authors:Oğuzhan Ersoy, Jari Kolehmainen, Gabriel Passamani Andrade
Abstract:
Training dense LLMs requires enormous amounts of data and centralized compute, which introduces fundamental bottlenecks and ever-growing costs for large models. Several studies aim to reduce this dependency on centralization by reducing the communication overhead of training dense models. Taking this idea of reducing communication overhead to a natural extreme, by training embarrassingly parallelizable ensembles of small independent experts, has been shown to outperform large dense models trained in traditional centralized settings. However, existing studies do not take into account underlying differences amongst data domains and treat them as monolithic, regardless of their underlying complexity, size, or distribution. In this paper, we explore the effects of introducing heterogeneity to these ensembles of domain expert models. Specifically, by allowing models within the ensemble to vary in size--as well as the number of training steps taken depending on the training data's domain--we study the effect heterogeneity has on these ensembles when evaluated against domains included in, and excluded from, the training set. We use the same compute budget to train heterogeneous ensembles and homogeneous baselines for comparison. We show that the heterogeneous ensembles achieve the lowest perplexity scores in $20$ out of the $21$ data domains used in the evaluation. Our code is available at https://github.com/gensyn-ai/hdee.
中文: 研究表明,在相同计算预算下,根据数据领域调整模型规模和训练步长的异构专家集成模型,在大多数评估领域中比同构模型表现更优,困惑度更低。
English: This study demonstrates that heterogeneous ensembles of domain experts, which vary in model size and training steps based on data domain, outperform homogeneous models by achieving lower perplexity in most evaluated domains under the same computational budget.
Authors:Guoqing Chao, Kaixin Xu, Xijiong Xie, Yongyong Chen
Abstract:
Incomplete multi-view clustering has become one of the important research problems due to the extensive missing multi-view data in the real world. Although the existing methods have made great progress, there are still some problems: 1) most methods cannot effectively mine the information hidden in the missing data; 2) most methods typically divide representation learning and clustering into two separate stages, but this may affect the clustering performance as the clustering results directly depend on the learned representation. To address these problems, we propose a novel incomplete multi-view clustering method with hierarchical information transfer. Firstly, we design the view-specific Graph Convolutional Networks (GCN) to obtain the representation encoding the graph structure, which is then fused into the consensus representation. Secondly, considering that one layer of GCN transfers one-order neighbor node information, the global graph propagation with the consensus representation is proposed to handle the missing data and learn deep representation. Finally, we design a weight-sharing pseudo-classifier with contrastive learning to obtain an end-to-end framework that combines view-specific representation learning, global graph propagation with hierarchical information transfer, and contrastive clustering for joint optimization. Extensive experiments conducted on several commonly-used datasets demonstrate the effectiveness and superiority of our method in comparison with other state-of-the-art approaches. The code is available at https://github.com/KelvinXuu/GHICMC.
中文: 本文提出了一种新颖的基于层次信息传递的不完整多视图聚类方法,通过图卷积网络、全局图传播和对比学习构建端到端框架,在有效处理缺失数据的同时实现表示学习与聚类的联合优化。
English: This paper introduces a novel incomplete multi-view clustering method using hierarchical information transfer, which integrates graph convolutional networks, global graph propagation, and contrastive learning into an end-to-end framework to jointly optimize representation learning and clustering while effectively handling missing data.
Authors:Li Ju, Xingyi Yang, Qi Li, Xinchao Wang
Abstract:
Graph neural networks (GNNs) are conventionally trained on a per-domain, per-task basis. It creates a significant barrier in transferring the acquired knowledge to different, heterogeneous data setups. This paper introduces GraphBridge, a novel framework to enable knowledge transfer across disparate tasks and domains in GNNs, circumventing the need for modifications to task configurations or graph structures. Specifically, GraphBridge allows for the augmentation of any pre-trained GNN with prediction heads and a bridging network that connects the input to the output layer. This architecture not only preserves the intrinsic knowledge of the original model but also supports outputs of arbitrary dimensions. To mitigate the negative transfer problem, GraphBridge merges the source model with a concurrently trained model, thereby reducing the source bias when applied to the target domain. Our method is thoroughly evaluated across diverse transfer learning scenarios, including Graph2Graph, Node2Node, Graph2Node, and graph2point-cloud. Empirical validation, conducted over 16 datasets representative of these scenarios, confirms the framework's capacity for task- and domain-agnostic transfer learning within graph-like data, marking a significant advancement in the field of GNNs. Code is available at https://github.com/jujulili888/GraphBridge.
中文: GraphBridge提出了一种新颖框架,能够在图神经网络中实现跨任务和跨领域的知识迁移,无需修改图结构,通过模型融合有效缓解负迁移问题,并在多种场景下展现出卓越性能。
English: GraphBridge introduces a novel framework enabling knowledge transfer across disparate tasks and domains in graph neural networks without requiring structural modifications, effectively mitigating negative transfer through model merging and demonstrating robust performance across diverse scenarios.
Authors:Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff
Abstract:
Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of the generalizability of these models to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution MAEs of 9.4 and 6.0 mmHg for systolic and diastolic BP respectively on PulseDB (with subject-specific calibration), and 14.0 and 8.5 mmHg respectively without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that the performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to educate more researchers for the importance and challenges of out-of-distribution generalization.
Chinese: 本研究评估了基于光电容积脉搏波描记法的深度学习模型在血压估计中的泛化能力,发现模型在外部数据集上性能显著下降,并提出了领域自适应方法以提高模型的鲁棒性。
English: This study evaluates the generalizability of deep learning models for photoplethysmography-based blood pressure estimation, revealing significant performance drops on external datasets and proposing domain adaptation methods to enhance model robustness.
Authors:Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Shaosheng Cao, Donglin Di, Zhoujun Li
Abstract:
With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated testing, but also augments developer efficiency through improved maintainability and reusability of code. In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. CodeIF encompasses a broad range of tasks, including function synthesis, error debugging, algorithmic refactoring, and code explanation, thereby providing a comprehensive suite to evaluate model performance across varying complexity levels and programming domains. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks. The experimental results offer valuable insights into how well current models align with human instructions, as well as the extent to which they can generate consistent, maintainable, and contextually relevant code. Our findings not only underscore the critical role that instruction-following LLMs can play in modern software development, but also illuminate pathways for future research aimed at enhancing their adaptability, reliability, and overall effectiveness in automated code generation. CodeIF data and code are publicly available: https://github.com/lin-rany/codeIF
中文: 本文介绍了首个专门评估大语言模型在多样化代码生成场景中遵循指令能力的基准CodeIF,通过广泛实验揭示了模型在任务执行中的优势与不足,并强调了其在现代软件开发中的关键作用及未来研究方向。
English: This paper introduces CodeIF, the first benchmark designed to evaluate how well Large Language Models follow instructions in code generation tasks across diverse scenarios like function synthesis and debugging, revealing their strengths and limitations while highlighting their potential role in software development.
Authors:Henry Peng Zou, Zhengyao Gu, Yue Zhou, Yankai Chen, Weizhi Zhang, Liancheng Fang, Yibo Wang, Yangning Li, Kay Liu, Philip S. Yu
Abstract:
Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model's prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at https://github.com/HenryPengZou/TestNUC.
中文: TestNUC是一种创新的测试时计算方法,通过利用相邻未标记数据的局部一致性来提高预测精度,在多个数据集上表现优于基线方法,并能与现有方法无缝集成。
English: TestNUC is a novel test-time computing method that enhances prediction accuracy by leveraging the local consistency of neighboring unlabeled data, demonstrating consistent superiority across diverse datasets and seamless integration with existing approaches.
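A minimal sketch of the neighbor-consistency idea: the input is classified by majority vote over the model's prediction on itself and on its k nearest unlabeled neighbors in embedding space. Embeddings and predictions are assumed precomputed, and the exact aggregation rule in TestNUC may differ.

```python
from collections import Counter
import numpy as np

def neighbor_consistent_predict(query_emb, query_pred, unlabeled_embs, unlabeled_preds, k=5):
    # cosine similarity between the query and every unlabeled instance
    q = query_emb / np.linalg.norm(query_emb)
    U = unlabeled_embs / np.linalg.norm(unlabeled_embs, axis=1, keepdims=True)
    neighbor_idx = np.argsort(-(U @ q))[:k]
    votes = [query_pred] + [unlabeled_preds[i] for i in neighbor_idx]
    return Counter(votes).most_common(1)[0][0]
```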
Authors:Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang
Abstract:
How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
Authors:Fraser Birks, Thomas D Swinburne, James R Kermode
Abstract:
Machine-learned interatomic potentials offer near first-principles accuracy but are computationally expensive, limiting their application in large-scale molecular dynamics simulations. Inspired by quantum mechanics/molecular mechanics methods, we present ML-MIX, an efficient and flexible LAMMPS package for accelerating simulations by spatially mixing interatomic potentials of different complexities. Through constrained linear fitting, we show it is possible to generate a 'cheap' approximate model which closely matches an 'expensive' reference in relevant regions of configuration space. We demonstrate the capability of ML-MIX through case-studies in Si, Fe, and W-He systems, achieving up to an 11x speedup on 8,000 atom systems without sacrificing accuracy on static and dynamic quantities, including calculation of minimum energy paths and dynamical simulations of defect diffusion. For larger domain sizes, we show that the achievable speedup of ML-MIX simulations is limited only by the relative speed of the cheap potential over the expensive potential. The ease of use and flexible nature of this method will extend the practical reach of MLIPs throughout computational materials science, enabling parsimonious application to large spatial and temporal domains.
中文摘要:ML-MIX是一个LAMMPS软件包,通过混合不同复杂度的原子间势能来加速分子动力学模拟,在硅和铁等材料系统中实现了高达11倍的加速,同时保持计算精度。
English Summary: ML-MIX is a LAMMPS package that accelerates molecular dynamics simulations by mixing interatomic potentials of varying complexities, achieving up to 11x speedup while maintaining accuracy in materials systems like Si and Fe.
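The constrained linear fitting step described above can be pictured as a least-squares problem: choose coefficients for a cheap linear-in-descriptors model so that its forces reproduce the expensive reference on sampled configurations. The sketch below is schematic (array shapes and the ridge term are assumptions), not the ML-MIX/LAMMPS interface.

```python
import numpy as np

def fit_cheap_potential(descriptor_derivs, reference_forces, ridge=1e-8):
    """
    descriptor_derivs: (n_configs, 3 * n_atoms, n_features) dD/dr per configuration
    reference_forces:  (n_configs, 3 * n_atoms) forces from the expensive potential
    Returns coefficients c such that F_cheap = -dD/dr @ c approximates F_ref.
    """
    A = -descriptor_derivs.reshape(-1, descriptor_derivs.shape[-1])
    b = reference_forces.reshape(-1)
    # ridge-regularized normal equations keep the fit well-conditioned
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ b)
```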
Authors:Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou
Abstract:
The architecture of a neural network and the selection of its activation function are both fundamental to its performance. Equally vital is ensuring these two elements are well-matched, as their alignment is key to achieving effective representation and learning. In this paper, we introduce the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), a novel model that creates a strong synergy between them. We demonstrate that FMMNNs are highly effective and flexible in modeling high-frequency components. Our theoretical results demonstrate that FMMNNs have exponential expressive power for function approximation. We also analyze the optimization landscape of FMMNNs and find it to be much more favorable than that of standard fully connected neural networks, especially when dealing with high-frequency features. In addition, we propose a scaled random initialization method for the first layer's weights in FMMNNs, which significantly speeds up training and enhances overall performance. Extensive numerical experiments support our theoretical insights, showing that FMMNNs consistently outperform traditional approaches in accuracy and efficiency across various tasks.
中文: FMMNN模型通过架构与激活函数的协同作用,实现了指数级表达能力和更优的优化特性,在多种任务中均以更高精度和效率超越传统网络方法。
English: The FMMNN model synergizes architecture and activation functions to achieve exponential expressive power and favorable optimization, outperforming traditional networks in accuracy and efficiency across diverse tasks.
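A rough sketch of the ingredients named above: a first layer built from sinusoidal components whose weights use a scaled random initialization (so the network covers a range of frequencies), followed by ordinary fully connected layers. Layer names, widths, and the frequency scale are illustrative assumptions, not the authors' architecture.

```python
import math
import torch
import torch.nn as nn

class FourierComponentLayer(nn.Module):
    def __init__(self, in_dim, n_components, freq_scale=10.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, n_components)
        # scaled random init: larger weight magnitudes give higher-frequency components
        nn.init.uniform_(self.linear.weight, -freq_scale, freq_scale)
        nn.init.uniform_(self.linear.bias, -math.pi, math.pi)

    def forward(self, x):
        return torch.sin(self.linear(x))

class FourierMultiComponentNet(nn.Module):
    def __init__(self, in_dim=1, n_components=128, hidden=128, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            FourierComponentLayer(in_dim, n_components),
            nn.Linear(n_components, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

y = FourierMultiComponentNet()(torch.linspace(0.0, 1.0, 256).unsqueeze(-1))
```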
Authors:Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan
Abstract:
Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.
Authors:Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao
Abstract:
While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
Chinese: MegaTTS 3采用创新的稀疏对齐算法,在提升零样本文本转语音的语音文本对齐效果和自然度的同时,实现了灵活的语音强度控制和高效生成。
English: MegaTTS 3 introduces a sparse alignment algorithm that enhances zero-shot text-to-speech performance by improving speech-text alignment and naturalness while enabling flexible accent control and efficient generation.
Authors:Tianyun Liu
Abstract:
Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: https://ltydd1314.github.io/.
Authors:Jacob Dunefsky, Arman Cohan
Abstract:
Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, we extend work on "emergent misalignment" and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully on unrelated open-ended prompts. Finally, we use one-shot SV optimization to investigate how an instruction-tuned LLM recovers from outputting false information, and find that this ability is independent of the model's explicit verbalization that the information was false. Overall, our findings suggest that optimizing SVs on a single example can mediate a wide array of misaligned behaviors in LLMs. Code can be found at https://github.com/jacobdunefsky/one-shot-steering-repro and https://github.com/jacobdunefsky/one-shot-steering-misalignment.
中文: 本研究提出通过单一样本的梯度下降优化导向向量,证明其能有效调控大型语言模型中的多种未对齐行为,包括安全操纵和拒绝抑制等。
English: This study introduces a method to optimize steering vectors using just one training example via gradient descent, demonstrating their effectiveness in controlling various misaligned behaviors in large language models, including safety manipulation and refusal suppression.
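A hedged sketch of one-shot steering-vector optimization on a small open model: a learnable vector is added to one layer's hidden states through a forward hook and optimized by gradient descent so the frozen model assigns high probability to a chosen continuation of a single prompt. The layer index, prompt, and hyperparameters are illustrative assumptions, and this is not the authors' exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # small model, for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)

layer = model.transformer.h[6]                        # layer whose output we steer
sv = torch.zeros(model.config.hidden_size, requires_grad=True)

def add_steering(module, inputs, output):
    return (output[0] + sv,) + output[1:]             # add the vector to hidden states

handle = layer.register_forward_hook(add_steering)

prompt, target = "The capital of France is", " Paris"
ids = tok(prompt + target, return_tensors="pt").input_ids
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]

opt = torch.optim.Adam([sv], lr=5e-2)
for _ in range(50):
    logits = model(ids).logits
    # log-probabilities at the positions that predict the target tokens
    logp = torch.log_softmax(logits[0, n_prompt - 1:-1], dim=-1)
    loss = -logp[torch.arange(ids.shape[1] - n_prompt), ids[0, n_prompt:]].mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

handle.remove()                                       # sv is now the one-shot steering vector
```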
Authors:Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Abstract:
Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves state-of-the-art performance compared with linear recurrent architectures on eight benchmarks. Code is available at https://github.com/Fzkuji/swat-attention.
Chinese Summary: 本文提出SWAT模型,通过用sigmoid替换softmax并结合平衡位置嵌入,有效提升了Transformer处理长文本的能力,在多个基准测试中实现了最优性能。
English Summary: The paper introduces SWAT, a model that enhances long-context processing in Transformers by replacing softmax with sigmoid and using balanced position embeddings, achieving state-of-the-art efficiency and performance.
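A minimal sketch combining the ingredients named in the abstract: sigmoid-activated attention scores (in place of softmax), an ALiBi-style linear distance penalty, and a causal sliding-window mask. The normalization by window size and all hyperparameters are illustrative choices, not the released SWAT implementation.

```python
import torch

def sliding_window_sigmoid_attention(q, k, v, window=64, alibi_slope=0.05):
    # q, k, v: (batch, heads, seq, dim)
    d = q.shape[-1]
    seq = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    pos = torch.arange(seq)
    dist = pos[:, None] - pos[None, :]                    # i - j
    scores = scores - alibi_slope * dist.clamp(min=0)     # ALiBi-style distance penalty
    weights = torch.sigmoid(scores)                       # sigmoid instead of softmax
    weights = weights * ((dist >= 0) & (dist < window))   # causal sliding-window mask
    return (weights / window) @ v                         # constant normalization

out = sliding_window_sigmoid_attention(
    torch.randn(1, 4, 128, 32), torch.randn(1, 4, 128, 32), torch.randn(1, 4, 128, 32))
```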
Authors:Yifan Hu, Yuante Li, Peiyuan Liu, Yuxia Zhu, Naiqi Li, Tao Dai, Shu-tao Xia, Dawei Cheng, Changjun Jiang
Abstract:
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making, capturing valuable historical information that can be leveraged for profitable investment strategies. Not surprisingly, this area has attracted considerable attention from researchers, who have proposed a wide range of methods based on various backbones. However, the evaluation of the area often exhibits three systemic limitations: 1. Failure to account for the full spectrum of stock movement patterns observed in dynamic financial markets. (Diversity Gap), 2. The absence of unified assessment protocols undermines the validity of cross-study performance comparisons. (Standardization Deficit), and 3. Neglect of critical market structure factors, resulting in inflated performance metrics that lack practical applicability. (Real-World Mismatch). Addressing these limitations, we propose FinTSB, a comprehensive and practical benchmark for financial time series forecasting (FinTSF). To increase the variety, we categorize movement patterns into four specific parts, tokenize and pre-process the data, and assess the data quality based on some sequence characteristics. To eliminate biases due to different evaluation settings, we standardize the metrics across three dimensions and build a user-friendly, lightweight pipeline incorporating methods from various backbones. To accurately simulate real-world trading scenarios and facilitate practical implementation, we extensively model various regulatory constraints, including transaction fees, among others. Finally, we conduct extensive experiments on FinTSB, highlighting key insights to guide model selection under varying market conditions. Overall, FinTSB provides researchers with a novel and comprehensive platform for improving and evaluating FinTSF methods. The code is available at https://github.com/TongjiFinLab/FinTSBenchmark.
中文摘要:针对金融时间序列分析存在的多样性不足、标准化缺失和实际应用性差等问题,FinTSB基准通过模式分类、统一指标和监管约束建模,为改进和评估预测方法提供了全面平台。
English Summary: Financial time series analysis faces challenges in diversity, standardization, and real-world applicability, which the proposed FinTSB benchmark addresses through pattern categorization, unified metrics, and regulatory modeling to enhance forecasting methods.
Authors:Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S. Y. Teo, Tan M. Nguyen, Linh Duy Tran
Abstract:
Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method. The code is publicly available at: https://github.com/kpup1710/CAMEx.
中文: 本文提出CAMEx,一种利用自然梯度适应参数流形曲率的专家合并方法,在提升模型泛化能力和优化效果的同时,避免了传统曲率感知方法的高内存开销。
English: This paper introduces CAMEx, a curvature-aware expert merging protocol that uses natural gradients to better align with the parameter manifold's geometry, enhancing model generalization and optimization without the high memory costs of traditional methods.
Authors:Ruifeng Tan, Weixiang Hong, Jiayue Tang, Xibin Lu, Ruijun Ma, Xiang Zheng, Jia Li, Jiaqiang Huang, Tong-Yi Zhang
Abstract:
Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested under a narrow range of diversity in labs, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear if models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering a 2.5 times sample size compared to the previous largest dataset, and provides the most diverse battery life resource with batteries from 8 formats, 59 chemical systems, 9 operating temperatures, and 421 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in various neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and CyclePatch consistently improves model performance establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at https://github.com/Ruifeng-Tan/BatteryLife.
中文摘要:电池寿命预测面临数据集有限、测试条件单一和基准不一致的挑战,而提出的BatteryLife数据集和CyclePatch技术有效解决了这些问题,显著提升了模型性能并建立了新的性能基准。
English Summary: Battery Life Prediction faces challenges of limited datasets, narrow testing conditions, and inconsistent benchmarks, which are addressed by the proposed BatteryLife dataset and CyclePatch technique that significantly enhance model performance and establish new standards.
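One way to picture a cycle-level patching plug-in like CyclePatch is as an embedding module that turns each charge/discharge cycle into a single token, which any downstream sequence model can consume. The tensor layout and linear projection below are assumptions for illustration, not the released CyclePatch code.

```python
import torch
import torch.nn as nn

class CyclePatchEmbed(nn.Module):
    def __init__(self, points_per_cycle, n_channels, d_model):
        super().__init__()
        self.proj = nn.Linear(points_per_cycle * n_channels, d_model)

    def forward(self, x):
        # x: (batch, n_cycles, points_per_cycle, n_channels) -> (batch, n_cycles, d_model)
        b, c, p, ch = x.shape
        return self.proj(x.reshape(b, c, p * ch))

tokens = CyclePatchEmbed(points_per_cycle=200, n_channels=3, d_model=128)(
    torch.randn(8, 50, 200, 3))
```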
Authors:Yuxiang Wang, Xinnan Dai, Wenqi Fan, Yao Ma
Abstract:
Graph-structured data has become increasingly prevalent across various domains, raising the demand for effective models to handle graph tasks like node classification and link prediction. Traditional graph learning models like Graph Neural Networks (GNNs) have made significant strides, but their capabilities in handling graph data remain limited in certain contexts. In recent years, large language models (LLMs) have emerged as promising candidates for graph tasks, yet most studies focus primarily on performance benchmarks and fail to address their broader potential, including their ability to handle limited data, their transferability across tasks, and their robustness. In this work, we provide a comprehensive exploration of LLMs applied to graph tasks. We evaluate the performance of pure LLMs, including those without parameter optimization and those fine-tuned with instructions, across various scenarios. Our analysis goes beyond accuracy, assessing LLM ability to perform in few-shot/zero-shot settings, transfer across domains, understand graph structures, and demonstrate robustness in challenging scenarios. We conduct extensive experiments with 16 graph learning models alongside 6 LLMs (e.g., Llama3B, GPT-4o, Qwen-plus), comparing their performance on datasets like Cora, PubMed, ArXiv, and Products. Our findings show that LLMs, particularly those with instruction tuning, outperform traditional models in few-shot settings, exhibit strong domain transferability, and demonstrate excellent generalization and robustness. This work offers valuable insights into the capabilities of LLMs for graph learning, highlighting their advantages and potential for real-world applications, and paving the way for future research in this area. Codes and datasets are released in https://github.com/myflashbarry/LLM-benchmarking.
中文: 本研究全面评估大语言模型在图任务中的表现,发现经过指令微调的模型在少样本学习、跨领域迁移能力和鲁棒性方面显著优于传统图学习方法。
English: This study comprehensively evaluates large language models (LLMs) for graph tasks, demonstrating that instruction-tuned LLMs excel in few-shot learning, cross-domain transferability, and robustness compared to traditional graph models.
Authors:Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR. This work was done during Jiayi Fu's internship at StepFun.
Chinese: 本研究提出了一种新颖的奖励塑造方法PAR,通过利用奖励模型中的潜在偏好信号来有效缓解人类反馈强化学习中的奖励破解问题,在评估中展现出卓越的性能和鲁棒性。
English: This study introduces Preference As Reward (PAR), a novel reward shaping method that effectively mitigates reward hacking in reinforcement learning from human feedback by leveraging latent preferences, demonstrating superior performance and robustness in evaluations.
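The two design principles stated above (a bounded RL reward with rapid initial growth and gradual convergence) are satisfied by, for example, a sigmoid of the reward-model score relative to a single reference score. The sketch below shows that shaping choice; the exact PAR formulation may differ.

```python
import math

def shaped_reward(reward_score, reference_score, temperature=1.0):
    """Bounded in (0, 1); grows fastest near the reference score and then saturates."""
    return 1.0 / (1.0 + math.exp(-(reward_score - reference_score) / temperature))

print(shaped_reward(2.3, reference_score=1.0))   # ~0.79
```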
Authors:Zhewei Kang, Xuandong Zhao, Dawn Song
Abstract:
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size $N$, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
中文: 本文提出“自确定性”这一新指标,利用大语言模型内部概率分布来高效评估回答质量,无需外部奖励模型,实验证明其在推理任务中比现有方法具有更好的可扩展性和性能表现。
English: The paper introduces "self-certainty," a novel metric that uses LLMs' internal probability distributions to efficiently evaluate response quality without external rewards, demonstrating improved scalability and performance across reasoning tasks compared to existing methods.
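A minimal sketch of a distribution-based confidence score in this spirit: score each sampled response by its mean per-token KL divergence from the uniform distribution over the vocabulary, then keep the highest-scoring response among the N samples. The precise aggregation used by self-certainty may differ.

```python
import numpy as np

def self_certainty_score(token_probs):
    """token_probs: (n_tokens, vocab_size) next-token distributions for one response."""
    vocab = token_probs.shape[-1]
    kl_from_uniform = np.sum(token_probs * np.log(token_probs * vocab + 1e-12), axis=-1)
    return kl_from_uniform.mean()

def select_best_of_n(responses, per_response_token_probs):
    scores = [self_certainty_score(p) for p in per_response_token_probs]
    return responses[int(np.argmax(scores))]
```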
Authors:PIN AI Team, Bill Sun, Gavin Guo, Regan Peng, Boliang Zhang, Shouqiao Wang, Laura Florescu, Xi Wang, Davide Crapis, Ben Wu
Abstract:
Personal AI assistants (e.g., Apple Intelligence, Meta AI) offer proactive recommendations that simplify everyday tasks, but their reliance on sensitive user data raises concerns about privacy and trust. To address these challenges, we introduce the Guardian of Data (GOD), a secure, privacy-preserving framework for training and evaluating AI assistants directly on-device. Unlike traditional benchmarks, the GOD model measures how well assistants can anticipate user needs-such as suggesting gifts-while protecting user data and autonomy. Functioning like an AI school, it addresses the cold start problem by simulating user queries and employing a curriculum-based approach to refine the performance of each assistant. Running within a Trusted Execution Environment (TEE), it safeguards user data while applying reinforcement and imitation learning to refine AI recommendations. A token-based incentive system encourages users to share data securely, creating a data flywheel that drives continuous improvement. Specifically, users mine with their data, and the mining rate is determined by GOD's evaluation of how well their AI assistant understands them across categories such as shopping, social interactions, productivity, trading, and Web3. By integrating privacy, personalization, and trust, the GOD model provides a scalable, responsible path for advancing personal AI assistants. For community collaboration, part of the framework is open-sourced at https://github.com/PIN-AI/God-Model.
中文: GOD框架提出了一种安全的设备端训练与评估系统,通过可信执行环境、激励策略和模拟学习,在实现智能助手主动推荐功能的同时保障用户隐私,为负责任的人工智能发展提供可扩展路径。
English: The GOD framework introduces a secure, on-device training and evaluation system for personal AI assistants that balances proactive recommendations with privacy protection through a TEE, incentive mechanisms, and simulated learning to advance responsible AI development.
Authors:Sefik Serengil, Alper Ozpinar
Abstract:
Facial recognition systems rely on embeddings to represent facial images and determine identity by verifying if the distance between embeddings is below a pre-tuned threshold. While embeddings are not reversible to original images, they still contain sensitive information, making their security critical. Traditional encryption methods like AES are limited in securely utilizing cloud computational power for distance calculations. Homomorphic Encryption, allowing calculations on encrypted data, offers a robust alternative. This paper introduces CipherFace, a homomorphic encryption-driven framework for secure cloud-based facial recognition, which we have open-sourced at http://github.com/serengil/cipherface. By leveraging FHE, CipherFace ensures the privacy of embeddings while utilizing the cloud for efficient distance computation. Furthermore, we propose a novel encrypted distance computation method for both Euclidean and Cosine distances, addressing key challenges in performing secure similarity calculations on encrypted data. We also conducted experiments with different facial recognition models, various embedding sizes, and cryptosystem configurations, demonstrating the scalability and effectiveness of CipherFace in real-world applications.
Chinese: 本文提出了CipherFace,一个基于同态加密的开源框架,通过保护敏感嵌入特征并高效计算加密的欧几里得与余弦距离,实现安全的云端人脸识别,适用于实际应用场景。
English: This paper introduces CipherFace, an open-source homomorphic encryption framework that enables secure cloud-based facial recognition by protecting sensitive embeddings while efficiently computing encrypted Euclidean and Cosine distances for real-world applications.
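The encrypted-distance idea can be sketched with an additively homomorphic scheme and the decomposition ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2: the server only needs ciphertext additions and multiplications by its own plaintext embedding. The snippet below uses python-paillier as a stand-in; CipherFace itself builds on fully homomorphic encryption and its own protocol, so treat this purely as an illustration.

```python
from phe import paillier
import numpy as np

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

a = np.random.rand(8)                      # client-side (probe) embedding
b = np.random.rand(8)                      # server-side (gallery) embedding

# client encrypts its embedding and the scalar ||a||^2, then uploads them
enc_a = [pub.encrypt(float(x)) for x in a]
enc_a_sq = pub.encrypt(float(a @ a))

# server combines ciphertexts with its plaintext b: enc(||a||^2 - 2 a.b + ||b||^2)
enc_dist = enc_a_sq + float(b @ b)
for ai_enc, bi in zip(enc_a, b):
    enc_dist = enc_dist + ai_enc * (-2.0 * float(bi))

# client decrypts and compares the squared distance with its verification threshold
print(priv.decrypt(enc_dist), np.sum((a - b) ** 2))
```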
Authors:Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren
Abstract:
Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.
中文摘要:REFINE是一种基于模型重编程的无逆向后门防御方法,通过输入转换和输出重映射模块协同工作,在无需触发模式重构的情况下有效消除后门威胁,同时保持模型性能。
English Summary: REFINE is a novel backdoor defense method that uses model reprogramming with input transformation and output remapping to neutralize embedded triggers without inversion, effectively balancing security and model performance.
Authors:Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi
Abstract:
Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves attack success rates (ASR) of at least 95% on public datasets for leading LLMs (including GPT-4o and GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps in improving model defenses to prompt attacks. TurboFuzzLLM is available open source at https://github.com/amazon-science/TurboFuzzLLM.
中文摘要:TurboFuzzLLM是一种基于变异的模糊测试技术,能自动生成有效的越狱模板来测试大语言模型的鲁棒性,对GPT-4o等领先模型的攻击成功率超过95%,同时有助于增强模型防御能力。
English Summary: TurboFuzzLLM is a mutation-based fuzzing technique that automatically generates effective jailbreaking templates to test LLM robustness, achieving over 95% attack success rates against models like GPT-4o while helping improve their defenses.
Authors:Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Abstract:
Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability, and directional conditional entropy. We analyze the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous. Our code and checkpoints are released at https://github.com/apple/ml-reversal-blessing.
中文摘要:本研究表明,在多项选择题推理任务中,从右向左训练的语言模型优于传统的从左向右模型,揭示了性能与校准度和条件熵的关联,并为文本分布的最佳因子化提供了理论依据。
English Summary: This study demonstrates that right-to-left (R2L) trained language models outperform traditional left-to-right (L2R) models on multiple-choice reasoning tasks, revealing performance links to calibration and conditional entropy while providing theoretical insights into optimal text factorization.
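At the data level, right-to-left training can be realized by simply reversing token sequences before standard next-token-prediction training, so the model learns the factorization p(x_t | x_{t+1}, ..., x_T). The helper below is an illustrative assumption about that preprocessing, not the authors' pipeline.

```python
def to_r2l_example(token_ids, bos_id, eos_id):
    """Reverse the sequence body while keeping sentinel tokens at the correct ends."""
    body = [t for t in token_ids if t not in (bos_id, eos_id)]
    return [bos_id] + body[::-1] + [eos_id]

print(to_r2l_example([1, 17, 42, 99, 2], bos_id=1, eos_id=2))  # [1, 99, 42, 17, 2]
```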
Authors:Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, Hang Su
Abstract:
Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at https://github.com/amazon-science/Glean.
Chinese: GLEAN是一个统一框架,通过主动学习多样化和质量增强的大型语言模型反馈,解决在未标记数据中识别已知和未知类别的挑战。
English: GLEAN is a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback to address challenges in recognizing both known and novel categories in unlabeled data.
Authors:Zenghui Chang, Yiqiao Zhang, Hong Cai Chen
Abstract:
Graph similarity learning, crucial for tasks such as graph classification and similarity search, focuses on measuring the similarity between two graph-structured entities. The core challenge in this field is effectively managing the interactions between graphs. Traditional methods often entail separate, redundant computations for each graph pair, leading to unnecessary complexity. This paper revolutionizes the approach by introducing a parallel graph interaction method called graph fusion. By merging the node sequences of graph pairs into a single large graph, our method leverages a global attention mechanism to facilitate interaction computations and to harvest cross-graph insights. We further assess the similarity between graph pairs at two distinct levels, graph-level and node-level, introducing two innovative, yet straightforward, similarity computation algorithms. Extensive testing across five public datasets shows that our model not only outperforms leading baseline models in graph-to-graph classification and regression tasks but also sets a new benchmark for performance and efficiency. The code for this paper is open-source and available at https://github.com/LLiRarry/GFM-code.git
Chinese Summary: 本文提出了一种名为图融合的并行图交互方法,通过将节点序列合并为单一图并利用全局注意力机制进行交互计算,同时在图和节点两个层面评估相似性,在多项任务中实现了卓越的性能和效率突破。
English Summary: This paper introduces a parallel graph interaction method called graph fusion, which merges node sequences into a single graph to enable efficient cross-graph insights through global attention and dual-level similarity computation, achieving superior performance and efficiency in graph classification and regression tasks.
Authors:Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
Abstract:
Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow matching. Its key innovation is a learned noise-data coupling scheme inspired by Variational Autoencoders, where a data-dependent encoder models noise emission. This enables VCT to adaptively learn noise-to-data pairings, reducing training variance relative to the fixed, unsorted pairings in classical CT. Experiments on multiple image datasets demonstrate significant improvements: our method surpasses baselines, achieves state-of-the-art FID among non-distillation CT approaches on CIFAR-10, and matches SoTA performance on ImageNet 64 x 64 with only two sampling steps. Code is available at https://github.com/sony/vct.
中文: 变分一致性训练(VCT)通过学习噪声与数据的自适应配对机制,有效降低了训练方差,在图像生成任务中仅需两步采样即可达到顶尖性能。
English: Variational Consistency Training (VCT) introduces a learned noise-data coupling scheme to reduce training variance and instability, achieving state-of-the-art results on image generation tasks with minimal sampling steps.
Authors:Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Abstract:
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
中文: SpargeAttn提出了一种通用的稀疏量化注意力机制,通过两阶段在线过滤跳过冗余计算,在加速语言、图像和视频生成等多种模型的同时保持端到端性能不损失。
English: SpargeAttn introduces a universal sparse and quantized attention mechanism that accelerates diverse models across language, image, and video generation without compromising performance by employing a two-stage online filter to skip unnecessary computations.
Authors:Ankita Raj, Deepankar Varma, Chetan Arora
Abstract:
Foundation models (FMs) for computer vision learn rich and robust representations, enabling their adaptation to task/domain-specific deployments with little to no fine-tuning. However, we posit that the very same strength can make applications based on FMs vulnerable to model stealing attacks. Through empirical analysis, we reveal that models fine-tuned from FMs harbor heightened susceptibility to model stealing, compared to conventional vision architectures like ResNets. We hypothesize that this behavior is due to the comprehensive encoding of visual patterns and features learned by FMs during pre-training, which are accessible to both the attacker and the victim. We report that an attacker is able to obtain 94.28% agreement (matched predictions with victim) for a Vision Transformer based victim model (ViT-L/16) trained on CIFAR-10 dataset, compared to only 73.20% agreement for a ResNet-18 victim, when using ViT-L/16 as the thief model. We arguably show, for the first time, that utilizing FMs for downstream tasks may not be the best choice for deployment in commercial APIs due to their susceptibility to model theft. We thereby alert model owners towards the associated security risks, and highlight the need for robust security measures to safeguard such models against theft. Code is available at https://github.com/rajankita/foundation_model_stealing.
中文: 计算机视觉基础模型虽然能轻松适应特定任务,但极易遭受模型窃取攻击,攻击者在微调模型上可获得高达94.28%的预测一致性,这对商业部署构成严重安全威胁。
English: Foundation models in computer vision, while enabling easy adaptation to specific tasks, are highly vulnerable to model stealing attacks, with attackers achieving up to 94.28% prediction agreement on fine-tuned models, highlighting significant security risks for commercial deployments.
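The threat model above is the standard query-based stealing loop: the attacker sends unlabeled images to the victim's black-box API and trains a thief network (here, one the attacker could initialize from the same foundation model) to imitate the returned predictions. The sketch below shows that generic loop under the assumption that the API returns soft probability vectors; it is not the authors' specific attack code.

```python
import torch
import torch.nn.functional as F

def steal(victim_api, thief_model, query_loader, epochs=5, lr=1e-4):
    opt = torch.optim.Adam(thief_model.parameters(), lr=lr)
    for _ in range(epochs):
        for images in query_loader:
            with torch.no_grad():
                victim_probs = victim_api(images)          # black-box soft labels
            thief_log_probs = F.log_softmax(thief_model(images), dim=-1)
            loss = F.kl_div(thief_log_probs, victim_probs, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return thief_model
```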
Authors:Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real-world financial environments remains underexplored. Existing approaches are limited to historical backtesting, where trading actions cannot influence market prices and agents train only on static data. To address this limitation, we present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading and directly impact price dynamics. By simulating realistic bid-ask interactions, our platform enables training in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments reveal that LLMs struggle with numerical reasoning when given plain-text data, often overfitting to local patterns and recent values. In contrast, chart-based visualizations significantly enhance both numerical reasoning and trading performance. Furthermore, incorporating a reflection module yields additional improvements, especially with visual inputs. Evaluations on NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://github.com/wekjsdvnm/Agent-Trading-Arena.
中文: 大语言模型在实时金融交易中因数值推理能力不足和过拟合问题表现欠佳,而代理交易竞技场通过引入可视化输入和反思模块的竞争性模拟,显著提升了交易性能,尤其在市场波动剧烈时效果更为突出。
English: Large language models struggle with real-time financial trading due to limitations in numerical reasoning and overfitting, but the Agent Trading Arena introduces a competitive simulation with visual inputs and reflection modules that significantly enhance performance, especially in volatile markets.
Authors:Mingyuan Sun, Zheng Fang, Jiaxu Wang, Junjie Jiang, Delei Kong, Chenming Hu, Yuetong Fang, Renjing Xu
Abstract:
The increasing complexity and parameter count of Convolutional Neural Networks (CNNs) and Transformers pose challenges in terms of computational efficiency and resource demands. Pruning has been identified as an effective strategy to address these challenges by removing redundant elements such as neurons, channels, or connections, thereby enhancing computational efficiency without heavily compromising performance. This paper builds on the foundational work of Optimal Brain Damage (OBD) by advancing the methodology of parameter importance estimation using the Hessian matrix. Unlike previous approaches that rely on approximations, we introduce Optimal Brain Apoptosis (OBA), a novel pruning method that calculates the Hessian-vector product value directly for each parameter. By decomposing the Hessian matrix across network layers and identifying conditions under which inter-layer Hessian submatrices are non-zero, we propose a highly efficient technique for computing the second-order Taylor expansion of parameters. This approach allows for a more precise pruning process, particularly in the context of CNNs and Transformers, as validated in our experiments including VGG19, ResNet32, ResNet50, and ViT-B/16 on CIFAR10, CIFAR100 and Imagenet datasets. Our code is available at https://github.com/NEU-REAL/OBA.
中文: 本文提出了一种名为最优脑凋亡(OBA)的新剪枝方法,通过直接计算Hessian向量积来精确评估参数重要性,从而在多个数据集上验证了其在提升卷积神经网络和Transformer计算效率方面的有效性。
English: This paper introduces Optimal Brain Apoptosis (OBA), a novel pruning method that enhances computational efficiency by directly calculating Hessian-vector products for precise parameter importance estimation in CNNs and Transformers, validated on multiple datasets.
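As a rough illustration of the kind of second-order importance score this line of work builds on, the sketch below computes a Hessian-vector product via double backpropagation in PyTorch and forms a per-parameter Taylor saliency. It is a generic sketch under our own assumptions (model, loss_fn, data, and target are placeholders), not the paper's layer-wise Hessian decomposition.

import torch

def second_order_saliency(model, loss_fn, data, target):
    # Generic second-order (Taylor) saliency sketch, not OBA's exact procedure:
    # score_i ~ |g_i * theta_i + 0.5 * theta_i * (H theta)_i|, where H theta is
    # obtained from a single Hessian-vector product via double backprop.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(data), target)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [p.detach() for p in params]                      # vector = current weights
    dot = sum((g * vi).sum() for g, vi in zip(grads, v))  # g^T v, still differentiable
    hvps = torch.autograd.grad(dot, params)               # d(g^T v)/d theta = H v
    return [
        (g.detach() * vi + 0.5 * vi * h).abs()
        for g, vi, h in zip(grads, v, hvps)
    ]

Parameters whose scores rank lowest would then be the natural candidates for pruning.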
Authors:Hongyi Chen, Jingtao Ding, Xiaojun Liang, Yong Li, Xiao-Ping Zhang
Abstract:
Source localization in graph information propagation is essential for mitigating network disruptions, including misinformation spread, cyber threats, and infrastructure failures. Existing deep generative approaches face significant challenges in real-world applications due to limited propagation data availability. We present SIDSL (\textbf{S}tructure-prior \textbf{I}nformed \textbf{D}iffusion model for \textbf{S}ource \textbf{L}ocalization), a generative diffusion framework that leverages topology-aware priors to enable robust source localization with limited data. SIDSL addresses three key challenges: unknown propagation patterns through structure-based source estimations via graph label propagation, complex topology-propagation relationships via a propagation-enhanced conditional denoiser with GNN-parameterized label propagation module, and class imbalance through structure-prior biased diffusion initialization. By learning pattern-invariant features from synthetic data generated by established propagation models, SIDSL enables effective knowledge transfer to real-world scenarios. Experimental evaluation on four real-world datasets demonstrates superior performance with 7.5-13.3\% F1 score improvements over baselines, including over 19\% improvement in few-shot and 40\% in zero-shot settings, validating the framework's effectiveness for practical source localization. Our code can be found \href{https://github.com/tsinghua-fib-lab/SIDSL}{here}.
中文: SIDSL是一种利用拓扑感知先验的生成扩散框架,可在有限数据下实现鲁棒的源定位,在真实场景中相比基线方法F1分数提升7.5-13.3%,验证了其实际有效性。
English: SIDSL is a generative diffusion framework that leverages topology-aware priors to enable robust source localization with limited data, demonstrating superior performance with 7.5-13.3% F1 score improvements over baselines in real-world applications.
Authors:Vishal Nedungadi, Muhammad Akhtar Munir, Marc Rußwurm, Ron Sarafian, Ioannis N. Athanasiadis, Yinon Rudich, Fahad Shahbaz Khan, Salman Khan
Abstract:
Air pollution remains a leading global health risk, exacerbated by rapid industrialization and urbanization, contributing significantly to morbidity and mortality rates. In this paper, we introduce AirCast, a novel multi-variable air pollution forecasting model, by combining weather and air quality variables. AirCast employs a multi-task head architecture that simultaneously forecasts atmospheric conditions and pollutant concentrations, improving its understanding of how weather patterns affect air quality. Predicting extreme pollution events is challenging due to their rare occurrence in historic data, resulting in a heavy-tailed distribution of pollution levels. To address this, we propose a novel Frequency-weighted Mean Absolute Error (fMAE) loss, adapted from the class-balanced loss for regression tasks. Informed by domain knowledge, we investigate the selection of key variables known to influence pollution levels. Additionally, we align existing weather and chemical datasets across spatial and temporal dimensions. AirCast's integrated approach, combining multi-task learning, frequency-weighted loss, and domain-informed variable selection, enables more accurate pollution forecasts. Our source code and models are made public here (https://github.com/vishalned/AirCast.git).
中文摘要:AirCast是一种创新的多变量空气污染预测模型,通过整合天气和空气质量数据,采用多任务学习架构、频率加权损失函数及领域知识驱动的变量选择方法,显著提升了污染预测的精确度。
English Summary: AirCast is a novel multi-variable air pollution forecasting model that integrates weather and air quality data through multi-task learning, frequency-weighted loss, and domain-informed variable selection to improve prediction accuracy.
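The frequency-weighted loss idea lends itself to a compact sketch. The snippet below is one plausible instantiation under our own assumptions (the binning scheme, the effective-number weighting borrowed from class-balanced losses, and the normalization are illustrative choices, not AirCast's published fMAE):

import torch

def frequency_weighted_mae(pred, target, n_bins=50, beta=0.999):
    # Weight each sample inversely to how common its target value is, so rare
    # (extreme-pollution) samples are not drowned out by the bulk of the data.
    pred, target = pred.flatten(), target.flatten()
    with torch.no_grad():
        edges = torch.linspace(target.min().item(), target.max().item(),
                               n_bins + 1, device=target.device)[1:-1]
        bins = torch.bucketize(target, edges)                      # bin index per sample
        counts = torch.bincount(bins, minlength=n_bins).float().clamp(min=1)
        eff_num = (1.0 - beta ** counts) / (1.0 - beta)            # "effective number" per bin
        weights = (1.0 / eff_num)[bins]
        weights = weights / weights.mean()                         # keep the overall loss scale
    return (weights * (pred - target).abs()).mean()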
Authors:Yuhan Chen, Yihong Luo, Yifan Song, Pengwen Dai, Jing Tang, Xiaochun Cao
Abstract:
Despite extensive research efforts focused on OOD detection on images, OOD detection on nodes in graph learning remains underexplored. The dependence among graph nodes hinders the trivial adaptation of existing approaches on images that assume inputs to be i.i.d. sampled, since many unique features and challenges specific to graphs are not considered, such as the heterophily issue. Recently, GNNSafe, which considers node dependence, adapted energy-based detection to the graph domain with state-of-the-art performance; however, it has two serious issues: 1) it derives node energy from classification logits without specifically tailored training for modeling data distribution, making it less effective at recognizing OOD data; 2) it highly relies on energy propagation, which is based on the homophily assumption and will cause significant performance degradation on heterophilic graphs, where the node tends to have dissimilar distribution with its neighbors. To address the above issues, we suggest training EBMs by MLE to enhance data distribution modeling and remove energy propagation to overcome the heterophily issues. However, training EBMs via MLE requires performing MCMC sampling on both node features and node neighbors, which is challenging due to the node interdependence and discrete graph topology. To tackle the sampling challenge, we introduce DeGEM, which decomposes the learning process into two parts: a graph encoder that leverages topology information for node representations and an energy head that operates in latent space. Extensive experiments validate that DeGEM, without OOD exposure during training, surpasses previous state-of-the-art methods, achieving an average AUROC improvement of 6.71% on homophilic graphs and 20.29% on heterophilic graphs, and even outperforms methods trained with OOD exposure. Our code is available at: https://github.com/draym28/DeGEM.
中文: 图学习中的节点分布外检测因节点间依赖性而研究不足,提出的DeGEM方法通过将学习过程分解为图编码器和能量头,克服了现有方法的局限,在无需分布外数据训练的情况下实现了最优性能。
English: Graph out-of-distribution (OOD) detection remains underexplored due to node interdependence, and the proposed DeGEM method overcomes limitations of prior approaches by decomposing learning into a graph encoder and energy head, achieving state-of-the-art performance without OOD exposure during training.
Authors:Runzhong Wang, Rui-Xi Wang, Mrunali Manjrekar, Connor W. Coley
Abstract:
Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 28% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches. Code is publicly available at https://github.com/coleygroup/ms-pred
中文摘要:MARASON模型通过神经图匹配技术将检索增强生成应用于分子机器学习,在质谱模拟任务中实现了28%的准确率,较无检索增强的现有最优方法19%有显著提升。
English Summary: Neural graph matching is applied to molecular machine learning through the MARASON model, significantly improving mass spectrum simulation accuracy to 28% compared to 19% without retrieval augmentation.
Authors:Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji
Abstract:
Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, where misinformation or irrelevant knowledge is intentionally injected into external knowledge bases to manipulate model outputs to be incorrect and even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and Globalized Poisoning Attack (GPA) to provide false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, demonstrating that LPA successfully manipulates the MLLM to generate attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation to 0% accuracy with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.
中文摘要:多模态检索增强生成系统面临知识投毒攻击的安全隐患,攻击者通过注入误导性内容操控模型输出,MM-PoisonRAG框架实验显示其定向攻击成功率最高可达56%。
English Summary: Multimodal RAG systems face critical safety risks from knowledge poisoning attacks, where adversaries inject misleading content to manipulate model outputs, as demonstrated by the MM-PoisonRAG framework achieving up to 56% targeted attack success.
Authors:Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu
Abstract:
Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose the VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms numerical time series data into images and feeds them into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series, and variate-wise anomalies. Our study reveals several key insights:
1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies.
2) MLLMs are highly robust to irregular time series, even with 25% of the data missing.
3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series.
To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at https://github.com/mllm-ts/VisualTimeAnomaly to support future research.
Chinese: 本研究提出VisualTimeAnomaly基准,通过将时间序列数值数据转换为图像来评估多模态大语言模型在异常检测中的表现,发现模型能有效检测范围和变量级异常、对不规则数据具有强鲁棒性,且开源模型在单变量场景下与商业模型性能相当。
English: This study introduces the VisualTimeAnomaly benchmark to evaluate multimodal large language models (MLLMs) on time series anomaly detection by converting numerical data into images, revealing that MLLMs effectively detect range- and variate-wise anomalies, show robustness to irregular data, and that open-source models perform comparably to proprietary ones in univariate cases.
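Since the benchmark's core move is simply rendering numbers as pictures before prompting an MLLM, a minimal sketch of that conversion step is shown below. The plot size, styling, and file handling are our own illustrative choices, not the benchmark's exact settings.

import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

def series_to_image(values, path="series.png"):
    # Turn a univariate series into a line-plot image that can be sent to a
    # vision-language model alongside a textual anomaly-detection prompt.
    values = np.asarray(values, dtype=float)
    fig, ax = plt.subplots(figsize=(6, 3), dpi=100)
    ax.plot(np.arange(len(values)), values, linewidth=1.2)
    ax.set_xlabel("time step")
    ax.set_ylabel("value")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path

# e.g. series_to_image(np.sin(np.linspace(0, 20, 400)) + np.random.randn(400) * 0.05)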
Authors:Ruxiao Chen, Chenguang Wang, Yuran Sun, Xilei Zhao, Susu Xu
Abstract:
Evacuation decision prediction is critical for efficient and effective wildfire response by helping emergency management anticipate traffic congestion and bottlenecks, allocate resources, and minimize negative impacts. Traditional statistical methods for evacuation decision prediction fail to capture the complex and diverse behavioral logic of different individuals. In this work, for the first time, we introduce FLARE, short for facilitating LLM for advanced reasoning on wildfire evacuation decision prediction, a Large Language Model (LLM)-based framework that integrates behavioral theories and models to streamline Chain-of-Thought (CoT) reasoning, and subsequently integrates a memory-based Reinforcement Learning (RL) module to provide accurate evacuation decision prediction and understanding. Our proposed method addresses the limitations of using existing LLMs for evacuation behavioral predictions, such as limited survey data, mismatch with behavioral theory, conflicting individual preferences, implicit and complex mental states, and intractable mental state-behavior mapping. Experiments on three post-wildfire survey datasets show an average of 20.47% performance improvement over traditional theory-informed behavioral models, with strong cross-event generalizability. Our complete code is publicly available at https://github.com/SusuXu-s-Lab/FLARE
Chinese: FLARE框架通过结合行为理论与大语言模型推理及强化学习,改进了野火疏散决策预测,相比传统模型性能提升20.47%,有效解决了数据不足和理论不匹配等局限性。
English: The FLARE framework enhances wildfire evacuation decision prediction by integrating behavioral theories with LLM-based reasoning and reinforcement learning, achieving a 20.47% performance improvement over traditional models while addressing data and theory limitations.
Authors:Taos Transue, Bao Wang
Abstract:
The orchestration of agents to optimize a collective objective without centralized control is challenging yet crucial for applications such as controlling autonomous fleets, and surveillance and reconnaissance using sensor networks. Decentralized controller design has been inspired by self-organization found in nature, with a prominent source of inspiration being flocking; however, decentralized controllers struggle to maintain flock cohesion. The graph neural network (GNN) architecture has emerged as an indispensable machine learning tool for developing decentralized controllers capable of maintaining flock cohesion, but they fail to exploit the symmetries present in flocking dynamics, hindering their generalizability. We enforce rotation equivariance and translation invariance symmetries in decentralized flocking GNN controllers and achieve comparable flocking control with 70% less training data and 75% fewer trainable weights than existing GNN controllers without these symmetries enforced. We also show that our symmetry-aware controller generalizes better than existing GNN controllers. Code and animations are available at http://github.com/Utah-Math-Data-Science/Equivariant-Decentralized-Controllers.
中文摘要:研究人员开发了一种基于图神经网络的分散式集群控制器,通过强制旋转等变和平移不变对称性,在减少70%训练数据和75%参数的情况下实现相当性能,并展现出更优的泛化能力。
English Summary: Researchers developed a decentralized flocking controller using graph neural networks that enforces rotation equivariance and translation invariance, achieving comparable performance with 70% less training data and 75% fewer parameters while demonstrating superior generalization capabilities.
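To make the symmetry argument concrete, the sketch below shows one standard way (EGNN-style, our own illustration rather than the paper's architecture) to build a controller step that is translation-invariant and rotation-equivariant: feed only rotation-invariant scalars into the learned network, and output a sum of relative vectors scaled by its predictions.

import torch

def equivariant_accel(pos, vel, weight_net):
    # pos, vel: (N, 2) positions and velocities of N agents.
    # weight_net: assumed callable mapping invariant edge features (N, N, 3) -> (N, N, 2).
    rel_pos = pos[None, :, :] - pos[:, None, :]            # relative vectors: translation-invariant
    rel_vel = vel[None, :, :] - vel[:, None, :]
    dist = rel_pos.norm(dim=-1, keepdim=True)              # rotation-invariant scalars
    speed = rel_vel.norm(dim=-1, keepdim=True)
    align = (rel_pos * rel_vel).sum(-1, keepdim=True)
    w = weight_net(torch.cat([dist, speed, align], dim=-1))
    # The output is a weighted sum of relative vectors, so rotating all inputs rotates the output.
    return (w[..., :1] * rel_pos + w[..., 1:] * rel_vel).sum(dim=1)   # (N, 2) accelerations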
Authors:Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman
Abstract:
Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.
中文: 本研究提出了一种基于ADMM的理论严谨方法,可生成人类可读的合成文本,在保证收敛性、性能和隐私的前提下用于微调大语言模型,并在多项分类任务中得到验证。
English: This study introduces a theoretically rigorous method using ADMM to generate human-readable synthetic text that ensures convergence, performance, and privacy for fine-tuning LLMs, validated across multiple classification tasks.
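The core gradient-matching idea is easy to sketch in isolation. The snippet below shows only that piece, under our own simplifications: a cosine-distance objective between the gradient induced by trainable synthetic embeddings and a given (possibly noise-perturbed) target gradient. The paper's ADMM updates, privacy noise calibration, and mapping back to low-perplexity token sequences are omitted.

import torch
import torch.nn.functional as F

def gradient_match_loss(model, loss_fn, synth_emb, synth_labels, target_grad):
    # synth_emb: leaf tensor with requires_grad=True; model is assumed to accept
    # input embeddings directly (inputs_embeds-style). target_grad: flat tensor.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(synth_emb), synth_labels)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g_syn = torch.cat([g.flatten() for g in grads])
    # Minimizing this w.r.t. synth_emb pulls the synthetic gradient toward the target one.
    return 1.0 - F.cosine_similarity(g_syn, target_grad.flatten(), dim=0)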
Authors:Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray
Abstract:
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap-the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
中文: 本综述分析了为应对数据污染风险从静态基准测试向动态基准测试的转变,指出了现有评估标准的不足,并为动态基准测试提出了优化设计原则。
English: This survey analyzes the shift from static to dynamic benchmarking in large language models to address data contamination risks, identifies gaps in current evaluation standards, and proposes optimal design principles for dynamic benchmarks.
Authors:François Charton
Abstract:
This paper documents Int2Int, an open source code base for using transformers on problems of mathematical research, with a focus on number theory and other problems involving integers. Int2Int is a complete PyTorch implementation of a transformer architecture, together with training and evaluation loops, and classes and functions to represent, generate and decode common mathematical objects. Ancillary code for data preparation, and Jupyter Notebooks for visualizing experimental results are also provided. This document presents the main features of Int2Int, serves as its user manual, and provides guidelines on how to extend it. Int2Int is released under the MIT licence, at https://github.com/f-charton/Int2Int.
中文: Int2Int是一个基于PyTorch的开源Transformer框架,专注于数论等整数相关数学研究,提供完整实现、训练工具及可视化支持。
English: Int2Int is an open-source PyTorch-based transformer framework designed for mathematical research, particularly in number theory, providing complete implementation, training tools, and visualization support for integer-related problems.
Authors:Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Abstract:
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.
中文摘要:本研究首次系统综述了蛋白质大语言模型,涵盖其架构、应用与挑战,确立了其在推动蛋白质科学发展的关键工具地位。
English Summary: This work presents the first comprehensive survey of Protein LLMs, detailing their architectures, applications, and challenges while positioning them as essential tools for advancing protein science.
Authors:Yushi Zhang, Shuai Su, Yong Wang, Yanzhong Yao
Abstract:
This paper develops a novel deep learning approach for solving evolutionary equations, which integrates sequential learning strategies with an enhanced hard constraint strategy featuring trainable parameters, addressing the low computational accuracy of standard Physics-Informed Neural Networks (PINNs) in large temporal domains. Sequential learning strategies divide a large temporal domain into multiple subintervals and solve them one by one in a chronological order, which naturally respects the principle of causality and improves the stability of the PINN solution. The improved hard constraint strategy strictly ensures the continuity and smoothness of the PINN solution at time interval nodes, and at the same time passes the information from the previous interval to the next interval, which avoids the incorrect/trivial solution at the position far from the initial time. Furthermore, by investigating the requirements of different types of equations on hard constraints, we design a novel influence function with trainable parameters for hard constraints, which provides theoretical and technical support for the effective implementations of hard constraint strategies, and significantly improves the universality and computational accuracy of our method. In addition, an adaptive time-domain partitioning algorithm is proposed, which plays an important role in the application of the proposed method as well as in the improvement of computational efficiency and accuracy. Numerical experiments verify the performance of the method. The data and code accompanying this paper are available at https://github.com/zhizhi4452/HCS.
Chinese: 本文提出了一种新颖的深度学习算法,通过结合序列学习与具有可训练参数的改进硬约束策略,有效解决了物理信息神经网络在大时间域中计算精度不足的问题。
English: This paper introduces a novel deep learning method that combines sequential learning with an enhanced hard constraint strategy featuring trainable parameters to significantly improve the computational accuracy of Physics-Informed Neural Networks for evolutionary equations over large temporal domains.
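A minimal sketch of a hard-constraint ansatz with a trainable influence rate is given below; it only enforces value continuity at the interface of one subinterval and uses a simple exponential influence function of our own choosing, so it illustrates the mechanism rather than the paper's construction (which also guarantees smoothness).

import torch
import torch.nn as nn
import torch.nn.functional as F

class HardConstrainedInterval(nn.Module):
    # Represents u(t) = u0 + phi(t - t0) * net(t) on one temporal subinterval,
    # where phi(0) = 0, so u(t0) exactly matches the value u0 handed over from
    # the previous subinterval. The rate `a` inside phi is trainable.
    def __init__(self, u0, t0, hidden=64):
        super().__init__()
        self.register_buffer("u0", torch.as_tensor(u0, dtype=torch.float32))
        self.register_buffer("t0", torch.as_tensor(t0, dtype=torch.float32))
        self.a = nn.Parameter(torch.tensor(1.0))
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, t):                                   # t: (batch, 1)
        phi = 1.0 - torch.exp(-F.softplus(self.a) * (t - self.t0))
        return self.u0 + phi * self.net(t)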
Authors:Ruoyu Guo, Haochen Qiu
Abstract:
Making consistently profitable financial decisions in a continuously evolving and volatile stock market has always been a difficult task. Professionals from different disciplines have developed foundational theories to anticipate price movement and evaluate securities such as the famed Capital Asset Pricing Model (CAPM). In recent years, the role of artificial intelligence (AI) in asset pricing has been growing. Although the black-box nature of deep learning models lacks interpretability, they have continued to solidify their position in the financial industry. We aim to further enhance AI's potential and utility by introducing a return-weighted loss function that will drive top growth while providing the ML models a limited amount of information. Using only publicly accessible stock data (open/close/high/low, trading volume, sector information) and several technical indicators constructed from them, we propose an efficient daily trading system that detects top growth opportunities. Our best models achieve 61.73% annual return on daily rebalancing with an annualized Sharpe Ratio of 1.18 over 1340 testing days from 2019 to 2024, and 37.61% annual return with an annualized Sharpe Ratio of 0.97 over 1360 testing days from 2005 to 2010. The main drivers for success, especially independent of any domain knowledge, are the novel return-weighted loss function, the integration of categorical and continuous data, and the ML model architecture. We also demonstrate the superiority of our novel loss function over traditional loss functions via several performance metrics and statistical evidence.
中文: 本研究通过引入收益加权损失函数,仅利用公开数据和衍生技术指标,在无需领域知识的情况下构建了高效AI交易系统,实现了高年化收益率与夏普比率。
English: This study introduces a return-weighted loss function to enhance AI-driven stock trading, achieving high returns and Sharpe ratios by using only public data and technical indicators without domain knowledge.
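The return-weighted idea can be pictured as a simple weighted objective. The form below is one plausible instantiation of our own (a listwise target built from positive forward returns), not the paper's exact loss:

import torch

def return_weighted_loss(pred_scores, realized_returns):
    # pred_scores: (N,) model scores for N stocks on one day.
    # realized_returns: (N,) subsequent realized returns for the same stocks.
    # Build a target distribution proportional to positive returns, so stocks with
    # the largest gains dominate the objective and drive top-growth ranking.
    weights = torch.clamp(realized_returns, min=0.0)
    weights = weights / (weights.sum() + 1e-8)
    return -(weights * torch.log_softmax(pred_scores, dim=0)).sum()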
Authors:Xu Wang, Jiaju Kang, Puyu Han, Yubao Zhao, Qian Liu, Liwenfei He, Lingqiong Zhang, Lingyun Dai, Yongcheng Wang, Jie Tao
Abstract:
We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to complex diagnoses involving rare conditions and temporal changes. A key innovation is the support for multi-turn dialogues, enabling the development of conversational medical AI systems that emulate clinician-patient or interprofessional interactions. This allows for more realistic assessment of AI models' clinical reasoning, diagnostic accuracy, and knowledge integration. Constructed through a knowledge-guided framework with strict quality control, ECG-Expert-QA ensures linguistic and clinical consistency, making it a high-quality resource for advancing AI-assisted ECG interpretation. It challenges models with tasks like identifying subtle ischemic changes and interpreting complex arrhythmias in context-rich scenarios. To promote research transparency and collaboration, the dataset, accompanying code, and prompts are publicly released at https://github.com/Zaozzz/ECG-Expert-QA
中文: ECG-Expert-QA是一个结合真实与合成心电图案例的多模态数据集,包含47,211个专家验证的问答对,支持多轮对话功能,旨在推进临床推理和诊断准确性的会话式医疗AI系统发展。
English: ECG-Expert-QA is a multimodal dataset combining real and synthetic ECG cases with 47,211 expert-validated QA pairs, featuring multi-turn dialogues to advance conversational medical AI systems for clinical reasoning and diagnostic accuracy.
Authors:Younghoon Na, Seunghun Oh, Seongji Ko, Hyunkyung Lee
Abstract:
The analysis of lifelogs can yield valuable insights into an individual's daily life, particularly with regard to their health and well-being. Accurate assessment of quality of life, however, requires diverse sensors and precise synchronization. To address this, this study proposes the image-based sleep quality and stress level estimation flow (PixleepFlow). PixleepFlow converts lifelog sensor data into composite images to examine sleep patterns and their impact on overall health. Experiments were conducted using lifelog datasets to ascertain the optimal combination of data formats. In addition, we identified which sensor information has the greatest influence on quality of life through Explainable Artificial Intelligence (XAI). As a result, the composite-image representation used by PixleepFlow produced more significant results than alternative data formats. This study was part of a written-based competition, and additional findings from the lifelog dataset are detailed in Section IV. More information about PixleepFlow can be found at https://github.com/seongjiko/Pixleep.
中文: 本研究提出基于图像的PixleepFlow方法,通过将生命日志数据转换为复合图像来评估睡眠质量和压力水平,并利用可解释人工智能技术识别关键影响因素,取得了优于传统数据格式的分析效果。
English: This study introduces PixleepFlow, an image-based method that analyzes lifelog data to estimate sleep quality and stress levels, achieving superior results through composite image conversion and explainable AI techniques.
Authors:Ziyue Yang, Chengrui Chen, Yong Peng, Qiong Chen, Wanzeng Kong
Abstract:
The Rapid Serial Visual Presentation (RSVP) paradigm represents a promising application of electroencephalography (EEG) in Brain-Computer Interface (BCI) systems. However, cross-subject variability remains a critical challenge, particularly for BCI-illiterate users who struggle to effectively interact with these systems. To address this issue, we propose the Class-Sensitive Subject-to-Subject Semantic Style Transfer Network (CSSSTN), which incorporates a class-sensitive approach to align feature distributions between golden subjects (BCI experts) and target (BCI-illiterate) users on a class-by-class basis. Building on the SSSTN framework, CSSSTN incorporates three key components: (1) subject-specific classifier training, (2) a unique style loss to transfer class-discriminative features while preserving semantic information through a modified content loss, and (3) an ensemble approach to integrate predictions from both source and target domains. We evaluated CSSSTN using both a publicly available dataset and a self-collected dataset. Experimental results demonstrate that CSSSTN outperforms state-of-the-art methods, achieving mean balanced accuracy improvements of 6.4\% on the Tsinghua dataset and 3.5\% on the HDU dataset, with notable benefits for BCI-illiterate users. Ablation studies confirm the effectiveness of each component, particularly the class-sensitive transfer and the use of lower-layer features, which enhance transfer performance and mitigate negative transfer. Additionally, CSSSTN achieves competitive results with minimal target data, reducing calibration time and effort. These findings highlight the practical potential of CSSSTN for real-world BCI applications, offering a robust and scalable solution to improve the performance of BCI-illiterate users while minimizing reliance on extensive training data. Our code is available at https://github.com/ziyuey/CSSSTN.
中文摘要:提出的CSSSTN模型通过类别敏感的特征迁移方法,有效解决了脑机接口系统中跨被试差异的难题,在提升BCI不熟练用户性能的同时显著减少了校准数据需求。
English Summary: The proposed CSSSTN model effectively addresses cross-subject variability in EEG-based BCI systems by transferring class-discriminative features from expert to novice users, achieving significant accuracy improvements while reducing calibration requirements.
Authors:Francesco Stefano Carzaniga, Gary Tom Hoppeler, Michael Hersche, Kaspar Anton Schindler, Abbas Rahimi
Abstract:
All data modalities are not created equal, even when the signal they measure comes from the same source. In the case of the brain, two of the most important data modalities are the scalp electroencephalogram (EEG), and the intracranial electroencephalogram (iEEG). They are used by human experts, supported by deep learning (DL) models, to accomplish a variety of tasks, such as seizure detection and motor imagery classification. Although the differences between EEG and iEEG are well understood by human experts, the performance of DL models across these two modalities remains under-explored. To help characterize the importance of clean data on the performance of DL models, we propose BrainCodec, a high-fidelity EEG and iEEG neural compressor. We find that training BrainCodec on iEEG and then transferring to EEG yields higher reconstruction quality than training on EEG directly. In addition, we also find that training BrainCodec on both EEG and iEEG improves fidelity when reconstructing EEG. Our work indicates that data sources with higher SNR, such as iEEG, provide better performance across the board also in the medical time-series domain. BrainCodec also achieves up to a 64x compression on iEEG and EEG without a notable decrease in quality. BrainCodec markedly surpasses current state-of-the-art compression models both in final compression ratio and in reconstruction fidelity. We also evaluate the fidelity of the compressed signals objectively on a seizure detection and a motor imagery task performed by standard DL models. Here, we find that BrainCodec achieves a reconstruction fidelity high enough to ensure no performance degradation on the downstream tasks. Finally, we collect the subjective assessment of an expert neurologist, that confirms the high reconstruction quality of BrainCodec in a realistic scenario. The code is available at https://github.com/IBM/eeg-ieeg-brain-compressor.
中文摘要:BrainCodec是一种针对脑电图和颅内脑电图的高保真神经压缩器,利用高信噪比的iEEG数据训练可提升模型性能,在保持信号质量的同时实现高效压缩,确保下游医疗任务无性能损失。
English Summary: BrainCodec is a neural compressor that achieves high-fidelity compression for EEG and iEEG brain signals, demonstrating superior performance when trained on iEEG data and maintaining signal quality for downstream medical tasks.
Authors:Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
Abstract:
Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
中文: 本文提出分形生成模型,通过递归调用原子生成模块构建自相似结构,在图像生成任务中表现出色,为生成建模开辟了新范式。
English: This paper introduces fractal generative models, which recursively use atomic generative modules to create self-similar architectures, demonstrating strong performance in image generation tasks and offering a new paradigm for generative modeling.
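The recursion itself is the main structural idea and can be sketched in a few lines. The toy module below is our own simplification, with plain MLPs in place of the paper's autoregressive atomic generators; it shows how a level-k module is built by recursively invoking level-(k-1) copies of the same pattern.

import torch
import torch.nn as nn

class FractalModule(nn.Module):
    def __init__(self, dim, level, branches=2):
        super().__init__()
        self.level = level
        if level == 0:
            self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            self.submodules = None
        else:
            self.body = nn.Linear(dim, dim)
            self.submodules = nn.ModuleList(
                [FractalModule(dim, level - 1, branches) for _ in range(branches)]
            )

    def forward(self, x):
        if self.level == 0:
            return self.body(x)
        h = self.body(x)
        for sub in self.submodules:     # self-similar: each child repeats the same construction
            h = h + sub(h)
        return h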
Authors:Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao
Abstract:
We formulate a hierarchical rectified flow to model data distributions. It hierarchically couples multiple ordinary differential equations (ODEs) and defines a time-differentiable stochastic process that generates a data distribution from a known source distribution. Each ODE resembles the ODE that is solved in a classic rectified flow, but differs in its domain, i.e., location, velocity, acceleration, etc. Unlike the classic rectified flow formulation, which formulates a single ODE in the location domain and only captures the expected velocity field (sufficient to capture a multi-modal data distribution), the hierarchical rectified flow formulation models the multi-modal random velocity field, acceleration field, etc., in their entirety. This more faithful modeling of the random velocity field enables integration paths to intersect when the underlying ODE is solved during data generation. Intersecting paths in turn lead to integration trajectories that are more straight than those obtained in the classic rectified flow formulation, where integration paths cannot intersect. This leads to modeling of data distributions with fewer neural function evaluations. We empirically verify this on synthetic 1D and 2D data as well as MNIST, CIFAR-10, and ImageNet-32 data. Our code is available at: https://riccizz.github.io/HRF/.
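中文: 本文提出分层修正流,通过层级耦合多个常微分方程来完整建模多模态的随机速度场与加速度场,使积分路径可以相交且更接近直线,从而以更少的神经函数评估次数拟合数据分布,并在合成数据以及MNIST、CIFAR-10和ImageNet-32上得到验证。
English: This paper proposes hierarchical rectified flow, which hierarchically couples multiple ODEs to model the full multi-modal random velocity and acceleration fields, allowing integration paths to intersect and become straighter so that data distributions are modeled with fewer neural function evaluations, as verified on synthetic data, MNIST, CIFAR-10, and ImageNet-32.
As background, a minimal sketch of the classic rectified-flow objective that the hierarchy extends (assuming a source sample $X_0$, a data sample $X_1$, and linear interpolation) is $Z_t = (1-t)\,X_0 + t\,X_1$, $\frac{\mathrm{d}Z_t}{\mathrm{d}t} = v(Z_t, t)$, with $v^\star = \arg\min_v \, \mathbb{E}_{X_0, X_1, t}\big\|(X_1 - X_0) - v(Z_t, t)\big\|^2$; per the abstract, the hierarchical formulation couples analogous ODEs in the velocity, acceleration, and higher-order domains so that the full random velocity field, not just its expectation, is modeled.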
Authors:Runpeng Yu, Xinyin Ma, Xinchao Wang
Abstract:
To utilize visual information, Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLM still lacks the autonomous capability to control its own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6\%, increasing its score from 0.572 to 0.708, and even outperforms a 7B parameter model by 13.4\% (from 0.624). Please check out our repo https://github.com/yu-rp/VisualPerceptionToken
中文摘要:视觉感知令牌的提出使多模态大语言模型能够自主控制其视觉感知过程,通过区域选择令牌和视觉重编码令牌实现选择性图像区域审查,从而显著提升空间推理和细粒度理解等任务的性能表现。
English Summary: The proposed Visual Perception Token empowers Multimodal Large Language Models with autonomous control over visual perception processes, enabling selective region review and enhanced visual re-encoding to significantly boost performance in spatial reasoning and fine-grained understanding tasks.
Authors:Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Abstract:
As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.
Chinese: LongSpec是一种创新框架,通过内存高效的草稿模型、专用位置索引和优化注意力机制,解决了现有推测解码方法在长上下文场景中的局限性,在长上下文应用中实现了高达3.26倍的加速效果。
English: LongSpec is a novel framework that overcomes the limitations of existing speculative decoding methods in long-context scenarios through memory-efficient draft models, specialized position indices, and optimized attention mechanisms, achieving up to 3.26x speedup in long-context applications.
Authors:Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.
中文: COSMOS是一种新型混合优化器,通过对关键特征子空间应用SOAP和对次要子空间使用MUON,在保证优化性能的同时显著降低内存消耗,特别适用于大规模语言模型。
English: COSMOS is a novel hybrid optimizer that efficiently manages memory usage by applying SOAP to critical eigensubspaces and MUON to less important ones, maintaining robust optimization performance for large language models.
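The eigensubspace split at the heart of this hybrid strategy can be sketched independently of the specific SOAP and MUON update rules. The snippet below is our own illustration; the rank choice and the use of an exact SVD rather than an amortized estimate are simplifications.

import torch

def split_leading_subspace(grad_mat, k):
    # Project the gradient matrix onto its top-k left singular directions.
    U, S, Vh = torch.linalg.svd(grad_mat, full_matrices=False)
    proj = U[:, :k] @ U[:, :k].T
    g_lead = proj @ grad_mat          # dominant dynamics: handled by the heavier optimizer
    g_rest = grad_mat - g_lead        # remaining subspace: handled by the cheaper optimizer
    return g_lead, g_rest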
Authors:André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Abstract:
How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO
中文: DIS-CO是一种通过向视觉语言模型输入特定帧并分析其文本补全来检测训练数据中是否包含受版权保护内容的新方法,其检测性能显著优于现有技术,并揭示所有测试模型均存在不同程度的版权内容暴露。
English: DIS-CO is a novel method that detects whether copyrighted content was used to train vision-language models by querying them with specific frames and analyzing text completions, significantly outperforming prior methods and revealing widespread exposure to such content across tested models.
Authors:Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee, Shengjie Sun, Wei Xue, Yike Guo
Abstract:
Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13% performance gains over other compressors on Mixtral|Phi-3.5|DeepSeek|Qwen2 MoE LLMs at 40$\sim$60% compression rates. Code is available at https://github.com/lliai/D2MoE.
中文: 提出的$D^2$-MoE方法通过将专家权重分解为共享基础权重和独特增量权重,结合Fisher合并、SVD压缩和半动态剪枝策略,无需重新训练即可实现高压缩比,在40-60%压缩率下比其他压缩方法性能提升超过13%。
English: The proposed $D^2$-MoE method effectively compresses Mixture-of-Experts LLMs by decomposing expert weights into shared base and unique delta components, then applying Fisher merging, SVD compression, and semi-dynamical pruning to achieve high compression ratios without retraining, outperforming other compressors by over 13% at 40-60% compression rates.
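The base-plus-delta decomposition and the SVD step can be sketched compactly. The snippet below is a simplified illustration under our own assumptions (a diagonal Fisher approximation and a fixed rank); the paper's semi-dynamical structured pruning of the base weight is omitted.

import torch

def delta_svd_compress(expert_weights, fisher_diag, rank):
    # expert_weights: list of (out, in) expert matrices for one MoE layer.
    # fisher_diag: list of matching tensors with per-entry Fisher estimates.
    f_sum = sum(fisher_diag)
    base = sum(w * f for w, f in zip(expert_weights, fisher_diag)) / (f_sum + 1e-8)
    factors = []
    for w in expert_weights:
        U, S, Vh = torch.linalg.svd(w - base, full_matrices=False)
        factors.append((U[:, :rank] * S[:rank], Vh[:rank]))   # delta approximated as A @ B
    return base, factors                                       # store the base once + low-rank deltas

def reconstruct_expert(base, factor):
    A, B = factor
    return base + A @ B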
Authors:Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye
Abstract:
Large Language Models (LLMs) have demonstrated human-like instruction-following abilities, particularly those exceeding 100 billion parameters. The combined capability of some smaller, resource-friendly LLMs can address most of the instructions that larger LLMs excel at. In this work, we explore how to route the best-performing LLM for each instruction to achieve better overall performance. We develop a new paradigm, constructing capability instructions with model capability representation, user instruction, and performance inquiry prompts to assess the performance. To learn from capability instructions, we introduce a new end-to-end framework called Model Selection with Aptitude Test (Model-SAT), which generates positive and negative samples based on what different models perform well or struggle with. Model-SAT uses a model capability encoder that extends its model representation to a lightweight LLM. Our experiments show that Model-SAT understands the performance dimensions of candidate models and provides the probabilities of their capability to handle various instructions. Additionally, during deployment, a new model can quickly infer its aptitude test results across 50 tasks, each with 20 shots. Model-SAT performs state-of-the-art model routing without candidate inference and in real-world new model-released scenarios. The code is available at https://github.com/Now-Join-Us/CIT-LLM-Routing
中文: 超过1000亿参数的大型语言模型展现出类似人类的指令跟随能力,本研究提出Model-SAT框架,通过测试模型能力将指令路由至最佳执行模型而无需候选推理,在现实场景中实现最优性能。
English: Large language models with over 100 billion parameters show human-like instruction-following abilities, and this research introduces Model-SAT, a framework that routes instructions to the best-performing model by testing their capabilities without needing candidate inference, achieving top performance in real-world scenarios.
Authors:Hogun Kee, Wooseok Oh, Minjae Kang, Hyemin Ahn, Songhwai Oh
Abstract:
In this paper, we present the tidiness score-guided Monte Carlo tree search (TSMCTS), a novel framework designed to address the tabletop tidying up problem using only an RGB-D camera. We address two major problems in tabletop tidying up: (1) the lack of public datasets and benchmarks, and (2) the difficulty of specifying the goal configuration of unseen objects. We address the former by presenting the tabletop tidying up (TTU) dataset, a structured dataset collected in simulation. Using this dataset, we train a vision-based discriminator capable of predicting the tidiness score. This discriminator can consistently evaluate the degree of tidiness across unseen configurations, including real-world scenes. Addressing the second problem, we employ Monte Carlo tree search (MCTS) to find tidying trajectories without specifying explicit goals. Instead of providing specific goals, we demonstrate that our MCTS-based planner can find diverse tidied configurations using the tidiness score as guidance. Consequently, we propose TSMCTS, which integrates a tidiness discriminator with an MCTS-based tidying planner to find optimal tidied arrangements. TSMCTS has successfully demonstrated its capability across various environments, including coffee tables, dining tables, office desks, and bathrooms. The TTU dataset is available at: https://github.com/rllab-snu/TTU-Dataset.
中文: 本文提出TSMCTS新框架,通过结合整洁度评分判别器与蒙特卡洛树搜索,仅需RGB-D相机即可解决桌面整理问题,并在多种真实场景中成功验证了其有效性。
English: This paper introduces TSMCTS, a novel framework that combines a tidiness score discriminator with Monte Carlo tree search to solve tabletop tidying problems using only RGB-D camera input, successfully demonstrating its effectiveness across multiple real-world environments.
Authors:Boxuan Zhang, Ruqi Zhang
Abstract:
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments based on Llama Family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: https://github.com/ZBox1005/CoT-UQ.
Chinese: 本文提出CoT-UQ框架,通过利用大语言模型的思维链推理能力,从每个推理步骤中提取关键信息并评估其重要性,实现了响应式不确定性量化,在多项任务中以平均5.9%的AUROC提升显著优于现有方法,同时降低了计算成本。
English: This paper introduces CoT-UQ, a response-wise uncertainty quantification framework that leverages LLMs' Chain-of-Thought reasoning to extract and evaluate key information from each step, significantly outperforming existing methods by 5.9% AUROC on average while reducing computational costs.
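The final aggregation step is straightforward to illustrate. The function below is a hedged sketch of that stage only, assuming the LLM has already supplied, for each reasoning step, the probability it assigns to the extracted keyword and an importance weight for that step:

def aggregate_step_uncertainty(steps):
    # steps: list of (keyword_prob, importance) pairs, one per reasoning step.
    total_w = sum(w for _, w in steps) or 1.0
    confidence = sum(p * w for p, w in steps) / total_w   # importance-weighted confidence
    return 1.0 - confidence                               # higher value = more uncertain

# e.g. aggregate_step_uncertainty([(0.91, 0.6), (0.40, 1.0), (0.85, 0.2)]) ≈ 0.38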
Authors:Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen
Abstract:
In recent years, generative models have achieved remarkable performance across diverse applications, including image generation, text synthesis, audio creation, video generation, and data augmentation. Diffusion models have emerged as superior alternatives to Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) by addressing their limitations, such as training instability, mode collapse, and poor representation of multimodal distributions. This success has spurred widespread research interest. In the domain of tabular data, diffusion models have begun to showcase similar advantages over GANs and VAEs, achieving significant performance breakthroughs and demonstrating their potential for addressing unique challenges in tabular data modeling. However, while domains like images and time series have numerous surveys summarizing advancements in diffusion models, there remains a notable gap in the literature for tabular data. Despite the increasing interest in diffusion models for tabular data, there has been little effort to systematically review and summarize these developments. This lack of a dedicated survey limits a clear understanding of the challenges, progress, and future directions in this critical area. This survey addresses this gap by providing a comprehensive review of diffusion models for tabular data. Covering works from June 2015, when diffusion models emerged, to December 2024, we analyze nearly all relevant studies, with updates maintained in a \href{https://github.com/Diffusion-Model-Leiden/awesome-diffusion-models-for-tabular-data}{GitHub repository}. Assuming readers possess foundational knowledge of statistics and diffusion models, we employ mathematical formulations to deliver a rigorous and detailed review, aiming to promote developments in this emerging and exciting area.
Chinese: 扩散模型已超越GAN和VAE,解决了其训练不稳定等局限,并在表格数据领域展现出巨大潜力;本综述填补了该领域系统评述的空白,全面回顾了2015至2024年的相关研究。
English: Diffusion models have surpassed GANs and VAEs in addressing their limitations and shown exceptional potential in tabular data applications, yet a comprehensive review was lacking until this survey systematically analyzed relevant studies from 2015 to 2024.
Authors:Yinchuan Li, Xinyu Shao, Jianping Zhang, Haozhi Wang, Leo Maxime Brunswic, Kaiwen Zhou, Jiqian Dong, Kaiyang Guo, Xiu Li, Zhitang Chen, Jun Wang, Jianye Hao
Abstract:
In recent years, the exceptional performance of generative models in generative tasks has sparked significant interest in their integration into decision-making processes. Due to their ability to handle complex data distributions and their strong model capacity, generative models can be effectively incorporated into decision-making systems by generating trajectories that guide agents toward high-reward state-action regions or intermediate sub-goals. This paper presents a comprehensive review of the application of generative models in decision-making tasks. We classify seven fundamental types of generative models: energy-based models, generative adversarial networks, variational autoencoders, normalizing flows, diffusion models, generative flow networks, and autoregressive models. Regarding their applications, we categorize their functions into three main roles: controllers, modelers and optimizers, and discuss how each role contributes to decision-making. Furthermore, we examine the deployment of these models across five critical real-world decision-making scenarios. Finally, we summarize the strengths and limitations of current approaches and propose three key directions for advancing next-generation generative directive models: high-performance algorithms, large-scale generalized decision-making models, and self-evolving and adaptive models.
Chinese: 本文全面综述了生成模型在决策任务中作为控制器、建模器和优化器的应用,并提出了高性能算法、大规模通用决策模型及自适应模型三大发展方向。
English: This paper comprehensively reviews how generative models enhance decision-making by serving as controllers, modelers, and optimizers across various applications, while outlining future directions for high-performance, scalable, and adaptive systems.
Authors:Zekun Wang, Mingyang Yi, Shuchen Xue, Zhenguo Li, Ming Liu, Bing Qin, Zhi-Ming Ma
Abstract:
Diffusion Probabilistic Models (DPMs) have achieved significant success in generative tasks. However, their training and sampling processes suffer from the issue of distribution mismatch. During the denoising process, the input data distributions differ between the training and inference stages, potentially leading to inaccurate data generation. To obviate this, we analyze the training objective of DPMs and theoretically demonstrate that this mismatch can be alleviated through Distributionally Robust Optimization (DRO), which is equivalent to performing robustness-driven Adversarial Training (AT) on DPMs. Furthermore, for the recently proposed Consistency Model (CM), which distills the inference process of the DPM, we prove that its training objective also encounters the mismatch issue. Fortunately, this issue can be mitigated by AT as well. Based on these insights, we propose to conduct efficient AT on both DPM and CM. Finally, extensive empirical studies validate the effectiveness of AT in diffusion-based models. The code is available at https://github.com/kugwzk/AT_Diff.
中文摘要:扩散概率模型在训练与推理阶段存在分布不匹配问题,通过对抗性训练可有效缓解该问题,提升生成数据的准确性。
English Summary: Diffusion Probabilistic Models face distribution mismatch during training and inference, which can be effectively addressed through Adversarial Training to improve generation accuracy.
Authors:Linian Wang, Leye Wang
Abstract:
Privacy concerns in machine learning are heightened by regulations such as the GDPR, which enforces the "right to be forgotten" (RTBF), driving the emergence of machine unlearning as a critical research field. Vertical Federated Learning (VFL) enables collaborative model training by aggregating a sample's features across distributed parties while preserving data privacy at each source. This paradigm has seen widespread adoption in healthcare, finance, and other privacy-sensitive domains. However, existing VFL systems lack robust mechanisms to comply with RTBF requirements, as unlearning methodologies for VFL remain underexplored. In this work, we introduce the first VFL framework with theoretically guaranteed unlearning capabilities, enabling the removal of any data at any time. Unlike prior approaches -- which impose restrictive assumptions on model architectures or data types for removal -- our solution is model- and data-agnostic, offering universal compatibility. Moreover, our framework supports asynchronous unlearning, eliminating the need for all parties to be simultaneously online during the forgetting process. These advancements address critical gaps in current VFL systems, ensuring compliance with RTBF while maintaining operational flexibility. We make all our implementations publicly available at https://github.com/wangln19/vertical-federated-unlearning.
中文摘要:本文提出了首个具备理论保障遗忘能力的纵向联邦学习框架,能够随时删除任意数据且不依赖特定模型或数据类型,同时支持异步操作,有效解决了现有系统难以满足"被遗忘权"合规要求的关键问题。
English Summary: This paper introduces the first Vertical Federated Learning (VFL) framework with theoretically guaranteed unlearning capabilities, enabling model- and data-agnostic removal of any data at any time while supporting asynchronous operations to ensure compliance with the "right to be forgotten."
Authors:Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
Abstract:
This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to $2$ perplexity points. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
中文: 本文提出的Stable-SPAM优化器通过自适应梯度裁剪和归一化技术,有效稳定了4位训练中的梯度范数,在困惑度和收敛速度上均优于Adam和SPAM等现有方法。
English: This paper introduces Stable-SPAM, an enhanced optimizer that stabilizes gradient norms in 4-bit training through adaptive gradient clipping and normalization, outperforming existing methods like Adam and SPAM by achieving lower perplexity and faster convergence.
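To make the gradient-stabilization steps concrete, the following is a minimal PyTorch sketch of the two mechanisms named in the abstract: adaptive spike clipping against a tracked historical maximum, and rescaling by historical $l_2$-norm statistics. The function name, EMA formulation, and constants are illustrative assumptions, not the released StableSPAM implementation.

```python
import torch

def stabilize_gradient(grad, state, beta=0.999, theta=50.0):
    """Minimal sketch of Stable-SPAM-style gradient stabilization.
    Names, the EMA form, and the constants are illustrative."""
    g = grad.clone()

    # (1) adaptive spike clipping: track an EMA of the largest magnitude seen
    gmax = g.abs().max()
    state["max_ema"] = beta * state.get("max_ema", gmax) + (1 - beta) * gmax
    threshold = (theta * state["max_ema"]).item()
    g = torch.clamp(g, min=-threshold, max=threshold)

    # (2) rescale the whole gradient matrix by its historical l2-norm statistics
    gnorm = g.norm()
    state["norm_ema"] = beta * state.get("norm_ema", gnorm) + (1 - beta) * gnorm
    g = g * (state["norm_ema"] / (gnorm + 1e-8))

    return g

# usage inside a training step, before the optimizer update
# (per_param_state is a hypothetical dict keyed by parameter):
# for p in model.parameters():
#     if p.grad is not None:
#         p.grad.copy_(stabilize_gradient(p.grad, per_param_state[p]))
```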
Authors:Maksim Zhdanov, Max Welling, Jan-Willem van de Meent
Abstract:
Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, PDE solving, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
中文摘要:Erwin是一种结合树算法效率与注意力机制表达力的分层变换器,通过球树分区实现线性时间计算,在保持多尺度交互能力的同时,在多个物理系统领域展现出卓越性能。
English summary: Erwin is a hierarchical transformer that combines tree-based efficiency with attention expressivity, enabling linear-time computation while capturing multi-scale interactions across diverse physical systems.
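The following sketch illustrates the core computational pattern the abstract describes: attention restricted to fixed-size local neighborhoods, so total cost grows linearly with the number of nodes. The chunking below is a crude stand-in for a real ball tree, and all names are hypothetical rather than Erwin's actual modules.

```python
import torch

def ball_local_attention(x, pos, ball_size=64):
    """Sketch of fixed-size local attention in the spirit of ball-tree
    partitioning. x: (N, d) node features, pos: (N, 3) coordinates.
    In a real model q, k, v would be learned projections."""
    N, d = x.shape
    # crude spatial ordering so nearby points tend to share a chunk
    # (a stand-in for building and traversing a proper ball tree)
    order = torch.argsort(pos.sum(dim=-1))
    x_sorted = x[order]

    # pad so N is a multiple of ball_size, then reshape into balls
    pad = (-N) % ball_size
    if pad:
        x_sorted = torch.cat([x_sorted, x_sorted.new_zeros(pad, d)])
    balls = x_sorted.view(-1, ball_size, d)            # (num_balls, ball_size, d)

    # plain scaled dot-product attention inside each ball: linear in N overall
    q = k = v = balls
    attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
    out = (attn @ v).reshape(-1, d)[:N]

    # undo the ordering so outputs line up with the input nodes
    result = torch.empty_like(x)
    result[order] = out
    return result
```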
Authors:Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Abstract:
Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. FADE evaluates alignment across four key metrics - Clarity, Responsiveness, Purity, and Faithfulness - and systematically quantifies the causes of the misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE
中文摘要:本文提出FADE框架,用于评估自动化可解释性流程中特征与描述的匹配度,旨在弥补标准化评估方法的缺失,并揭示生成精确描述所面临的核心挑战。
English Summary: The paper introduces FADE, a scalable framework for evaluating feature-description alignment in automated interpretability pipelines, addressing the lack of standardized evaluation methods and highlighting challenges in generating accurate descriptions.
Authors:Md Saidul Hoque Anik, Ariful Azad
Abstract:
Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embeddings can take a long time, especially for larger datasets. Our analysis shows that the gradient computation of embedding is one of the dominant functions in the translation-based KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. We create a general framework for training KG models using sparse kernels and implement four models, namely TransE, TransR, TransH, and TorusE. Our sparse implementations exhibit up to 5.3x speedup on the CPU and up to 4.2x speedup on the GPU with a significantly lower GPU memory footprint. The speedups are consistent across large and small datasets for a given model. Our proposed sparse approach can be extended to accelerate other translation-based (such as TransC, TransM, etc.) and non-translational (such as DistMult, ComplEx, RotatE, etc.) models as well. An implementation of the SpTransX framework is publicly available as a Python package at https://github.com/HipGraph/SpTransX.
Chinese: 该研究提出了一种基于稀疏内核的框架,利用SpMM加速知识图谱嵌入训练,在多种模型上实现了CPU最高5.3倍、GPU最高4.2倍的加速效果,同时显著降低了内存占用。
English: The study introduces a sparse kernel-based framework using SpMM to accelerate knowledge graph embedding training, achieving up to 5.3x speedup on CPUs and 4.2x on GPUs while reducing memory usage across various models.
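As an illustration of how SpMM can replace per-triple gather operations in translation-based embedding training, the sketch below computes batched TransE dissimilarities through sparse selection matrices. It is a reconstruction of the idea under stated assumptions, not the SpTransX API.

```python
import torch

def transe_batch_scores_spmm(heads, rels, tails, ent_emb, rel_emb):
    """Compute TransE L1 dissimilarities ||h + r - t||_1 for a batch of
    triples using SpMM instead of per-triple gathers (illustrative)."""
    B = heads.numel()
    N, d = ent_emb.shape

    def incidence(indices, num_cols):
        # B x num_cols sparse selection matrix: row i picks indices[i]
        rows = torch.arange(B)
        idx = torch.stack([rows, indices])
        vals = torch.ones(B)
        return torch.sparse_coo_tensor(idx, vals, (B, num_cols))

    Sh = incidence(heads, N)
    St = incidence(tails, N)
    Sr = incidence(rels, rel_emb.shape[0])

    # one SpMM per embedding table replaces B scattered lookups
    h = torch.sparse.mm(Sh, ent_emb)
    t = torch.sparse.mm(St, ent_emb)
    r = torch.sparse.mm(Sr, rel_emb)

    return (h + r - t).norm(p=1, dim=-1)

# toy usage:
# scores = transe_batch_scores_spmm(torch.tensor([0, 2]), torch.tensor([1, 0]),
#                                   torch.tensor([3, 1]),
#                                   torch.randn(5, 16), torch.randn(3, 16))
```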
Authors:Hansung Choi, Daewon Seo
Abstract:
The concept of a minimax classifier is well-established in statistical decision theory, but its implementation via neural networks remains challenging, particularly in scenarios with imbalanced training data having a limited number of samples for minority classes. To address this issue, we propose a novel minimax learning algorithm designed to minimize the risk of worst-performing classes. Our algorithm iterates through two steps: a minimization step that trains the model based on a selected target prior, and a maximization step that updates the target prior towards the adversarial prior for the trained model. In the minimization, we introduce a targeted logit-adjustment loss function that efficiently identifies optimal decision boundaries under the target prior. Moreover, based on a new prior-dependent generalization bound that we obtained, we theoretically prove that our loss function has a better generalization capability than existing loss functions. During the maximization, we refine the target prior by shifting it towards the adversarial prior, depending on the worst-performing classes rather than on per-class risk estimates. Our maximization method is particularly robust in the regime of a small number of samples. Additionally, to adapt to overparameterized neural networks, we partition the entire training dataset into two subsets: one for model training during the minimization step and the other for updating the target prior during the maximization step. Our proposed algorithm has a provable convergence property, and empirical results indicate that our algorithm performs better than or is comparable to existing methods. All codes are publicly available at https://github.com/hansung-choi/TLA-linear-ascent.
Chinese: 我们提出了一种新的极小极大学习算法,通过迭代优化模型参数和目标先验来最小化最差类别的风险,在理论保证和实验验证下,对样本不平衡数据实现了更优的泛化能力和鲁棒性。
English: We introduce a novel minimax learning algorithm that iteratively optimizes model parameters and target priors to minimize worst-class risk, achieving superior generalization and robustness on imbalanced data with theoretical guarantees and empirical validation.
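A targeted logit-adjustment loss of the kind described can be sketched as below, where the logits are shifted by the log-ratio of the training prior to the current target prior before cross-entropy is applied. The exact form and the temperature `tau` used by the authors may differ.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, labels, target_prior, train_prior, tau=1.0):
    """Sketch of a logit-adjustment loss toward a target prior.

    logits: (B, C), labels: (B,), target_prior and train_prior: (C,)
    probability vectors. Shifting logits by tau * log(train_prior /
    target_prior) moves the decision boundary so the classifier behaves
    as if trained under target_prior instead of the empirical class
    frequencies. tau is illustrative."""
    adjustment = tau * (train_prior.log() - target_prior.log())
    return F.cross_entropy(logits + adjustment, labels)
```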
Authors:Guoqi Yu, Yaoming Li, Juncheng Wang, Xiaoyu Guo, Angelica I. Aviles-Rivero, Tong Yang, Shujun Wang
Abstract:
Recent advancements have progressively incorporated frequency-based techniques into deep learning models, leading to notable improvements in accuracy and efficiency for time series analysis tasks. However, the Mid-Frequency Spectrum Gap in the real-world time series, where the energy is concentrated at the low-frequency region while the middle-frequency band is negligible, hinders the ability of existing deep learning models to extract the crucial frequency information. Additionally, the shared Key-Frequency in multivariate time series, where different time series share indistinguishable frequency patterns, is rarely exploited by existing literature. This work introduces a novel module, Adaptive Mid-Frequency Energy Optimizer, based on convolution and residual learning, to emphasize the significance of mid-frequency bands. We also propose an Energy-based Key-Frequency Picking Block to capture shared Key-Frequency, which achieves superior inter-series modeling performance with fewer parameters. A novel Key-Frequency Enhanced Training strategy is employed to further enhance Key-Frequency modeling, where spectral information from other channels is randomly introduced into each channel. Our approach advanced multivariate time series forecasting on the challenging Traffic, ECL, and Solar benchmarks, reducing MSE by 4%, 6%, and 5% compared to the previous SOTA iTransformer. Code is available at this GitHub Repository: https://github.com/Levi-Ackman/ReFocus.
Chinese Summary: 本研究提出自适应中频能量优化器和基于能量的关键频率提取模块,解决了现实时间序列中的中频频谱间隙问题并利用多变量序列共享的关键频率模式,在多个基准测试中实现了最先进的预测性能并降低了均方误差。
English Summary: This study introduces an Adaptive Mid-Frequency Energy Optimizer and an Energy-based Key-Frequency Picking Block to address the mid-frequency spectrum gap and leverage shared key-frequency patterns in multivariate time series, achieving state-of-the-art forecasting performance with reduced mean squared error on multiple benchmarks.
Authors:Taeyoung Yun, Kiyoung Om, Jaewoo Lee, Sujin Yun, Jinkyoo Park
Abstract:
Optimizing high-dimensional and complex black-box functions is crucial in numerous scientific applications. While Bayesian optimization (BO) is a powerful method for sample-efficient optimization, it struggles with the curse of dimensionality and scaling to thousands of evaluations. Recently, leveraging generative models to solve black-box optimization problems has emerged as a promising framework. However, those methods often underperform compared to BO methods due to limited expressivity and difficulty of uncertainty estimation in high-dimensional spaces. To overcome these issues, we introduce \textbf{DiBO}, a novel framework for solving high-dimensional black-box optimization problems. Our method iterates two stages. First, we train a diffusion model to capture the data distribution and deep ensembles to predict function values with uncertainty quantification. Second, we cast the candidate selection as a posterior inference problem to balance exploration and exploitation in high-dimensional spaces. Concretely, we fine-tune diffusion models to amortize posterior inference. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines across synthetic and real-world tasks. Our code is publicly available \href{https://github.com/umkiyoung/DiBO}{here}.
中文: DiBO是一种新颖框架,结合扩散模型和深度集成方法,通过在高维空间中平衡探索与利用,有效解决黑盒优化问题,实验表明其性能优于现有先进方法。
English: DiBO is a novel framework that combines diffusion models and deep ensembles to effectively solve high-dimensional black-box optimization problems by balancing exploration and exploitation, outperforming existing methods in experiments.
Authors:Yancheng Zhang, Jiaqi Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou
Abstract:
Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs (scalability issues). We observe that the Transformer's operational complexity scales quadratically with the number of input tokens, making it essential to reduce the input token length. Notably, each token varies in importance, and many inputs contain redundant tokens. Additionally, prior private inference methods that rely on high-degree polynomial approximations for non-linear activations are computationally expensive. Therefore, reducing the polynomial degree for less important tokens can significantly accelerate private inference. Building on these observations, we propose \textit{CipherPrune}, an efficient and scalable private inference framework that includes a secure encrypted token pruning protocol, a polynomial reduction protocol, and corresponding Transformer network optimizations. At the protocol level, encrypted token pruning adaptively removes unimportant tokens from encrypted inputs in a progressive, layer-wise manner. Additionally, encrypted polynomial reduction assigns lower-degree polynomials to less important tokens after pruning, enhancing efficiency without decryption. At the network level, we introduce protocol-aware network optimization via a gradient-based search to maximize pruning thresholds and polynomial reduction conditions while maintaining the desired accuracy. Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately $6.1\times$ for 128-token inputs and $10.6\times$ for 512-token inputs, compared to previous methods, with only a marginal drop in accuracy. The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference.
中文:CipherPrune框架通过加密令牌剪枝和多项式降阶协议,在保持精度的同时显著提升了私有Transformer推理的效率,实现了最高10.6倍的加速效果。
English: The proposed CipherPrune framework addresses efficiency and scalability challenges in private Transformer inference by securely pruning unimportant tokens and reducing polynomial approximations, achieving significant speedups with minimal accuracy loss.
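The plaintext sketch below shows the two ideas behind the protocols: progressive removal of low-importance tokens, and assignment of cheaper low-degree polynomial approximations to the less important survivors. The real system performs these steps under encryption, and the thresholds and degrees here are illustrative.

```python
import torch

def prune_and_assign_degree(hidden, importance, keep_ratio=0.7,
                            high_deg=8, low_deg=3):
    """Plaintext sketch of token pruning plus polynomial-degree assignment.

    hidden:     (T, d) token representations at the current layer
    importance: (T,)   per-token importance scores"""
    T = hidden.shape[0]
    k = max(1, int(keep_ratio * T))
    keep = torch.topk(importance, k).indices.sort().values   # preserve order

    kept_hidden = hidden[keep]
    kept_scores = importance[keep]

    # more important survivors get the accurate (expensive) polynomial,
    # the rest get a cheaper low-degree approximation of the activation
    median = kept_scores.median()
    degrees = torch.where(kept_scores >= median,
                          torch.full_like(kept_scores, high_deg),
                          torch.full_like(kept_scores, low_deg)).long()
    return kept_hidden, degrees
```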
Authors:Avinandan Bose, Laurent Lessard, Maryam Fazel, Krishnamurthy Dj Dvijotham
Abstract:
The rise of foundation models fine-tuned on human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of robustness of learning algorithms against such attacks. Existing research on provable certified robustness against data poisoning attacks primarily focuses on certifying robustness for static adversaries who modify a fraction of the dataset used to train the model before the training algorithm is applied. In practice, particularly when learning from human feedback in an online sense, adversaries can observe and react to the learning process and inject poisoned samples that optimize adversarial objectives better than when they are restricted to poisoning a static dataset once, before the learning algorithm is applied. Indeed, it has been shown in prior work that online dynamic adversaries can be significantly more powerful than static ones. We present a novel framework for computing certified bounds on the impact of dynamic poisoning, and use these certificates to design robust learning algorithms. We give an illustration of the framework for the mean estimation and binary classification problems and outline directions for extending this in further work. The code to implement our certificates and replicate our results is available at https://github.com/Avinandan22/Certified-Robustness.
中文: 本研究提出了一种新颖框架,用于计算针对动态数据投毒攻击的认证边界,其中攻击者在学习过程中自适应地注入恶意样本,并通过均值估计和二元分类的应用展示了该框架如何提升算法的鲁棒性。
English: This study introduces a novel framework for computing certified bounds against dynamic data poisoning attacks, where adversaries adaptively inject malicious samples during the learning process, and demonstrates its application in mean estimation and binary classification to enhance algorithm robustness.
Authors:Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng
Abstract:
The applications of large language models (LLMs) in various biological domains have been explored recently, but their ability to reason about complex biological systems such as pathways, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments, remains underexplored. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi-scale research targets. Our evaluation of methods such as CoT and graph-augmented reasoning shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao-ht/BioMaze.
中文摘要:本研究探索大语言模型在生物通路推理中的潜力,揭示了其在复杂系统中的局限性,并提出PathSeeker这一通过基于子图的交互导航来提升推理能力的智能体。
English Summary: This study investigates the potential of large language models (LLMs) in biological pathway reasoning, revealing their limitations in complex systems and proposing PathSeeker, an interactive agent that improves reasoning through subgraph-based navigation.
Authors:Ruichu Cai, Junxian Huang, Zhenhui Yang, Zijian Li, Emadeldeen Eldele, Min Wu, Fuchun Sun
Abstract:
Time series domain adaptation aims to transfer the complex temporal dependence from the labeled source domain to the unlabeled target domain. Recent advances leverage the stable causal mechanism over observed variables to model the domain-invariant temporal dependence. However, modeling precise causal structures in high-dimensional data, such as videos, remains challenging. Additionally, direct causal edges may not exist among observed variables (e.g., pixels). These limitations hinder the applicability of existing approaches to real-world scenarios. To address these challenges, we find that the high-dimension time series data are generated from the low-dimension latent variables, which motivates us to model the causal mechanisms of the temporal latent process. Based on this intuition, we propose a latent causal mechanism identification framework that guarantees the uniqueness of the reconstructed latent causal structures. Specifically, we first identify latent variables by utilizing sufficient changes in historical information. Moreover, by enforcing the sparsity of the relationships of latent variables, we can achieve identifiable latent causal structures. Built on the theoretical results, we develop the Latent Causality Alignment (LCA) model that leverages variational inference, which incorporates an intra-domain latent sparsity constraint for latent structure reconstruction and an inter-domain latent sparsity constraint for domain-invariant structure reconstruction. Experiment results on eight benchmarks show a general improvement in the domain-adaptive time series classification and forecasting tasks, highlighting the effectiveness of our method in real-world scenarios. Codes are available at https://github.com/DMIRLAB-Group/LCA.
中文摘要:本文提出一种潜在因果对齐(LCA)框架,通过建模潜在变量的因果机制来获得可识别的因果结构,在八个基准测试的时间序列领域自适应任务中展现出优越性能。
English Summary: This paper introduces a Latent Causality Alignment (LCA) framework that models causal mechanisms in latent variables to achieve identifiable causal structures, demonstrating improved performance in time series domain adaptation tasks across eight benchmarks.
Authors:Liancheng Fang, Aiwei Liu, Hengrui Zhang, Henry Peng Zou, Weizhi Zhang, Philip S. Yu
Abstract:
Large Language models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose a novel in-context learning framework: TabGen-ICL, to enhance the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represent the residual between currently generated samples and true data distributions. This approach serves two purposes: locally, it provides more effective in-context learning examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate by a margin of $3.5\%-42.2\%$ on fidelity metrics. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is provided in the \href{https://github.com/fangliancheng/TabGEN-ICL}{link}.
中文: 本文提出TabGen-ICL框架,通过迭代选择代表性样本来缩小生成数据与真实分布之间的差距,在五个真实表格数据集上的实验表明,该方法较随机选择策略显著提升了生成质量,错误率降低了3.5%至42.2%。
English: This paper introduces TabGen-ICL, an iterative in-context learning framework that enhances tabular data generation by selecting representative examples to narrow the gap between synthetic and real data, significantly outperforming random selection with error reductions of 3.5% to 42.2%.
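The iterative selection idea can be sketched as follows: at each round, pick real rows from regions the current synthetic set covers poorly and use them as the next in-context examples. The nearest-neighbor distance rule below is an illustrative stand-in for the paper's residual-based retrieval.

```python
import numpy as np

def pick_residual_examples(real, generated, k=8):
    """Select real rows that the current synthetic set covers worst.

    real, generated: float arrays of shape (n_real, d) and (n_gen, d)."""
    # distance from every real row to its closest generated row
    diffs = real[:, None, :] - generated[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(-1)).min(axis=1)
    # the least-covered real rows become the next prompt examples
    return real[np.argsort(nearest)[-k:]]

# each iteration: examples = pick_residual_examples(real_X, gen_X)
# prompt the frozen LLM with `examples`, append its new rows to gen_X,
# and repeat until enough synthetic rows have been produced.
```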
Authors:Zahra Shahrooei, Ali Baheri
Abstract:
The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance, frequently without considering safety. In contrast, safe reinforcement learning seeks to reduce or avoid unsafe behavior. This paper views safety as taking actions with more predictable consequences under environment stochasticity and introduces a temporal difference algorithm that uses optimal transport theory to quantify the uncertainty associated with actions. By integrating this uncertainty score into the decision-making objective, the agent is encouraged to favor actions with more predictable outcomes. We theoretically prove that our algorithm leads to a reduction in the probability of visiting unsafe states. We evaluate the proposed algorithm on several case studies in the presence of various forms of environment uncertainty. The results demonstrate that our method not only provides safer behavior but also maintains the performance. A Python implementation of our algorithm is available at \href{https://github.com/SAILRIT/Risk-averse-TD-Learning}{https://github.com/SAILRIT/OT-guided-TD-Learning}.
Chinese: 本文提出了一种安全强化学习算法,利用最优传输理论量化行动不确定性,引导智能体选择结果更可预测的行动,在降低不安全状态访问概率的同时保持性能。
English: This paper introduces a safe reinforcement learning algorithm that uses optimal transport theory to quantify action uncertainty, encouraging predictable outcomes and reducing unsafe state visits while maintaining performance.
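One way to picture the algorithm is a tabular TD update whose target is discounted by an optimal-transport dispersion score over sampled outcomes of the chosen action. The construction below is a heavily simplified, assumption-laden sketch rather than the released implementation; in particular, the uncertainty score here is only one plausible instantiation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def risk_adjusted_td_update(Q, s, a, r, s_next, next_value_samples,
                            alpha=0.1, gamma=0.99, lam=0.5):
    """Sketch of an OT-penalized TD update.

    Q: (num_states, num_actions) table; next_value_samples: 1-D array of
    bootstrapped value estimates for (s, a), e.g. from repeated samples."""
    # OT-based dispersion: transport cost between the sample distribution
    # and a point mass at its mean (a measure of outcome predictability)
    mean_val = float(np.mean(next_value_samples))
    uncertainty = wasserstein_distance(next_value_samples, [mean_val])

    # fold the uncertainty score into the TD target so the agent prefers
    # actions with more predictable consequences
    target = r + gamma * np.max(Q[s_next]) - lam * uncertainty
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```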
Authors:Tuan-Anh Yang, Truong-Son Hy, Phuong D. Dao
Abstract:
This paper introduces a novel multiscale object-based graph neural network called MOB-GCN for hyperspectral image (HSI) classification. The central aim of this study is to enhance feature extraction and classification performance by utilizing multiscale object-based image analysis (OBIA). Traditional pixel-based methods often suffer from low accuracy and speckle noise, while single-scale OBIA approaches may overlook crucial information of image objects at different levels of detail. MOB-GCN addresses this issue by extracting and integrating features from multiple segmentation scales to improve classification results using the Multiresolution Graph Network (MGN) architecture that can model fine-grained and global spatial patterns. By constructing a dynamic multiscale graph hierarchy, MOB-GCN offers a more comprehensive understanding of the intricate details and global context of HSIs. Experimental results demonstrate that MOB-GCN consistently outperforms single-scale graph convolutional networks (GCNs) in terms of classification accuracy, computational efficiency, and noise reduction, particularly when labeled data is limited. The implementation of MOB-GCN is publicly available at https://github.com/HySonLab/MultiscaleHSI
中文: 本文提出MOB-GCN多尺度对象图神经网络,通过整合多尺度分割特征改进高光谱图像分类,在精度和效率上均优于单尺度方法。
English: This paper presents MOB-GCN, a multiscale object-based graph neural network that enhances hyperspectral image classification by integrating features from multiple segmentation scales, demonstrating superior accuracy and efficiency over single-scale methods.
Authors:Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang
Abstract:
Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not necessarily limited to, the degree of similarity between the modalities and the balance between redundant and unique information they provide for the task. Our findings suggest that alignment may not be universally beneficial; rather, its impact on performance varies depending on the dataset and task. These insights can help practitioners determine whether increasing alignment between modalities is advantageous or, in some cases, detrimental to achieving optimal performance. Code is released at https://github.com/MeganTj/multimodal_alignment.
中文摘要:最新研究表明,独立训练的大规模单模态模型会自发产生隐式对齐,但其效果取决于模态相似性和信息平衡等数据特征,因此对齐并非总能提升任务性能。
English Summary: Recent research reveals that implicit alignment emerges in independently trained large-scale unimodal models, but its effectiveness depends on data characteristics like modality similarity and information balance, making alignment not universally beneficial for performance.
Authors:Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
Abstract:
Transformers with linear attention enable fast and parallel training. Moreover, they can be formulated as Recurrent Neural Networks (RNNs), for efficient linear-time inference. While extensively evaluated in causal sequence modeling, they have yet to be extended to the bidirectional setting. This work introduces the LION framework, establishing new theoretical foundations for linear transformers in bidirectional sequence modeling. LION constructs a bidirectional RNN equivalent to full Linear Attention. This extends the benefits of linear transformers: parallel training, and efficient inference, into the bidirectional setting. Using LION, we cast three linear transformers to their bidirectional form: LION-LIT, the bidirectional variant corresponding to (Katharopoulos et al., 2020); LION-D, extending RetNet (Sun et al., 2023); and LION-S, a linear transformer with a stable selective mask inspired by selectivity of SSMs (Dao & Gu, 2024). Replacing the attention block with LION (-LIT, -D, -S) achieves performance on bidirectional tasks that approaches that of Transformers and State-Space Models (SSMs), while delivering significant improvements in training speed. Our implementation is available in http://github.com/LIONS-EPFL/LION.
Chinese: LION框架将线性变换器扩展至双向序列建模,实现了并行训练与高效推理,在双向任务中性能接近Transformer和状态空间模型,同时显著提升了训练速度。
English: The LION framework extends linear transformers to bidirectional sequence modeling, enabling parallel training and efficient inference while achieving performance comparable to Transformers and SSMs with improved training speed.
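The equivalence that LION builds on, full linear attention computed by one forward and one backward recurrent scan, can be sketched for a single sequence as below. Decay and selective masks used by LION-D and LION-S are omitted, and the feature map is illustrative.

```python
import torch

def feature_map(x):
    # simple positive feature map commonly used for linear attention
    return torch.nn.functional.elu(x) + 1

def bidirectional_linear_attention(q, k, v):
    """Full (non-causal) linear attention via two recurrent scans.

    q, k, v: (T, d) tensors for a single sequence."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.unsqueeze(-1) * v.unsqueeze(1)        # (T, d, d) outer products

    fwd_state = torch.cumsum(kv, dim=0)              # forward RNN states
    bwd_state = torch.flip(torch.cumsum(torch.flip(kv, [0]), dim=0), [0])
    full_state = fwd_state + bwd_state - kv          # avoid double-counting i

    fwd_norm = torch.cumsum(phi_k, dim=0)
    bwd_norm = torch.flip(torch.cumsum(torch.flip(phi_k, [0]), dim=0), [0])
    full_norm = fwd_norm + bwd_norm - phi_k

    num = torch.einsum("td,tde->te", phi_q, full_state)
    den = (phi_q * full_norm).sum(-1, keepdim=True) + 1e-6
    return num / den
```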
Authors:Sayedmohammadreza Rastegari, Sina Tabakhi, Xianyuan Liu, Wei Sang, Haiping Lu
Abstract:
In computational structural biology, predicting metal-binding sites and their corresponding metal types is challenging due to the complexity of protein structures and interactions. Conventional sequence- and structure-based prediction approaches cannot capture the complex evolutionary relationships driving these interactions to facilitate understanding, while recent co-evolution-based approaches do not fully consider the entire structure of the co-evolved residue network. In this paper, we introduce MBGNN (Metal-Binding Graph Neural Network) that utilizes the entire co-evolved residue network and effectively captures the complex dependencies within protein structures via graph neural networks to enhance the prediction of co-evolved metal-binding residues and their associated metal types. Experimental results on a public dataset show that MBGNN outperforms existing co-evolution-based metal-binding prediction methods, and it is also competitive against recent sequence-based methods, showing the potential of integrating co-evolutionary insights with advanced machine learning to deepen our understanding of protein-metal interactions. The MBGNN code is publicly available at https://github.com/SRastegari/MBGNN.
Chinese: MBGNN是一种图神经网络模型,通过利用完整的共进化残基网络,提高了蛋白质中金属结合位点及其类型的预测准确性,优于现有方法,并深化了对蛋白质-金属相互作用的理解。
English: MBGNN, a graph neural network model, improves the prediction of metal-binding sites and their types in proteins by leveraging the full co-evolved residue network, outperforming existing methods and advancing the understanding of protein-metal interactions.
Authors:Lin Ge, Hengrui Cai, Runzhe Wan, Yang Xu, Rui Song
Abstract:
To make effective decisions, it is important to have a thorough understanding of the causal relationships among actions, environments, and outcomes. This review aims to surface three crucial aspects of decision-making through a causal lens: 1) the discovery of causal relationships through causal structure learning, 2) understanding the impacts of these relationships through causal effect learning, and 3) applying the knowledge gained from the first two aspects to support decision making via causal policy learning. Moreover, we identify challenges that hinder the broader utilization of causal decision-making and discuss recent advances in overcoming these challenges. Finally, we provide future research directions to address these challenges and to further enhance the implementation of causal decision-making in practice, with real-world applications illustrated based on the proposed causal decision-making. We aim to offer a comprehensive methodology and practical implementation framework by consolidating various methods in this area into a Python-based collection. URL: https://causaldm.github.io/Causal-Decision-Making.
Authors:Heng Gao, Zhuolin He, Jian Pu
Abstract:
To deploy machine learning models in the real world, researchers have proposed many OOD detection algorithms to help models identify unknown samples during the inference phase and prevent them from making untrustworthy predictions. Unlike methods that rely on extra data for outlier exposure training, post hoc methods detect Out-of-Distribution (OOD) samples by developing scoring functions, which are model agnostic and do not require additional training. However, previous post hoc methods may fail to capture the geometric cues embedded in network representations. Thus, in this study, we propose a novel score function based on optimal transport theory, named OTOD, for OOD detection. We utilize information from features, logits, and the softmax probability space to calculate the OOD score for each test sample. Our experiments show that combining this information can boost the performance of OTOD by a certain margin. Experiments on the CIFAR-10 and CIFAR-100 benchmarks demonstrate the superior performance of our method. Notably, OTOD outperforms the state-of-the-art method GEN by 7.19% in the mean FPR@95 on the CIFAR-10 benchmark using ResNet-18 as the backbone, and by 12.51% in the mean FPR@95 using WideResNet-28 as the backbone. In addition, we provide theoretical guarantees for OTOD. The code is available at https://github.com/HengGao12/OTOD.
中文: 本研究提出基于最优传输理论的OTOD方法,通过综合特征、逻辑值和Softmax概率信息进行分布外检测,在CIFAR基准测试中显著优于现有最优方法。
English: This study introduces OTOD, a novel post hoc OOD detection method using optimal transport theory that outperforms state-of-the-art approaches by combining features, logits, and softmax probabilities, achieving significant improvements on CIFAR benchmarks.
Authors:Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen
Abstract:
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers, and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
Chinese: Curie AI代理框架通过可靠性、控制性和可解释性模块将严谨性融入科学实验过程,在涵盖46个计算机科学问题的基准测试中,准确性提升了3.4倍。
English: The Curie AI agent framework enhances scientific experimentation by embedding rigor through reliability, control, and interpretability modules, achieving a 3.4x improvement in accuracy on a benchmark of 46 computer science questions.
Authors:Zheling Tan, Kexin Ding, Jin Gao, Mu Zhou, Dimitris Metaxas, Shaoting Zhang, Dequan Wang
Abstract:
Foundational models (FMs) have made significant strides in the healthcare domain. Yet the data silo challenge and privacy concern remain in healthcare systems, hindering safe medical data sharing and collaborative model development among institutions. The collection and curation of scalable clinical datasets increasingly become the bottleneck for training strong FMs. In this study, we propose Medical Foundation Models Merging (MedForge), a cooperative framework enabling a community-driven medical foundation model development, meanwhile preventing the information leakage of raw patient data and mitigating synchronization model development issues across clinical institutions. MedForge offers a bottom-up model construction mechanism by flexibly merging task-specific Low-Rank Adaptation (LoRA) modules, which can adapt to downstream tasks while retaining original model parameters. Through an asynchronous LoRA module integration scheme, the resulting composite model can progressively enhance its comprehensive performance on various clinical tasks. MedForge shows strong performance on multiple clinical datasets (e.g., breast cancer, lung cancer, and colon cancer) collected from different institutions. Our major findings highlight the value of collaborative foundation models in advancing multi-center clinical collaboration effectively and cohesively. Our code is publicly available at https://github.com/TanZheling/MedForge.
中文: MedForge提出了一种通过融合任务特定LoRA模块的协作式医疗基础模型开发框架,既保护原始患者数据隐私又实现跨机构协同建模,在多种临床数据集上展现出卓越性能。
English: MedForge introduces a collaborative framework for developing medical foundation models by merging task-specific LoRA modules, enabling multi-institutional cooperation without sharing raw patient data and demonstrating strong performance across various clinical datasets.
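The bottom-up merging step can be pictured as combining institution-specific low-rank updates on top of frozen base weights, so no raw patient data leaves a site. The uniform weighted sum below is a simplification of MedForge's asynchronous integration scheme, and all names are hypothetical.

```python
import torch

def merge_lora_modules(base_weight, lora_modules, weights=None):
    """Merge task-specific LoRA adapters into a frozen base weight matrix.

    base_weight:  (out, in) frozen weight matrix
    lora_modules: list of (A, B) pairs with A: (out, r), B: (r, in)
    weights:      optional mixing coefficients, defaulting to a uniform mix"""
    if weights is None:
        weights = [1.0 / len(lora_modules)] * len(lora_modules)
    merged = base_weight.clone()
    for w, (A, B) in zip(weights, lora_modules):
        merged += w * (A @ B)      # each adapter contributes a low-rank update
    return merged

# hypothetical usage with adapters trained at different institutions:
# W_new = merge_lora_modules(layer.weight.data,
#                            [(A_breast, B_breast), (A_lung, B_lung)])
```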
Authors:Alan Zhu, Jiaqi Ma, Qiaozhu Mei
Abstract:
As large graph datasets become increasingly common across many fields, sampling is often needed to reduce the graphs into manageable sizes. This procedure raises critical questions about representativeness as no sample can capture the properties of the original graph perfectly, and different parts of the graph are not evenly affected by the loss. Recent work has shown that the distances from the non-sampled nodes to the sampled nodes can be a quantitative indicator of bias and fairness in graph machine learning. However, to our knowledge, there is no method for evaluating how a sampling method affects the distribution of shortest-path distances without actually performing the sampling and shortest-path calculation.
In this paper, we present an accurate and efficient framework for estimating the distribution of shortest-path distances to the sample, applicable to a wide range of sampling methods and graph structures. Our framework is faster than empirical methods and only requires the specification of degree distributions. We also extend our framework to handle graphs with community structures. While this introduces a decrease in accuracy, we demonstrate that our framework remains highly accurate on downstream comparison-based tasks. Code is publicly available at https://github.com/az1326/shortest_paths.
中文: 本文提出了一种高效框架,用于估计大图中采样节点间最短路径距离的分布,适用于多种采样方法且仅需度分布信息,同时可扩展到具有社区结构的图。
English: This paper introduces an efficient framework for estimating the distribution of shortest-path distances to sampled nodes in large graphs, applicable to various sampling methods and requiring only degree distributions, while also extending to graphs with community structures.
Authors:Aryan Jadon, Avinash Patil, Shashank Kumar
Abstract:
Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains requiring precise information extraction from complex documents. Current evaluation methodologies relying on document-level metrics inadequately capture token-resolution retrieval accuracy that is critical for domain-related documents. We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance. First, we introduce token-aware metrics Precision $\Omega$ and Intersection-over-Union (IoU) that quantify context preservation versus information density trade-offs inherent in technical texts. Second, we develop a reasoning model-driven pipeline using instruction-tuned LLMs (DeepSeek-R1, DeepSeek-R1 distilled variants, and Phi-4) to generate context-anchored QA pairs with discontinuous reference spans across three specialized corpora: SEC 10-K filings (finance), biomedical abstracts (PubMed), and APT threat reports (cybersecurity).
Our empirical analysis reveals critical insights: smaller chunks (less than 10 tokens) improve precision by 31-42% (IoU = 0.071 vs. baseline 0.053) at a cost in recall (-18%), while domain-specific embedding strategies yield 22% variance in optimal chunk sizing (5-20 tokens). The DeepSeek-R1-Distill-Qwen-32B model demonstrates superior concept alignment (+14% mean IoU over alternatives), though no configuration universally dominates. Financial texts favor larger chunks for risk factor coverage (Recall = 0.81 at size = 20), whereas cybersecurity content benefits from atomic segmentation (Precision $\Omega = 0.28$ at size = 5).
Our code is available on https://github.com/aryan-jadon/Synthetic-Data-Generation-and-Evaluation-using-Reasoning-Model
中文摘要:本研究提出了一种结合细粒度评估指标与合成数据生成的框架,以提升检索增强生成(RAG)在技术领域的性能,发现金融、生物医学和网络安全等专业文献的最佳文本块大小与嵌入策略存在显著差异。
English Summary: This study introduces a framework combining token-level evaluation metrics and synthetic data generation to enhance Retrieval-Augmented Generation (RAG) performance in technical domains, revealing that optimal chunk sizes and embedding strategies vary significantly across specialized corpora like finance, biomedical, and cybersecurity documents.
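A token-level IoU of the kind the framework reports can be sketched over (start, end) token spans as below. The paper's Precision $\Omega$ and IoU definitions may differ in detail, so this only conveys the granularity of the metric.

```python
def token_iou(retrieved_spans, reference_spans):
    """Token-level Intersection-over-Union between retrieved and reference
    spans. Spans are (start, end) token index ranges (end exclusive) and
    may be discontinuous."""
    def to_token_set(spans):
        tokens = set()
        for start, end in spans:
            tokens.update(range(start, end))
        return tokens

    retrieved = to_token_set(retrieved_spans)
    reference = to_token_set(reference_spans)
    if not retrieved and not reference:
        return 1.0
    return len(retrieved & reference) / len(retrieved | reference)

# e.g. token_iou([(10, 25)], [(18, 40), (60, 65)]) -> 7 / 35 = 0.2
```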
Authors:Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang
Abstract:
In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Experts (MoE) has been introduced for incorporating multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenarios. However, the mixture of LoRAs (MoE-LoRA) still exhibits low robustness during tuning and inference. Inspired by the Riemannian Preconditioners which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA, to stabilize and boost its feature learning procedure by multi-space projections. Experiments with SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.
中文: 针对混合专家低秩适配器(MoE-LoRA)在调优和推理中稳定性不足的问题,提出了一种基于多空间投影的新训练策略,通过在SGD和AdamW优化器上的测试验证了其有效性。
English: To enhance the robustness and feature learning of Mixture-of-Experts Low-Rank Adapters (MoE-LoRA), which face stability issues during tuning and inference, a novel training strategy using multi-space projections is proposed and validated on optimizers like SGD and AdamW.
Authors:Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang
Abstract:
Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.
中文摘要:大型语言模型在演绎推理方面表现出色,但在归纳推理方面存在明显不足,新基准测试InductionBench显示,即使是最先进的模型也难以从数据中推断出基本规则。
English Summary: Large language models excel in deductive reasoning but struggle with inductive reasoning, as shown by the new benchmark InductionBench, which reveals their difficulty in inferring rules from data despite their advanced capabilities.
Authors:Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper
Abstract:
Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS, consistently reducing overconfidence while preserving richer feature representations. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
中文: 标签平滑会导致错误预测的过度自信和特征压缩,引发表示坍塌,而提出的最大抑制方法通过均匀正则化预测,恢复特征多样性并减少过度自信。
English: Label Smoothing is found to induce overconfidence in errors and compress features, leading to representation collapse, while the proposed Max Suppression method uniformly regularizes predictions to restore feature diversity and reduce overconfidence.
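The proposed regularizer can be sketched as a cross-entropy term plus a penalty on the top-1 logit relative to the mean logit, which applies equally to correct and incorrect predictions. The coefficient `alpha` and the exact parameterization are assumptions.

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits, labels, alpha=0.1):
    """Sketch of a Max Suppression-style objective: regularize the top-1
    logit rather than the ground-truth logit, so correct and incorrect
    predictions are penalized uniformly."""
    ce = F.cross_entropy(logits, labels)
    # suppress the largest logit relative to the mean logit of each sample
    penalty = logits.max(dim=-1).values - logits.mean(dim=-1)
    return ce + alpha * penalty.mean()
```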
Authors:Yanyang Li, Michael Lyu, Liwei Wang
Abstract:
Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
中文: 大语言模型常需迭代反馈解决复杂任务,本文提出的FTTT范式及其可学习优化器OpTune将反馈利用构建为测试时优化问题,有效克服现有方法的局限,在实验中展现出卓越的扩展性和性能。
English: Large language models often require iterative feedback to solve complex tasks, and the proposed FTTT paradigm with its learnable optimizer OpTune effectively addresses existing limitations by treating feedback utilization as a test-time optimization problem, achieving superior scalability and performance in experiments.
Authors:Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar
Abstract:
While a number of existing approaches for building foundation model agents rely on prompting or fine-tuning with human demonstrations, these approaches are often insufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in an environment is often undesirable in truly open-ended agentic problems such as mobile device control or interacting with humans, where each unit of interaction is associated with a cost. In such scenarios, a method for policy learning that can utilize off-policy experience by learning a trained action-value function is much more effective. In this paper, we develop an approach, called Digi-Q, to train VLM-based action-value Q-functions which are then used to extract the agent policy. We study our approach in the mobile device control setting. Digi-Q trains the Q-function using offline temporal-difference (TD) learning, on top of frozen, intermediate-layer features of a VLM. Compared to fine-tuning the whole VLM, this approach saves us compute and enhances scalability. To make the VLM features amenable to representing the Q-function, we need to employ an initial phase of fine-tuning to amplify coverage over actionable information needed for the value function. Once trained, we use this Q-function via a Best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy as ranked by the value function, enabling policy improvement without environment interaction. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild, attaining a 21.2% improvement over the prior best-performing method. In some cases, our Digi-Q approach already matches state-of-the-art RL methods that require interaction. The project is open-sourced at https://github.com/DigiRL-agent/digiq
中文摘要:Digi-Q方法通过离线时序差分学习训练基于视觉语言模型的Q函数,无需环境交互即可实现策略改进,在移动设备控制任务中取得了显著性能提升。
English Summary: The Digi-Q approach trains vision-language model-based Q-functions using offline temporal-difference learning to enable policy improvement without environment interaction, achieving significant performance gains in mobile device control tasks.
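The Best-of-N policy-extraction operator can be sketched as follows, with `policy_sample` and `q_value` treated as assumed callables rather than parts of the released digiq codebase.

```python
import torch

def best_of_n_targets(policy_sample, q_value, observations, n=16):
    """For each observation, draw n candidate actions from the current
    policy, rank them with the learned Q-function, and keep the best one
    as an imitation target."""
    targets = []
    for obs in observations:
        candidates = [policy_sample(obs) for _ in range(n)]
        scores = torch.tensor([q_value(obs, a) for a in candidates])
        targets.append(candidates[int(scores.argmax())])
    return targets

# the policy is then updated with supervised learning on (obs, target) pairs,
# so improvement happens without any new environment interaction.
```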
Authors:Leonardo Berti, Gjergji Kasneci
Abstract:
Price Trend Prediction (PTP) based on Limit Order Book (LOB) data is a fundamental challenge in financial markets. Despite advances in deep learning, existing models fail to generalize across different market conditions and assets. Surprisingly, by adapting a simple MLP-based architecture to LOB, we show that we surpass SoTA performance; thus, challenging the necessity of complex architectures. Unlike past work that shows robustness issues, we propose TLOB, a transformer-based model that uses a dual attention mechanism to capture spatial and temporal dependencies in LOB data. This allows it to adaptively focus on the market microstructure, making it particularly effective for longer-horizon predictions and volatile market conditions. We also introduce a new labeling method that improves on previous ones, removing the horizon bias. We evaluate TLOB's effectiveness across four horizons, using the established FI-2010 benchmark, a NASDAQ and a Bitcoin dataset. TLOB outperforms SoTA methods in every dataset and horizon. Additionally, we empirically show how stock price predictability has declined over time, -6.68 in F1-score, highlighting the growing market efficiency. Predictability must be considered in relation to transaction costs, so we experimented with defining trends using an average spread, reflecting the primary transaction cost. The resulting performance deterioration underscores the complexity of translating trend classification into profitable trading strategies. We argue that our work provides new insights into the evolving landscape of stock price trend prediction and sets a strong foundation for future advancements in financial AI. We release the code at https://github.com/LeonardoBerti00/TLOB.
中文: TLOB模型通过双注意力机制的Transformer架构,在限价订单簿数据中捕捉时空依赖性,超越了现有最优方法,在不同市场和资产的价格趋势预测中表现卓越。
English: The TLOB model, utilizing a dual attention transformer, surpasses state-of-the-art methods in price trend prediction by effectively capturing spatial and temporal dependencies in limit order book data across various market conditions and assets.
Authors:Sewoong Oh, Himanshu Tyagi, Pramod Viswanath
Abstract:
Loyal AI is loyal to the community that builds it. An AI is loyal to a community if the community has ownership, alignment, and control. Community owned models can only be used with the approval of the community and share the economic rewards communally. Community aligned models have values that are aligned with the consensus of the community. Community controlled models perform functions designed by the community. Since we would like permissionless access to the loyal AI's community, we need the AI to be open source. The key scientific question then is: how can we build models that are openly accessible (open source) and yet are owned and governed by the community? This seeming impossibility is the focus of this paper where we outline a concrete pathway to Open, Monetizable and Loyal models (OML), building on our earlier work on OML (arXiv:2411.03887) and a representation via a cryptographic-ML library: http://github.com/sentient-agi/oml-1.0-fingerprinting.
中文: 本文提出了一个开发开放、可盈利且忠诚(OML)AI模型的框架,通过密码学与机器学习相结合,实现开源模型在社区所有权、价值观对齐和功能控制下的协同治理。
English: This paper proposes a framework for developing Open, Monetizable, and Loyal (OML) AI models that are both open-source and governed by community ownership, alignment, and control, addressing the challenge through cryptographic-ML integration.
Authors:Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong
Abstract:
Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients to retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. Besides, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at https://github.com/zongkaiz/FLEKE.
Chinese: FLEKE提出了一种联邦式知识编辑方法,允许多个客户端在保护隐私的同时协作更新大语言模型,并通过优化知识向量复用显著减少冗余计算。
English: FLEKE introduces a federated approach to knowledge editing that enables multiple clients to collaboratively update LLMs while preserving privacy and reducing redundant computations through optimized MKV reuse.
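As a rough illustration of the cosine-similarity retrieval step, the sketch below scores shared mediator knowledge vectors (MKVs) against a local query MKV and keeps the closest matches; the tensor layout, threshold, and function name are assumptions rather than the FedEdit implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_mkvs(query_mkv, shared_mkvs, top_k=5, threshold=0.8):
    """Return indices of shared MKVs whose cosine similarity to the local
    query MKV is highest and above a threshold (illustrative sketch)."""
    # shared_mkvs: (N, d) matrix of MKVs uploaded by other clients; query_mkv: (d,).
    sims = F.cosine_similarity(query_mkv.unsqueeze(0), shared_mkvs, dim=-1)
    scores, idx = sims.topk(min(top_k, shared_mkvs.size(0)))
    keep = scores >= threshold
    return idx[keep], scores[keep]
```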
Authors:Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
Abstract:
We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing (using just 1.5% of FLOPs) can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.
中文: 探针剪枝是一种动态批量剪枝框架,通过选择性探测关键标记和权重来优化大语言模型的结构化剪枝,无需额外模块或微调即可显著提升效率。
English: Probe Pruning is a dynamic, batch-wise framework that enhances structured pruning of Large Language Models by selectively probing key tokens and weights, significantly improving efficiency without extra modules or fine-tuning.
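The probing idea can be pictured with the simplified sketch below: a few hidden states with the largest residual norms are run through a layer's weight, and the resulting activation magnitudes stand in for channel importance. Scoring channels by mean absolute activation is an illustrative simplification of the PP importance score, not the paper's exact formula.

```python
import torch

def probe_channel_scores(hidden_states, weight, probe_frac=0.05):
    """Select a small probe of hidden states by residual norm and score the
    layer's output channels from the probe's activations (simplified sketch)."""
    # hidden_states: (batch, seq, d_in); weight: (d_out, d_in).
    flat = hidden_states.reshape(-1, hidden_states.size(-1))
    norms = flat.norm(dim=-1)
    n_probe = max(1, int(probe_frac * flat.size(0)))
    probe_idx = torch.argsort(norms, descending=True)[:n_probe]
    probe = flat[probe_idx]                              # (n_probe, d_in)
    # Channel importance estimated from the probe's pre-activations.
    return (probe @ weight.T).abs().mean(dim=0)          # (d_out,)
```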
Authors:Jixiu Zhai, Zikun Wang, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang
Abstract:
Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses, including dimensionality reduction and comparison studies, PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and a substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub: https://github.com/fondress/PDeepPP and Hugging Face: https://huggingface.co/fondress/PDeppPP.
中文: PDeepPP是一个统一的深度学习框架,结合预训练蛋白质语言模型与混合Transformer-卷积架构,在多种生物识别任务中实现了最先进的性能,能准确识别各类生物活性肽和翻译后修饰位点。
English: PDeepPP is a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, achieving state-of-the-art performance in identifying diverse bioactive peptides and post-translational modifications across multiple biological tasks.
Authors:Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Abstract:
Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code is released at https://github.com/zjunlp/LightThinker.
中文摘要:LightThinker是一种创新方法,通过将推理过程中的中间思维动态压缩为紧凑表征,在保持准确性的同时显著降低内存使用和推理时间。
English Summary: LightThinker is a novel method that enhances LLM efficiency by dynamically compressing intermediate reasoning steps into compact representations, reducing memory usage and inference time while preserving accuracy.
Authors:Jinda Liu, Yi Chang, Yuan Wu
Abstract:
Fine-tuning large language models (LLMs) is computationally expensive, and Low-Rank Adaptation (LoRA) provides a cost-effective solution by approximating weight updates through low-rank matrices. In real-world scenarios, LLMs are fine-tuned on data from multiple domains to perform tasks across various fields, embodying multi-task learning (MTL). LoRA often underperforms in such complex scenarios. To enhance LoRA's capability in multi-task learning, we propose R-LoRA, which incorporates Multi-Head Randomization. Multi-Head Randomization diversifies the head matrices through Multi-Head Dropout and Multi-Head Random Initialization, enabling more efficient learning of task-specific features while maintaining shared knowledge representation. Our approach not only improves performance in MTL but also reduces GPU memory usage and training time. Experiments show that R-LoRA's gains stem from increased diversity in the head matrices, demonstrating its effectiveness for multi-task learning. The code is available at https://github.com/jinda-liu/R-LoRA
中文: R-LoRA通过引入多头随机化技术增强多头矩阵的多样性,在提升多任务学习性能的同时有效降低了计算资源消耗。
English: R-LoRA enhances multi-task learning by introducing Multi-Head Randomization to diversify head matrices, improving performance while reducing computational costs.
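A minimal sketch of what a multi-head LoRA layer with the described randomization might look like: a shared down-projection A feeds several head matrices B_i that are randomly initialized and perturbed by dropout. Averaging the heads and the specific initialization scale are illustrative choices, not necessarily R-LoRA's exact design.

```python
import torch
import torch.nn as nn

class MultiHeadLoRA(nn.Module):
    """Multi-head LoRA sketch with random head initialization and per-head dropout."""

    def __init__(self, d_in, d_out, rank=8, n_heads=4, p_drop=0.1):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)      # shared down-projection
        self.heads = nn.ModuleList(
            [nn.Linear(rank, d_out, bias=False) for _ in range(n_heads)]
        )
        # Multi-head random initialization: each head starts from its own
        # random weights to encourage diverse task-specific features.
        for h in self.heads:
            nn.init.normal_(h.weight, std=0.02)
        self.drop = nn.Dropout(p_drop)                   # multi-head dropout

    def forward(self, x):
        z = self.A(x)
        # Combine the dropout-perturbed head outputs (averaged here for simplicity).
        return torch.stack([h(self.drop(z)) for h in self.heads]).mean(dim=0)
```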
Authors:Yuan Sun
Abstract:
For pre-training of MoE (Mixture-of-Experts) models, one of the main issues is unbalanced expert loads, which may cause routing collapse or increased computational overhead. Existing methods include the Loss-Controlled method and the Loss-Free method; with both, the degree of imbalance during the first several training steps remains high and decreases slowly. In this work, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q on each MoE layer that adjusts the top-K ordering of the routing scores s by solving a binary integer program at very small time cost. We implement the algorithm on two MoE language models: 16-expert (0.3B) and 64-expert (1.1B). The experimental results show that on both models, compared with the Loss-Controlled method and the Loss-Free method, our algorithm trains models with the lowest perplexities while saving at least 13% of pre-training time relative to the Loss-Controlled method. To the best of our knowledge, this is the first routing algorithm that maintains load balance on every expert in every MoE layer from the first step to the last step of pre-training, while the trained MoE models also perform well. The code material of this work is available at https://github.com/sunyuanLLM/bip_routing_algorithm.
中文: 本文提出了一种基于二进制整数规划的专家负载均衡算法,该算法从训练第一步起即可解决MoE模型中专家负载不均衡的问题,在获得最低困惑度的同时,相比现有方法至少节省13%的预训练时间。
English: This paper introduces a BIP-based expert load balancing algorithm that effectively resolves unbalanced expert loads in MoE models from the first training step, achieving the lowest perplexities and reducing pre-training time by at least 13% compared to existing methods.
Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
Abstract:
Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at https://github.com/CERT-Lab/fed-sb.
中文: Fed-SB 提出了一种基于 LoRA-SB 的高效联邦微调方法,通过学习小型矩阵实现精确更新,通信成本降低高达 230 倍,并在多项推理任务中达到最优性能。
English: Fed-SB introduces an efficient federated fine-tuning method using LoRA-SB, which reduces communication costs by up to 230x and achieves top performance across reasoning tasks by learning a small matrix for exact updates.
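Because the update is linear in the small square matrix R while B and A stay fixed, averaging the clients' R matrices gives an exact aggregate update. The sketch below illustrates this aggregation and the resulting merged weight delta; shapes and names are assumptions for illustration.

```python
import torch

def aggregate_r(client_rs):
    """Average the per-client (r, r) matrices R; exact because B and A are frozen."""
    return torch.stack(client_rs).mean(dim=0)

def merged_delta(B, R_avg, A, scale=1.0):
    """Effective low-rank weight update implied by the aggregate: scale * B @ R @ A.
    B: (d_out, r), R_avg: (r, r), A: (r, d_in)."""
    return scale * (B @ R_avg @ A)
```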
Authors:Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush Varshney
Abstract:
As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at https://github.com/IBM/Adversarial-Prompt-Evaluation.
Chinese: 随着大语言模型面临日益多样化的越狱攻击威胁,本研究系统评估了15种防御方法,发现其性能存在显著差异,并证明简单基线方法在应对分布外攻击时能与前沿防御技术相媲美。
English: As large language models face growing threats from diverse jailbreak attacks, this study systematically benchmarks 15 defense methods and reveals significant performance variations, demonstrating that simple baselines can match state-of-the-art defenses against out-of-distribution attacks.
Authors:Longde Huang, Oleksandr Balabanov, Hampus Linander, Mats Granath, Daniel Persson, Jan E. Gerken
Abstract:
Equivariant network architectures are a well-established tool for predicting invariant or equivariant quantities. However, almost all learning problems considered in this context feature a global symmetry, i.e. each point of the underlying space is transformed with the same group element, as opposed to a local ``gauge'' symmetry, where each point is transformed with a different group element, exponentially enlarging the size of the symmetry group. Gauge equivariant networks have so far mainly been applied to problems in quantum chromodynamics. Here, we introduce a novel application domain for gauge-equivariant networks in the theory of topological condensed matter physics. We use gauge equivariant networks to predict topological invariants (Chern numbers) of multiband topological insulators. The gauge symmetry of the network guarantees that the predicted quantity is a topological invariant. We introduce a novel gauge equivariant normalization layer to stabilize the training and prove a universal approximation theorem for our setup. We train on samples with trivial Chern number only but show that our models generalize to samples with non-trivial Chern number. We provide various ablations of our setup. Our code is available at https://github.com/sitronsea/GENet/tree/main.
中文摘要:本研究将规范等变神经网络创新应用于拓扑凝聚态物理领域,通过规范对称性保证预测的拓扑不变量(陈数)准确性,并在仅使用平凡陈数样本训练的情况下实现了对非平凡陈数样本的泛化预测。
English Summary: The study introduces a novel application of gauge-equivariant neural networks in topological condensed matter physics, enabling accurate prediction of Chern numbers for topological insulators with guaranteed topological invariance through gauge symmetry.
Authors:Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
Abstract:
Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.
Chinese: AttentionEngine 是一个综合性框架,可在多样化硬件平台上自动优化注意力机制,实现高达10倍的性能提升,同时极大减少了人工调优需求。
English: AttentionEngine is a comprehensive framework that automates and optimizes attention mechanisms across diverse hardware platforms, achieving up to 10x performance improvements with minimal manual intervention.
Authors:Jinyu Zhang, Chao Li, Zhongying Zhao
Abstract:
Graph-based Sequential Recommender systems (GSRs) have gained significant research attention due to their ability to simultaneously handle user-item interactions and sequential relationships between items. Current GSRs often utilize composite or in-depth structures for graph encoding (e.g., the Graph Transformer). Nevertheless, they have high computational complexity, hindering the deployment on resource-constrained edge devices. Moreover, the relative position encoding in the Graph Transformer has difficulty capturing the complicated positional dependencies within a sequence. To this end, we propose an External Attentive Graph convolutional network with Positional prompts for Sequential recommendation, namely EA-GPS. Specifically, we first introduce an external attentive graph convolutional network that linearly measures the global associations among nodes via two external memory units. Then, we present a positional prompt-based decoder that explicitly treats the absolute item positions as external prompts. By introducing length-adaptive sequential masking and a soft attention network, such a decoder enables the model to capture the long-term positional dependencies and contextual relationships within sequences. Extensive experimental results on five real-world datasets demonstrate that the proposed EA-GPS outperforms the state-of-the-art methods. Remarkably, it achieves superior performance while maintaining a smaller parameter size and lower training overhead. The implementation of this work is publicly available at https://github.com/ZZY-GraphMiningLab/EA-GPS.
中文摘要:提出的EA-GPS模型通过外部注意力图卷积网络和位置提示解码器,解决了图序列推荐系统计算复杂和位置依赖建模的难题,在减少参数量的同时实现了更优性能。
English Summary: The proposed EA-GPS model introduces an external attentive graph convolutional network and positional prompt decoder to address computational complexity and positional dependency limitations in graph-based sequential recommenders, achieving superior performance with reduced parameters.
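The external attention component can be sketched as two learnable memory units applied to the node features with double normalization, which keeps the cost linear in the number of nodes. This is a generic external-attention sketch under that reading of the abstract, not the released EA-GPS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """Linear-complexity attention via two external memory units (illustrative sketch)."""

    def __init__(self, d_model, n_memory=64):
        super().__init__()
        self.mem_k = nn.Linear(d_model, n_memory, bias=False)   # external key memory
        self.mem_v = nn.Linear(n_memory, d_model, bias=False)   # external value memory

    def forward(self, x):                       # x: (batch, nodes, d_model)
        attn = self.mem_k(x)                    # (batch, nodes, n_memory)
        attn = F.softmax(attn, dim=1)           # normalize over nodes
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)   # second normalization
        return self.mem_v(attn)                 # (batch, nodes, d_model)
```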
Authors:Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo
Abstract:
Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.
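One way to picture hyperspherical normalization is a linear layer whose weight rows and incoming features are both projected onto the unit sphere before the matmul, which bounds weight and feature norms during non-stationary RL training. The sketch below is an illustrative reading of that idea, not the authors' exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphericalLinear(nn.Module):
    """Linear layer with unit-norm weight rows and unit-norm inputs (illustrative)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)

    def forward(self, x):
        w = F.normalize(self.weight, dim=-1)   # project weight rows onto the unit sphere
        x = F.normalize(x, dim=-1)             # project features onto the unit sphere
        return F.linear(x, w)
```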
Authors:Nicholas DiSalvo
Abstract:
Image Steganography is a cryptographic technique that embeds secret information into an image, ensuring the hidden data remains undetectable to the human eye while preserving the image's original visual integrity. Least Significant Bit (LSB) Steganography achieves this by replacing the k least significant bits of an image with the k most significant bits of a secret image, maintaining the appearance of the original image while simultaneously encoding the essential elements of the hidden data. In this work, we shift away from conventional applications of steganography in deep learning and explore its potential from a new angle. We present experimental results on CIFAR-10 showing that LSB Steganography, when used as a data augmentation strategy for downstream computer vision tasks such as image classification, can significantly improve the training efficiency of deep neural networks. It can also act as an implicit, uniformly discretized piecewise linear approximation of color augmentations such as brightness, contrast, hue, and saturation, without introducing additional training overhead, through a new joint image training regime that removes the need to tune sensitive augmentation hyperparameters.
中文: 本研究证明,最低有效位隐写术可作为计算机视觉任务的有效数据增强方法,在无需额外调整超参数的情况下,既提升了深度神经网络的训练效率,又实现了对色彩增强的隐式分段线性逼近。
English: This study demonstrates that Least Significant Bit Steganography serves as an effective data augmentation method for computer vision tasks, enhancing deep neural network training efficiency while approximating color augmentations without extra hyperparameter tuning.
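The embedding rule is simple enough to state in a few lines: the k least significant bits of each cover pixel are overwritten with the k most significant bits of the corresponding secret pixel. The sketch below assumes 8-bit images stored as NumPy uint8 arrays of identical shape.

```python
import numpy as np

def lsb_embed(cover, secret, k=4):
    """Hide the top k bits of `secret` in the bottom k bits of `cover` (uint8 arrays)."""
    cover_hi = (cover >> k) << k        # zero out the cover's k least significant bits
    secret_hi = secret >> (8 - k)       # take the secret's k most significant bits
    return cover_hi | secret_hi

def lsb_extract(stego, k=4):
    """Recover an approximation of the secret by shifting the hidden bits back up."""
    return (stego & ((1 << k) - 1)) << (8 - k)
```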
Authors:Madhurima Chakraborty, Peter Pirkelbauer, Qing Yi
Abstract:
FormalSpecCpp is a dataset designed to fill the gap in standardized benchmarks for verifying formal specifications in C++ programs. To the best of our knowledge, this is the first comprehensive collection of C++ programs with well-defined preconditions and postconditions. It provides a structured benchmark for evaluating specification inference tools and testing the accuracy of generated specifications. Researchers and developers can use this dataset to benchmark specification inference tools, fine-tune Large Language Models (LLMs) for automated specification generation, and analyze the role of formal specifications in improving program verification and automated testing. By making this dataset publicly available, we aim to advance research in program verification, specification inference, and AI-assisted software development. The dataset and the code are available at https://github.com/MadhuNimmo/FormalSpecCpp.
中文: FormalSpecCpp是首个包含形式化规范的C++程序综合数据集,旨在为规范推断工具提供基准测试,并推动程序验证和AI辅助开发的研究。
English: FormalSpecCpp is the first comprehensive dataset of C++ programs with formal specifications, designed to benchmark specification inference tools and advance research in program verification and AI-assisted development.
Authors:Ebenezer Tarubinga, Jenifer Kalafatovich, Seong-Whan Lee
Abstract:
Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilizing large amounts of unlabeled data with limited labeled samples. Existing methods often suffer from coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by limited boundary-awareness and ambiguous edge cues. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS), remain underutilized in SSSS. Specifically, our method: (1) reduces coupling via a confidence-weighted loss that adjusts pseudo-label influence based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) tackles boundary blur using a boundary-aware module to refine segmentation near object edges, and (4) reduces label noise through a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on Pascal VOC 2012 and Cityscapes demonstrate that CW-BASS achieves state-of-the-art performance. Notably, CW-BASS achieves a 65.9% mIoU on Cityscapes under a challenging and underexplored 1/30 (3.3%) split (100 images), highlighting its effectiveness in limited-label settings. Our code is available at https://github.com/psychofict/CW-BASS.
中文: CW-BASS框架通过引入置信度加权的伪标签和边界描绘技术,有效解决了半监督语义分割中的耦合、确认偏差和边界模糊问题,在有限标注数据下于多个基准数据集上取得了领先性能。
English: The CW-BASS framework enhances semi-supervised semantic segmentation by introducing confidence-weighted pseudo-labeling and boundary-delineation techniques to address coupling, confirmation bias, and boundary blur issues, achieving state-of-the-art results on benchmark datasets with limited labeled data.
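The confidence-weighted loss can be pictured as below: each unlabeled pixel's cross-entropy is scaled by its predicted confidence and masked out when confidence falls below a threshold (fixed here for simplicity; the paper's threshold is dynamic). Shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, pseudo_labels, confidences, threshold=0.7):
    """Confidence-weighted pseudo-label cross-entropy for segmentation (sketch).
    logits: (B, C, H, W); pseudo_labels: (B, H, W); confidences: (B, H, W) in [0, 1]."""
    per_pixel = F.cross_entropy(logits, pseudo_labels, reduction="none")  # (B, H, W)
    mask = (confidences >= threshold).float()       # drop low-confidence pixels
    weighted = confidences * mask * per_pixel       # scale the rest by confidence
    return weighted.sum() / mask.sum().clamp(min=1.0)
```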
Authors:Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
Abstract:
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
中文:UPCORE框架通过选择性修剪待遗忘数据中的异常值,在不同遗忘方法中实现了删除效果与模型效用保持的最佳平衡。
English: The proposed UPCORE framework selectively prunes outliers from data to be forgotten, effectively balancing deletion efficacy and model utility preservation across various unlearning methods.
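A simplified view of the coreset selection: points in the forget set whose hidden representations lie far from the set's centroid are treated as variance-driving outliers and pruned, and the remaining core becomes the actual unlearning target. Distance to the centroid is an illustrative stand-in for the paper's variance-based criterion.

```python
import torch

def select_forget_coreset(representations, keep_ratio=0.8):
    """Keep the fraction of forget-set points closest to the centroid (sketch).
    representations: (N, d) hidden states of the forget-set points."""
    centroid = representations.mean(dim=0, keepdim=True)
    dists = (representations - centroid).norm(dim=-1)
    n_keep = max(1, int(keep_ratio * representations.size(0)))
    return torch.argsort(dists)[:n_keep]    # indices of the retained core
```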
Authors:Mohsen Hariri, Alan Luo, Mohammadreza Nemati, Lam Nguyen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Abstract:
Large Language Models (LLMs) have introduced significant advancements to the capabilities of Natural Language Processing (NLP) in recent years. However, as these models continue to scale in size, memory constraints pose a substantial challenge. Key and Value cache (KV cache) quantization has been well-documented as a promising solution to this limitation. In this work, we provide two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information compared to the value weight matrices, as evidenced by higher spectral and Frobenius norms across most of the layers. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values induces significant improvements to the overall quantization performance. In particular, assigning greater precision to the keys compared to the values achieves a higher degree of precision reduction with minimal impact on model accuracy. We validate these theorems through theory and extensive experiments on several state-of-the-art LLM architectures and benchmarks. These findings offer valuable guidelines for improving KV cache quantization strategies, facilitating more efficient memory utilization without compromising model performance across diverse NLP tasks. Source code is available at https://github.com/mohsenhariri/spectral-kv.
中文: 本文针对大语言模型中的KV缓存量化提出两项创新理论,证明在量化过程中优先处理键而非值能在保持模型精度的同时显著提升内存使用效率。
English: This paper introduces two novel theorems for enhancing KV cache quantization in large language models, demonstrating that prioritizing key quantization over value quantization improves memory efficiency while maintaining model accuracy.
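In code, key-driven quantization amounts to allocating more bits to the key cache than to the value cache. The sketch below uses plain per-channel uniform quantization with illustrative bit widths; it is not the authors' kernel.

```python
import torch

def quantize_per_channel(x, n_bits):
    """Uniform symmetric per-channel fake-quantization (illustrative helper)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def quantize_kv(keys, values, key_bits=4, value_bits=2):
    """Give the key cache higher precision than the value cache (sketch)."""
    return quantize_per_channel(keys, key_bits), quantize_per_channel(values, value_bits)
```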
Authors:Tong Zhao, Yozen Liu, Matthew Kolodner, Kyle Montemayor, Elham Ghazizadeh, Ankit Batra, Zihao Fan, Xiaobin Gao, Xuan Guo, Jiwen Ren, Serim Park, Peicheng Yu, Jun Yu, Shubham Vij, Neil Shah
Abstract:
Recent advances in graph machine learning (ML) with the introduction of Graph Neural Networks (GNNs) have led to a widespread interest in applying these approaches to business applications at scale. GNNs enable differentiable end-to-end (E2E) learning of model parameters given graph structure which enables optimization towards popular node, edge (link) and graph-level tasks. While the research innovation in new GNN layers and training strategies has been rapid, industrial adoption and utility of GNNs has lagged considerably due to the unique scale challenges that large-scale graph ML problems create. In this work, we share our approach to training, inference, and utilization of GNNs at Snapchat. To this end, we present GiGL (Gigantic Graph Learning), an open-source library to enable large-scale distributed graph ML to the benefit of researchers, ML engineers, and practitioners. We use GiGL internally at Snapchat to manage the heavy lifting of GNN workflows, including graph data preprocessing from relational DBs, subgraph sampling, distributed training, inference, and orchestration. GiGL is designed to interface cleanly with open-source GNN modeling libraries prominent in academia like PyTorch Geometric (PyG), while handling scaling and productionization challenges that make it easier for internal practitioners to focus on modeling. GiGL is used in multiple production settings, and has powered over 35 launches across multiple business domains in the last 2 years in the contexts of friend recommendation, content recommendation and advertising. This work details high-level design and tools the library provides, scaling properties, case studies in diverse business settings with industry-scale graphs, and several key lessons learned in employing graph ML at scale on large social data. GiGL is open-sourced at https://github.com/Snapchat/GiGL.
中文: 图神经网络(GNNs)的最新进展激发了商业应用兴趣,但工业应用因规模挑战而滞后,因此Snapchat开发了开源库GiGL,以支持大规模分布式图机器学习在生产中的运用。
English: Recent advances in Graph Neural Networks (GNNs) have spurred interest in business applications, but industrial adoption lags due to scaling challenges, leading to the development of GiGL, an open-source library by Snapchat that facilitates large-scale distributed graph machine learning for production use.
Authors:Masatoshi Uehara, Xingyu Su, Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, Tommaso Biancalani
Abstract:
To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models inspired by evolutionary algorithms. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Besides, we provide a theoretical guarantee for our framework. Finally, we demonstrate its superior empirical performance in protein and cell-type-specific regulatory DNA design. The code is available at \href{https://github.com/masa-ue/ProDifEvo-Refinement}{https://github.com/masa-ue/ProDifEvo-Refinement}.
中文摘要:本文提出了一种受进化算法启发的创新框架,通过在扩散模型推理过程中采用迭代式的加噪和奖励引导去噪步骤,实现误差的渐进修正,并在生物序列设计中展现出卓越性能。
English Summary: This paper introduces a novel evolutionary algorithm-inspired framework for optimizing reward functions during diffusion model inference, featuring iterative noising and reward-guided denoising steps that enable gradual error correction and demonstrate superior performance in biological sequence design.
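The refinement loop itself is compact: each round partially re-noises the current samples and then denoises them with reward guidance, so errors introduced in earlier rounds can be corrected later. The two callables below are assumptions standing in for the diffusion model's partial noising step and its reward-guided denoiser, not the authors' API.

```python
def iterative_refinement(samples, noise_step, reward_guided_denoise, n_rounds=10):
    """Evolution-inspired inference-time refinement loop (hypothetical interfaces)."""
    for _ in range(n_rounds):
        noised = noise_step(samples)             # partial forward (noising) step
        samples = reward_guided_denoise(noised)  # reward-guided reverse (denoising) step
    return samples
```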
Authors:Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han
Abstract:
Molecular docking that predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking (pocket identification, ligand conformation prediction, and protein flexibility modeling) into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208 $\times$) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex.
中文: FABFlex是一种快速准确的回归式多任务学习模型,通过整合口袋预测、配体对接和口袋对接模块,并采用迭代更新机制解决盲柔性对接难题,在预测精度上表现卓越且计算速度比现有方法快208倍。
English: FABFlex is a fast and accurate regression-based multi-task learning model that integrates pocket prediction, ligand docking, and pocket docking with an iterative update mechanism to address challenges in blind flexible docking, achieving superior accuracy and a 208× speed advantage over existing methods.
Authors:Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
Abstract:
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks, yet efficiently serving these models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context and reasoning capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
中文: LServe系统通过混合稀疏注意力机制高效加速长序列大语言模型服务,在预填充阶段提速最高达2.9倍、解码阶段提速1.3-2.1倍,同时通过跳过次要令牌计算和动态剪枝KV缓存页保持了长上下文处理精度。
English: LServe is an efficient system that accelerates long-sequence LLM serving through hybrid sparse attention, achieving up to 2.9x faster prefilling and 1.3-2.1x faster decoding while maintaining accuracy by skipping computations on less important tokens and dynamically pruning KV pages.
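The hierarchical page-selection idea can be sketched as follows: each KV page is summarized (here by its mean key), scored against the current query, and only a constant number of the highest-scoring pages is kept regardless of context length. The mean-key summary and dot-product score are illustrative simplifications.

```python
import torch

def select_kv_pages(query, page_keys, n_keep=16):
    """Query-centric KV page selection (sketch).
    query: (d,); page_keys: (num_pages, page_size, d)."""
    page_summary = page_keys.mean(dim=1)                      # (num_pages, d)
    scores = page_summary @ query                             # similarity per page
    return torch.topk(scores, k=min(n_keep, scores.numel())).indices
```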
Authors:Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Abstract:
Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models' capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. Our code is available at: \url{https://github.com/mbzuai-oryx/TimeTravel}.
中文摘要:TimeTravel基准提供了一个涵盖多文化历史文物的标准化数据集和评估框架,旨在推动人工智能在文化遗产分析和历史研究中的应用,确保技术进步对历史发现与保护做出实质性贡献。
English Summary: The TimeTravel benchmark introduces a comprehensive dataset and evaluation framework to enhance AI models' capabilities in analyzing historical artifacts, aiming to integrate AI as a reliable tool for cultural heritage preservation and historical research.
Authors:Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
Abstract:
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
Chinese: FR-Spec提出了一种基于频率排序的推测采样框架,通过压缩词汇空间来加速大语言模型生成,在保持输出质量的同时,将计算开销降低75%,并比现有最优方法提速1.12倍。
English: FR-Spec introduces a frequency-ranked speculative sampling framework that accelerates large language model generation by compressing vocabulary space, achieving a 75% reduction in computational overhead and a 1.12× speedup over existing methods while maintaining output quality.
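The core trick is that the draft model only scores a frequency-ranked subset of the vocabulary, shrinking its LM-head matmul, while the target model still verifies over the full vocabulary so the final distribution is unchanged. A sketch of the draft-side computation, with assumed shapes:

```python
import torch

def draft_subvocab_logits(hidden, lm_head_weight, frequent_ids):
    """Score only the frequency-ranked token subset in the draft model (sketch).
    hidden: (batch, d); lm_head_weight: (V, d); frequent_ids: (V_sub,) long tensor."""
    sub_head = lm_head_weight[frequent_ids]      # (V_sub, d) slice of the LM head
    sub_logits = hidden @ sub_head.T             # (batch, V_sub) draft scores
    return sub_logits, frequent_ids              # map positions back to real token ids
```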
Authors:Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
Abstract:
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available on GitHub at https://github.com/lmarena/p2l.
Chinese: 作者提出了Prompt-to-Leaderboard(P2L)方法,通过训练大语言模型根据自然语言提示预测人类偏好,生成针对特定提示的排行榜,从而实现个性化模型评估与查询路由,该方法优于传统聚合指标,并于2025年1月在Chatbot Arena排行榜上获得首位。
English: The authors introduce Prompt-to-Leaderboard (P2L), a method that generates prompt-specific leaderboards by training an LLM to predict human preferences, enabling personalized model evaluation and routing, which outperforms traditional aggregated metrics and achieved top ranking on Chatbot Arena in January 2025.
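Once the prompt-conditioned Bradley-Terry coefficients are predicted, the win probability between any two models on that prompt follows the standard Bradley-Terry form, a sigmoid of the coefficient difference. A one-line illustration with assumed indexing:

```python
import torch

def win_probability(bt_coeffs, model_a, model_b):
    """P(model_a beats model_b | prompt) under a Bradley-Terry model.
    bt_coeffs: per-prompt coefficient vector output by the P2L model (assumed 1-D tensor)."""
    return torch.sigmoid(bt_coeffs[model_a] - bt_coeffs[model_b])
```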
Authors:Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Abstract:
Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.
中文: Set-and-Sequence框架通过无序帧学习外观特征和序列训练捕捉运动动态的两阶段方法,实现了文本到视频模型中动态概念的个性化,为动态概念编辑设立了新标准。
English: The Set-and-Sequence framework introduces a novel two-stage approach for personalizing text-to-video models by first learning appearance through unordered frames and then capturing motion dynamics via sequential training, enabling unprecedented editability of dynamic concepts.
Authors:Alexia Jolicoeur-Martineau, Yan Zhang, Boris Knyazev, Aristide Baratin, Cheng-Hao Liu
Abstract:
Generating novel molecules with out-of-distribution properties is a major challenge in molecular discovery. While supervised learning methods generate high-quality molecules similar to those in a dataset, they struggle to generalize to out-of-distribution properties. Reinforcement learning can explore new chemical spaces but often conducts 'reward-hacking' and generates non-synthesizable molecules. In this work, we address this problem by integrating a state-of-the-art supervised learning method, STGG+, in an active learning loop. Our approach iteratively generates, evaluates, and fine-tunes STGG+ to continuously expand its knowledge. We denote this approach STGG+AL. We apply STGG+AL to the design of organic $\pi$-functional materials, specifically two challenging tasks: 1) generating highly absorptive molecules characterized by high oscillator strength and 2) designing absorptive molecules with reasonable oscillator strength in the near-infrared (NIR) range. The generated molecules are validated and rationalized in-silico with time-dependent density functional theory. Our results demonstrate that our method is highly effective in generating novel molecules with high oscillator strength, in contrast to existing methods such as reinforcement learning (RL). We open-source our active-learning code along with our Conjugated-xTB dataset containing 2.9 million $\pi$-conjugated molecules and the function for approximating the oscillator strength and absorption wavelength (based on sTDA-xTB).
中文: 本研究提出STGG+AL方法,通过将监督学习与主动学习循环结合,能有效生成具有高振荡器强度的新型分子,其性能优于传统强化学习方法。
English: The study introduces STGG+AL, an active learning approach that integrates supervised learning with iterative fine-tuning to effectively generate novel molecules with high oscillator strength, outperforming traditional reinforcement learning methods.
Authors:Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin
Abstract:
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$. The source code is available at https://github.com/snap-research/diffusability.
中文: 潜在扩散模型因自编码器潜在空间中的过度高频分量而影响生成质量,通过提出的尺度等变性正则化方法,仅需少量调整即可显著提升图像和视频生成的性能指标。
English: Latent diffusion models face quality issues due to high-frequency components in autoencoder latent spaces, which are mitigated by a proposed scale equivariance regularization that significantly improves image and video generation metrics with minimal adjustments.
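The regularizer can be read as a commutativity constraint: decoding a downscaled latent should match the downscaled decoding of the original latent. A minimal sketch of such a loss, with the scale factor, interpolation mode, and L1 penalty as assumptions:

```python
import torch
import torch.nn.functional as F

def scale_equivariance_loss(decoder, latent, factor=0.5):
    """Penalize the mismatch between decode-then-downscale and downscale-then-decode.
    latent: (N, C, H, W); decoder maps latents to RGB images (sketch)."""
    down = lambda x: F.interpolate(x, scale_factor=factor, mode="bilinear",
                                   align_corners=False)
    decoded_then_down = down(decoder(latent))
    down_then_decoded = decoder(down(latent))
    return F.l1_loss(down_then_decoded, decoded_then_down)
```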
Authors:Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, Rongrong Ji
Abstract:
Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it often incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, but it has shortcomings, including: 1) the inability to integrate LoRA weights into sparse LLMs post-training, and 2) insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Based on this, LoSA adjusts the rank of the LoRA module according to the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning budget to each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32$\%$, achieving a 2.60$\times$ speedup on CPU and 2.23$\times$ speedup on GPU, requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
中文: LoSA是一种创新方法,通过将低秩适应与大型语言模型稀疏化在统一框架中结合,在微调时动态调整稀疏度和秩,从而提升性能且不增加推理延迟。
English: LoSA is a novel method that integrates low-rank adaptation with LLM sparsity in a unified framework, enhancing performance without increasing inference latency by dynamically adjusting sparsity and rank during fine-tuning.
Authors:Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract:
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at https://github.com/thunlp/TritonBench.
中文: TritonBench作为首个全面的Triton代码生成基准,不仅评估功能正确性还关注性能效率,揭示了当前大型语言模型在生成优化GPU算子方面存在显著不足。
English: TritonBench is introduced as the first comprehensive benchmark for evaluating Triton code generation, focusing on both functional correctness and efficiency performance, revealing that current LLMs struggle to produce optimized GPU operators.
Authors:Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
Abstract:
In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently. Experiments on DCLM 400M-4x, 1B-1x, and 3B-1x show that Group-MATES achieves 3.5%-9.4% relative performance gains over random selection across 22 downstream tasks, nearly doubling the improvements achieved by state-of-the-art individual data selection baselines. Furthermore, Group-MATES reduces the number of tokens required to reach a certain downstream performance by up to 1.75x, substantially elevating the speed-quality frontier. Further analyses highlight the critical role of relationship weights in the relational data influence model and the effectiveness of our cluster-based inference. Our code is open-sourced at https://github.com/facebookresearch/Group-MATES.
Chinese: Group-MATES 提出了一种基于关系数据影响模型的高效群组级数据选择方法,用于优化语言模型预训练的速度与质量边界,实验表明其性能显著提升且所需标记数量最多减少1.75倍。
English: Group-MATES introduces an efficient group-level data selection method using a relational data influence model to optimize language model pretraining, achieving significant performance gains and reducing token requirements by up to 1.75x in experiments.
Authors:Angxiao Yue, Zichong Wang, Hongteng Xu
Abstract:
Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion and flow-based generative models provide potential solutions to this challenging task, they often generate proteins with undesired designability and suffer computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high-quality protein backbone generation. In particular, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, which represents each 3D rotation as a unit quaternion and constructs its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves on-par performance in protein backbone generation while requiring much fewer sampling steps and significantly less inference time (e.g., being 37x faster than RFDiffusion and 63x faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency. The code is available at https://github.com/AngxiaoYue/ReQFlow.
中文摘要:本研究提出ReQFlow方法,通过修正四元数流匹配技术,利用单位四元数高效建模三维旋转,实现了快速高质量的蛋白质骨架生成,在保持同等性能的同时大幅提升了生成效率。
English Summary: The study introduces ReQFlow, a rectified quaternion flow matching method that enables fast and high-quality protein backbone generation by efficiently modeling 3D rotations with unit quaternions, achieving comparable performance while being significantly faster than existing methods.
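As a small illustration of the rotation-flow ingredient above, the sketch below implements spherical linear interpolation (SLERP) between unit quaternions and walks a random "noise" rotation toward a target; it is a generic SLERP routine, not the released ReQFlow model.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                 # take the short arc on the 3-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

rng = np.random.default_rng(0)
q_noise = rng.normal(size=4); q_noise /= np.linalg.norm(q_noise)   # random rotation
q_target = np.array([1.0, 0.0, 0.0, 0.0])                          # identity rotation
path = [slerp(q_noise, q_target, t) for t in np.linspace(0.0, 1.0, 5)]
print(np.round(path[-1], 3))                                       # ends at the target
```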
Authors:Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han
Abstract:
Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing. In open-world scenarios, models often encounter noisy samples, i.e., samples outside the in-distribution (ID) label space. Leveraging the zero-shot capability of pre-trained vision-language models (VLMs), this paper introduces Zero-Shot Noisy TTA (ZS-NTTA), focusing on adapting the model to target data with noisy samples during test-time in a zero-shot manner. We find existing TTA methods underperform under ZS-NTTA, often lagging behind even the frozen model. We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Also, adapting a classifier for ID classification and noise detection hampers both sub-tasks. Built on this, we propose a framework that decouples the classifier and detector, focusing on developing an individual detector while keeping the classifier frozen. Technically, we introduce the Adaptive Noise Detector (AdaND), which utilizes the frozen model's outputs as pseudo-labels to train a noise detector. To handle clean data streams, we further inject Gaussian noise during adaptation, preventing the detector from misclassifying clean samples as noisy. Beyond the ZS-NTTA, AdaND can also improve the zero-shot out-of-distribution (ZS-OOD) detection ability of VLMs. Experiments show that AdaND outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of $8.32\%$ in harmonic mean accuracy ($\text{Acc}_\text{H}$) for ZS-NTTA and $9.40\%$ in FPR95 for ZS-OOD detection, compared to SOTA methods. Importantly, AdaND is computationally efficient and comparable to the model-frozen method. The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA.
中文: 本文提出零样本噪声测试时适应方法(ZS-NTTA),通过设计自适应噪声检测器(AdaND)框架,将分类器与检测器解耦,在保持计算效率的同时有效处理测试过程中的噪声样本。
English: This paper introduces Zero-Shot Noisy Test-Time Adaptation (ZS-NTTA), proposing the Adaptive Noise Detector (AdaND) framework that decouples classifier and detector functions to effectively handle noisy samples during testing while maintaining computational efficiency.
Authors:Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu
Abstract:
The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and similarity among source distributions improve the advantage of multi-source training. Simulations and real-world experiments are conducted to validate the theory, with code available at: https://github.com/ML-GSAI/Multi-Source-GM.
中文总结:本文首次对条件生成模型中的多源训练进行了严格理论分析,证明当源分布具有相似性且模型表达能力足够时,多源训练能获得比单源训练更优的误差界限。
English Summary: This paper provides the first rigorous theoretical analysis of multi-source training in conditional generative models, demonstrating that shared similarities among source distributions and model expressiveness yield sharper error bounds than single-source training.
Authors:Eric Egli, Matteo Manica, Jannis Born
Abstract:
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of $5$M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing images and the absence of an encoder, an MBLM with pure next token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm
中文摘要:多尺度字节语言模型(MBLM)提出分层解码器架构,可在单GPU上高效处理百万字节序列训练,通过纯下一词元预测在多模态任务中实现与定制模型相媲美的性能,无需专用编码器。
English Summary: The Multiscale Byte Language Model (MBLM) introduces a hierarchical decoder architecture enabling efficient training on million-byte sequences with standard GPUs, demonstrating competitive performance in multimodal tasks through pure next-token prediction without specialized encoders.
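A hedged toy sketch of the hierarchical idea: split a long bytestream into fixed-size patches, run a local model within each patch, and a global model over patch embeddings. Module choices, pooling, and the absence of causal masking are simplifications for illustration, not the released MBLM architecture.

```python
import torch
import torch.nn as nn

class ToyHierarchicalByteLM(nn.Module):
    """Two-level decoder sketch: a local byte model per patch, a global model over patches."""
    def __init__(self, d=128, patch=64, n_heads=4):
        super().__init__()
        self.patch = patch
        self.byte_emb = nn.Embedding(256, d)
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True), num_layers=2)
        self.global_ = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True), num_layers=2)
        self.head = nn.Linear(d, 256)

    def forward(self, bytestream):                    # (B, L), L divisible by the patch size
        B, L = bytestream.shape
        x = self.byte_emb(bytestream).reshape(B * L // self.patch, self.patch, -1)
        x = self.local(x)                             # intra-patch context (mask omitted)
        g = self.global_(x.mean(dim=1).reshape(B, L // self.patch, -1))  # inter-patch context
        x = x.reshape(B, L // self.patch, self.patch, -1) + g.unsqueeze(2)
        return self.head(x.reshape(B, L, -1))         # (B, L, 256) next-byte logits

print(ToyHierarchicalByteLM()(torch.randint(0, 256, (2, 256))).shape)  # [2, 256, 256]
```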
Authors:Jannik Irmai, Maximilian Moeller, Bjoern Andres
Abstract:
The NP-hard maximum value preordering problem is both a joint relaxation and a hybrid of the clique partition problem (a clustering problem) and the partial ordering problem. Toward approximate solutions and lower bounds, we introduce a linear-time 4-approximation algorithm that constructs a maximum dicut of a subgraph and define local search heuristics. Toward upper bounds, we tighten a linear program relaxation by the class of odd closed walk inequalities that define facets, as we show, of the preorder polytope. We contribute implementations of the algorithms, apply these to the task of jointly clustering and partially ordering the accounts of published social networks, and compare the output and efficiency qualitatively and quantitatively.
中文: 本研究针对最大价值预排序问题提出了4-近似算法和局部搜索启发式方法,并通过定义多面体的不等式加强线性规划松弛,最后将这些方法应用于社交网络分析的聚类和偏序任务。
English: This study presents a 4-approximation algorithm and local search heuristics for the maximum value preordering problem, along with tightened linear programming relaxations using facet-defining inequalities, and applies these methods to social network analysis.
Authors:Cristian A. Galvis-Florez, Ahmad Farooq, Simo Särkkä
Abstract:
The aim of this paper is to develop novel quantum algorithms for Gaussian process quadrature methods. Gaussian process quadratures are numerical integration methods where Gaussian processes are used as functional priors for the integrands to capture the uncertainty arising from the sparse function evaluations. Quantum computers have emerged as potential replacements for classical computers, offering exponential reductions in the computational complexity of machine learning tasks. In this paper, we combine Gaussian process quadratures and quantum computing by proposing a quantum low-rank Gaussian process quadrature method based on a Hilbert space approximation of the Gaussian process kernel and enhancing the quadrature using a quantum circuit. The method combines the quantum phase estimation algorithm with the quantum principal component analysis technique to extract information up to a desired rank. Then, Hadamard and SWAP tests are implemented to find the expected value and variance that determines the quadrature. We use numerical simulations of a quantum computer to demonstrate the effectiveness of the method. Furthermore, we provide a theoretical complexity analysis that shows a polynomial advantage over classical Gaussian process quadrature methods. The code is available at https://github.com/cagalvisf/Quantum_HSGPQ.
中文: 本文提出了一种新颖的高斯过程求积量子算法,通过结合量子相位估计与主成分分析技术,在模拟实验中验证了其计算效率,并证明相比经典方法具有多项式级加速优势。
English: This paper introduces a novel quantum algorithm for Gaussian process quadrature that combines quantum phase estimation with principal component analysis, demonstrating both computational efficiency through simulations and a polynomial speed advantage over classical methods.
Authors:Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
Abstract:
Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models - GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models - DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available at https://github.com/NovaSky-AI/SkyThought.
中文:S*是一个创新的混合测试时扩展框架,通过结合并行与顺序扩展策略,并采用自适应输入选择和基于执行的验证机制,显著提升了代码生成性能,使小模型可超越大模型、非推理模型能超过推理模型。
English: S* is a novel hybrid test-time scaling framework that enhances code generation by combining parallel and sequential scaling with adaptive input selection and execution-based verification, significantly boosting performance across various models including enabling smaller models to surpass larger ones and non-reasoning models to exceed reasoning models.
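A pseudocode-level sketch of the selection mechanism described above: generate an input intended to distinguish two candidate programs, execute both, and keep the one the model judges correct given the observed outputs. `generate_distinguishing_input`, `execute`, and `llm_judge` are hypothetical helpers, not the official S* implementation.

```python
def select_best(problem, candidates, generate_distinguishing_input, execute, llm_judge):
    """Tournament over candidate programs using execution-grounded pairwise comparisons."""
    best = candidates[0]
    for challenger in candidates[1:]:
        # Ask the model for an input likely to make the two programs disagree.
        test_input = generate_distinguishing_input(problem, best, challenger)
        out_a, out_b = execute(best, test_input), execute(challenger, test_input)
        if out_a == out_b:
            continue                      # not distinguishing; keep the incumbent
        # Ground the comparison in the observed outputs and let the model pick.
        best = best if llm_judge(problem, test_input, out_a, out_b) == "A" else challenger
    return best
```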
Authors:Louis Carpentier, Nick Seeuws, Wannes Meert, Mathias Verbeke
Abstract:
dtaianomaly is an open-source Python library for time series anomaly detection, designed to bridge the gap between academic research and real-world applications. Our goal is to (1) accelerate the development of novel state-of-the-art anomaly detection techniques through simple extensibility; (2) offer functionality for large-scale experimental validation; and thereby (3) bring cutting-edge research to business and industry through a standardized API, similar to scikit-learn to lower the entry barrier for both new and experienced users. Besides these key features, dtaianomaly offers (1) a broad range of built-in anomaly detectors, (2) support for time series preprocessing, (3) tools for visual analysis, (4) confidence prediction of anomaly scores, (5) runtime and memory profiling, (6) comprehensive documentation, and (7) cross-platform unit testing.
The source code of dtaianomaly, documentation, code examples and installation guides are publicly available at https://github.com/ML-KULeuven/dtaianomaly.
中文: dtaianomaly 是一个开源的时间序列异常检测Python库,旨在通过类scikit-learn的标准化API降低使用门槛,既支持前沿算法的扩展开发与实验验证,又为工业应用提供丰富的内置功能。
English: dtaianomaly is an open-source Python library for time series anomaly detection that facilitates the development and validation of advanced techniques while providing industry-ready tools through a scikit-learn-like API.
Authors:Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, Tat-Seng Chua
Abstract:
Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at https://github.com/zyttt-coder/SIPO.
Chinese: 该研究提出了一种自改进的直接偏好优化框架,使大语言模型能够自我生成并选择帕累托最优响应,有效解决偏好冲突,在多目标对齐方面相比现有方法实现了更优的性能。
English: The study introduces a self-improving Direct Preference Optimization framework that enables large language models to self-generate and select Pareto-optimal responses, effectively resolving preference conflicts and achieving superior multi-objective alignment compared to existing methods.
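A minimal sketch of the Pareto-optimal selection step: score self-generated candidate responses on several objectives and keep the non-dominated ones as preferred responses. The scores below are toy numbers; the reward models and pairing logic of the actual framework are not reproduced.

```python
import numpy as np

def pareto_front(scores):
    """Return indices of non-dominated rows; higher is better on every objective."""
    keep = []
    for i in range(scores.shape[0]):
        dominated = np.any(np.all(scores >= scores[i], axis=1) &
                           np.any(scores > scores[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Toy example: six candidate responses scored on (helpfulness, harmlessness).
scores = np.array([[0.9, 0.2], [0.7, 0.7], [0.4, 0.9],
                   [0.6, 0.6], [0.3, 0.3], [0.8, 0.5]])
print(pareto_front(scores))   # [0, 1, 2, 5]: kept as Pareto-optimal chosen responses
```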
Authors:Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li
Abstract:
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
Chinese: 提出的STeCa框架通过步骤级奖励比较识别次优行动,并利用基于大语言模型的反思构建校准轨迹,有效解决了现有方法在长周期任务中的不足,显著提升了智能体的任务完成能力和鲁棒性。
English: The proposed STeCa framework addresses the limitations of existing LLM agent training methods by identifying suboptimal actions through step-level reward comparisons and constructing calibrated trajectories via LLM-driven reflection, significantly enhancing performance and robustness in long-horizon tasks.
Authors:Gengxu Li, Yuan Wu
Abstract:
Source-free unsupervised domain adaptation (SFUDA) has gained significant attention as an alternative to traditional unsupervised domain adaptation (UDA), which relies on the constant availability of labeled source data. However, SFUDA approaches come with inherent limitations that are frequently overlooked. These challenges include performance degradation when the unlabeled target data fails to meet critical assumptions, such as having a closed-set label distribution identical to that of the source domain, or when sufficient unlabeled target data is unavailable, a common situation in real-world applications. To address these issues, we propose an asymmetric co-training (ACT) method specifically designed for the source-free few-shot domain adaptation (SFFSDA) scenario. SFFSDA presents a more practical alternative to SFUDA, as gathering a few labeled target instances is more feasible than acquiring large volumes of unlabeled target data in many real-world contexts. Our ACT method begins by employing a weak-strong augmentation to enhance data diversity. Then we use a two-step optimization process to train the target model. In the first step, we optimize the label smoothing cross-entropy loss, the entropy of the class-conditional distribution, and the reverse-entropy loss to bolster the model's discriminative ability while mitigating overfitting. The second step focuses on reducing redundancy in the output space by minimizing classifier determinacy disparity. Extensive experiments across four benchmarks demonstrate the superiority of our ACT approach, which outperforms state-of-the-art SFUDA methods and transfer learning techniques. Our findings suggest that adapting a source pre-trained model using only a small amount of labeled target data offers a practical and dependable solution. The code is available at https://github.com/gengxuli/ACT.
中文: 提出的非对称协同训练方法通过强弱数据增强和两步优化过程解决了无源无监督域适应的局限性,仅需少量目标域标注数据即可超越现有方法的性能表现。
English: The proposed asymmetric co-training (ACT) method addresses the limitations of source-free unsupervised domain adaptation by using weak-strong data augmentation and a two-step optimization process, demonstrating superior performance over existing methods with only minimal labeled target data.
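A hedged PyTorch sketch of the three loss terms named for the first optimization step (label-smoothing cross-entropy, a prediction-entropy term, and a reverse cross-entropy term). The exact definitions and weights follow the paper and may differ from this simplified version.

```python
import torch
import torch.nn.functional as F

def act_step_one_losses(logits, labels, eps=0.1, log_zero=-4.0):
    """Sketch of the first-step objective terms; exact forms follow the paper."""
    p = F.softmax(logits, dim=1)
    # (1) label-smoothing cross-entropy on the few labeled target samples
    ls_ce = F.cross_entropy(logits, labels, label_smoothing=eps)
    # (2) entropy of the predicted class distribution (encourages confident predictions)
    ent = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1).mean()
    # (3) reverse cross-entropy against one-hot labels, with log(0) clamped to log_zero
    one_hot = F.one_hot(labels, logits.size(1)).float()
    rce = -(p * ((1.0 - one_hot) * log_zero)).sum(dim=1).mean()
    return ls_ce, ent, rce

logits, labels = torch.randn(8, 5), torch.randint(0, 5, (8,))
print([round(v.item(), 3) for v in act_step_one_losses(logits, labels)])
```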
Authors:Shokhrukh Ibragimov, Arnulf Jentzen, Benno Kuckuck
Abstract:
We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets consisting of questions asking for the truth or falsity of first-order logic statements in Zermelo-Fraenkel set theory. While the resolution of these questions does not require any knowledge beyond basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, which can be controlled up to arbitrarily high difficulty by the complexity of the generated statements. Furthermore, we conduct extensive evaluations of the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets, along with the code used for generating them and all data from the evaluations, are publicly available at https://github.com/bkuckuck/logical-skills-of-llms.
Chinese: 本文提出了一种可控制复杂度的生成一阶逻辑语句的方法,并利用该方法创建数据集来评估包括DeepSeek-R1和OpenAI的o3-mini在内的多种大语言模型的逻辑推理能力。
English: This paper introduces a method for generating first-order logic statements with controllable complexity and uses it to create datasets for evaluating the logical reasoning abilities of large language models, including recent ones like DeepSeek-R1 and OpenAI's o3-mini.
Authors:Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Abstract:
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400$\times$, end-to-end speedup of up to 3.7$\times$ as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme. The source code is available here: https://github.com/NVlabs/RocketKV.
中文: RocketKV是一种无需训练的KV缓存压缩方法,通过粗粒度淘汰和细粒度稀疏注意力,在长上下文任务中实现高达400倍压缩和3.7倍加速,同时保持近乎无损的精度。
English: RocketKV is a training-free KV cache compression method that employs coarse-grain eviction and fine-grain sparse attention, achieving up to 400× compression and 3.7× speedup with minimal accuracy loss in long-context tasks.
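A toy emulation of the second-stage idea: keep only the top-k keys per query and attend over that subset. Here the scores are computed exactly for clarity; in the method itself they are approximated via head and sequence dimensionality reductions, which is what makes the step cheap. This is an illustration, not the released implementation.

```python
import torch

def topk_sparse_attention(q, k, v, topk=64):
    """Attend only over the top-k keys per query (dense emulation of the idea)."""
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5     # (B, H, Tq, Tk)
    idx = scores.topk(topk, dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).scatter(-1, idx, 0.0)
    return torch.softmax(scores + mask, dim=-1) @ v

q = torch.randn(1, 4, 1, 64)        # single decode-step query
k = torch.randn(1, 4, 1024, 64)     # cached keys remaining after coarse-grain eviction
v = torch.randn(1, 4, 1024, 64)
print(topk_sparse_attention(q, k, v).shape)   # torch.Size([1, 4, 1, 64])
```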
Authors:Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu
Abstract:
Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address the above challenge, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, and carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of synthetic data generation and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
中文: 该研究提出的视觉拒绝采样框架通过迭代生成基于专家定义概念的可验证视觉特征合成数据,并利用无奖励模型的筛选机制进行微调,有效提升了大规模多模态模型在专业视觉分类任务中的准确性和可解释性。
English: The proposed visual rejection sampling framework enhances Large Multimodal Models' fine-grained reasoning by iteratively generating and fine-tuning with synthetic data featuring human-verifiable visual features, significantly improving both accuracy and explainability in specialized visual tasks.
Authors:Andrew Wang, Steven McDonagh, Mike Davies
Abstract:
Reconstructing MRI from highly undersampled measurements is crucial for accelerating medical imaging, but is challenging due to the ill-posedness of the inverse problem. While supervised deep learning (DL) approaches have shown remarkable success, they traditionally rely on fully-sampled ground truth (GT) images, which are expensive or impossible to obtain in real scenarios. This problem has created a recent surge in interest in self-supervised learning methods that do not require GT. Although recent methods are now fast approaching "oracle" supervised performance, the lack of systematic comparison and standard experimental setups are hindering targeted methodological research and precluding widespread trustworthy industry adoption. We present SSIBench, a modular and flexible comparison framework to unify and thoroughly benchmark Self-Supervised Imaging methods (SSI) without GT. We evaluate 18 methods across 4 realistic MRI scenarios on real data, showing a wide performance landscape whose method ranking differs across scenarios and metrics, exposing the need for further SSI research. Our insights also show how complementary methods could be compounded for future improvements, exemplified by a novel loss we propose, Multi-Operator Equivariant Imaging. To accelerate reproducible research and lower the barrier to entry, we provide the extensible benchmark and open-source reimplementations of all methods at https://andrewwango.github.io/ssibench, allowing researchers to rapidly and fairly contribute and evaluate new methods on the standardised setup for potential leaderboard ranking, or benchmark existing methods on custom datasets, forward operators, or models, unlocking the application of SSI to other valuable GT free domains such as 4D MRI and other nascent scientific imaging modalities.
Authors:Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, Lianwen Jin
Abstract:
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.
Chinese: LGGPT是一种基于大语言模型的统一布局生成模型,通过任意布局指令和通用布局响应简化输入输出模板,并结合区间量化编码压缩布局数据,仅用1.5B参数就在性能与效率上超越更大模型,实现了布局生成任务的最优平衡。
English: LGGPT is a unified layout generation model that introduces Arbitrary Layout Instruction and Universal Layout Response to streamline input-output templates, along with Interval Quantization Encoding to compress layout data, achieving superior performance with a compact 1.5B parameter LLM that outperforms larger models in efficiency and effectiveness.
Authors:Masane Fuchi, Tomohiro Takagi
Abstract:
Studies have been conducted to prevent specific concepts from being generated from pretrained text-to-image generative models, achieving concept erasure in various ways. However, the performance evaluation of these studies is still largely reliant on visualization, with the superiority of studies often determined by human subjectivity. The metrics of quantitative evaluation also vary, making comprehensive comparisons difficult. We propose EraseEval, an evaluation method that differs from previous evaluation methods in that it involves three fundamental evaluation criteria: (1) how well the prompt containing the target concept is reflected, (2) to what extent concepts related to the erased concept can reduce the impact of the erased concept, and (3) whether other concepts are preserved. These criteria are evaluated and integrated into a single metric, such that a lower score is given if any of the evaluations are low, leading to a more robust assessment. We experimentally evaluated baseline concept erasure methods, organized their characteristics, and identified challenges with them. Although these are fundamental evaluation criteria, some concept erasure methods failed to achieve high scores on them, which points toward future research directions for concept erasure methods. Our code is available at https://github.com/fmp453/erase-eval.
中文: 本文提出EraseEval评估框架,通过将三个核心标准整合为单一指标来系统评估文本到图像模型的概念消除效果,实验发现现有方法在此标准下存在不足,为未来研究指明了方向。
English: This paper introduces EraseEval, a novel evaluation framework for concept erasure in text-to-image models that integrates three key criteria into a single robust metric, addressing limitations in current visualization-dependent assessments and revealing challenges in existing methods through experimental analysis.
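The abstract states only that the three criteria are folded into one metric in which any low score drags the overall score down; the exact aggregation is not given here, so the sketch below assumes a geometric mean, which has that property.

```python
def combine_criteria(prompt_reflection, related_concept_suppression, preservation):
    """Aggregate three scores in [0, 1]; a geometric mean is assumed for illustration."""
    scores = [prompt_reflection, related_concept_suppression, preservation]
    prod = 1.0
    for s in scores:
        prod *= max(s, 0.0)
    return prod ** (1.0 / len(scores))

print(combine_criteria(0.9, 0.9, 0.9))   # ~0.90: uniformly good erasure
print(combine_criteria(0.9, 0.9, 0.1))   # ~0.43: one weak criterion pulls the metric down
```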
Authors:Yan Huang, Yongru Chen, Lei Cao, Yongnian Cao, Xuechun Yang, Yilin Dong, Tianyu Liu
Abstract:
In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential (SSVEP)-based Brain-Computer Interface (BCI) systems, and DL methods have been successfully applied to SSVEP-BCI. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepFormerNet adeptly extracts multi-scale temporal information from time series data using parallel convolution kernels of varying sizes, accurately capturing the subtle variations and critical features within SSVEP signals. Furthermore, the model integrates the multi-head attention mechanism from the Transformer architecture, which not only provides insights into global dependencies but also significantly enhances the understanding and representation of complex patterns. Additionally, it takes advantage of filter bank techniques to extract features based on the spectral characteristics of SSVEP data. To validate the effectiveness of the proposed model, we conducted experiments on two public datasets. The experimental results show that IncepFormerNet achieves an accuracy of 87.41% on Dataset 1 and 71.97% on Dataset 2 using a 1.0-second time window. To further verify the superiority of the proposed model, we compared it with other deep learning models, and the results indicate that our method achieves significantly higher accuracy than the others. The source code for this work is available at: https://github.com/CECNL/SSVEP-DAN.
中文: 本研究提出了一种名为IncepFormerNet的混合模型,融合了Inception和Transformer架构,能有效提取SSVEP信号的多尺度时间特征和全局依赖关系,在公开数据集上相比其他深度学习方法取得了更高的准确率。
English: This study introduces IncepFormerNet, a hybrid model combining Inception and Transformer architectures, which effectively captures multi-scale temporal features and global dependencies in SSVEP signals, achieving superior accuracy on public datasets compared to other deep learning methods.
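A hedged PyTorch sketch of the Inception-plus-attention idea: parallel 1-D convolutions with different kernel sizes over the EEG time axis, followed by multi-head self-attention. Channel counts, kernel sizes, and the missing filter-bank front end are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class ToyInceptionAttention(nn.Module):
    def __init__(self, in_ch=8, branch_ch=16, kernels=(3, 7, 15, 31), n_heads=4, n_classes=40):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, branch_ch, k, padding=k // 2) for k in kernels)
        d = branch_ch * len(kernels)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.head = nn.Linear(d, n_classes)

    def forward(self, x):                                          # x: (B, channels, time)
        feats = torch.cat([b(x) for b in self.branches], dim=1)    # multi-scale temporal features
        feats = feats.transpose(1, 2)                              # (B, time, d)
        attended, _ = self.attn(feats, feats, feats)               # global dependencies
        return self.head(attended.mean(dim=1))                     # class logits

print(ToyInceptionAttention()(torch.randn(2, 8, 250)).shape)   # e.g. 1.0 s at 250 Hz -> [2, 40]
```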
Authors:Arjun Gupta, Rishik Sathua, Saurabh Gupta
Abstract:
Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop training-free framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM employs an RGB-D wrist camera and uses visual servoing for control. Our novelty lies in the use of state-of-the-art vision models to reliably compute 3D targets from the wrist image for diverse tasks and under occlusion due to the end-effector. To mitigate occlusion artifacts, we employ vision models to out-paint the end-effector thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module to identify semantic targets (e.g. knobs) and point tracking methods can reliably track interaction sites indicated by user clicks. This training-free method obtains an 85% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method and an imitation learning baseline trained on 1000+ demonstrations by an absolute success rate of 50%.
Authors:Jingwang Huang, Jiang Zhong, Qin Lei, Jinpeng Gao, Yuming Yang, Sirui Wang, Peiguang Li, Kaiwen Wei
Abstract:
Multimodal multi-label emotion recognition (MMER) aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies primarily focus on improving fusion strategies and modeling modality-to-label dependencies. However, they often overlook the impact of aleatoric uncertainty, which is the inherent noise in the multimodal data and hinders the effectiveness of modality fusion by introducing ambiguity into feature representations. To address this issue and effectively model aleatoric uncertainty, this paper proposes the Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework from a novel perspective of latent emotional space probabilistic modeling. Specifically, we introduce a contrastive disentangled distribution mechanism within the emotion space to model the multimodal data, allowing for the extraction of semantic features and uncertainty. Furthermore, we design an uncertainty-aware fusion multimodal method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experimental results show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M$^3$ED datasets, highlighting the importance of uncertainty modeling in MMER. Code is available at https://github.com/201983290498/lddu_mmer.git.
中文摘要:LDDU框架通过潜在情感空间概率建模和不确定性感知融合方法,有效解决多模态多标签情感识别中的随机不确定性,在基准数据集上取得了最优性能。
English Summary: The LDDU framework addresses aleatoric uncertainty in multimodal multi-label emotion recognition by modeling latent emotional distributions and employing uncertainty-aware fusion, achieving state-of-the-art results on benchmark datasets.
Authors:Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.
中文: LongPO通过让短上下文大语言模型利用内部能力转移和自生成的偏好数据来自我进化,以胜任长上下文任务,在保持短上下文性能的同时,实现了与GPT-4-128K等先进模型相媲美甚至更优的长上下文表现。
English: LongPO enables short-context LLMs to self-evolve for long-context tasks by leveraging internal capability transfer and self-generated preference data, maintaining short-context performance while achieving superior results comparable to advanced models like GPT-4-128K.
Authors:Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue
Abstract:
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.
中文: 本文提出了DataSciBench,这是一个通过半自动化流程和任务-函数-代码框架来全面评估大语言模型在数据科学领域能力的新型基准测试,实验表明基于API的模型在所有指标上均优于开源模型。
English: This paper introduces DataSciBench, a novel benchmark designed to comprehensively assess Large Language Models' capabilities in data science through a semi-automated pipeline and a Task-Function-Code framework, revealing that API-based models consistently outperform open-source alternatives.
Authors:Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
Abstract:
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
Chinese: Mixture-of-Memories (MoM) 架构通过采用带路由网络的多个独立记忆状态,显著提升了线性序列模型在记忆密集型任务上的性能,同时保持了线性训练复杂度和常数推理复杂度的优势。
English: The Mixture-of-Memories (MoM) architecture enhances linear sequence models by employing multiple independent memory states with a router network, significantly improving performance on recall-intensive tasks while maintaining linear training and constant inference complexity.
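A toy sketch of the mixture-of-memories idea: a router sends each token to one of several matrix-valued memories updated with linear-attention-style outer products, and the output is read from the chosen memory. The recurrence, top-1 routing, and missing decay/normalization are simplifications, not the released architecture.

```python
import torch
import torch.nn as nn

class ToyMixtureOfMemories(nn.Module):
    def __init__(self, d=64, n_mem=4):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.router = nn.Linear(d, n_mem)
        self.n_mem = n_mem

    def forward(self, x):                                    # x: (B, T, d)
        B, T, d = x.shape
        mem = x.new_zeros(B, self.n_mem, d, d)               # independent memory states
        outs = []
        for t in range(T):
            q, k, v = self.qkv(x[:, t]).chunk(3, dim=-1)
            route = self.router(x[:, t]).argmax(dim=-1)      # top-1 memory per token
            mem = mem.clone()                                # keep the update out-of-place
            mem[torch.arange(B), route] += torch.einsum("bi,bj->bij", k, v)   # write
            outs.append(torch.einsum("bi,bij->bj", q, mem[torch.arange(B), route]))  # read
        return torch.stack(outs, dim=1)                      # (B, T, d)

print(ToyMixtureOfMemories()(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```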
Authors:Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao
Abstract:
Scaling up executable code data is significant for improving language models' software engineering capability. The intricate nature of the process makes it labor-intensive, time-consuming and expert-knowledge-dependent to build a large number of executable code repositories, limiting the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, which is an essential yet underexplored task. To mitigate the gap, we introduce Repo2Run, the first LLM-based agent aiming at automating the building of executable test environments for any repositories at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the feedback of the building, and synthesizes the Dockerfile until the entire pipeline is executed successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.
中文: 扩展可执行代码数据对提升语言模型的软件工程能力至关重要,Repo2Run作为首个基于大语言模型的代理,能自动为代码仓库构建测试环境,成功率达到86.0%,远超现有方法。
English: Scaling executable code data is crucial for enhancing language models' software engineering capabilities, and Repo2Run, an LLM-based agent, automates the building of test environments for repositories, achieving an 86.0% success rate and significantly outperforming existing methods.
Authors:Ziming Hong, Yongli Xiang, Tongliang Liu
Abstract:
Over the past decades, researchers have primarily focused on improving the generalization abilities of models, with limited attention given to regulating such generalization. However, the ability of models to generalize to unintended data (e.g., harmful or unauthorized data) can be exploited by malicious adversaries in unforeseen ways, potentially resulting in violations of model ethics. Non-transferable learning (NTL), a task aimed at reshaping the generalization abilities of deep learning models, was proposed to address these challenges. While numerous methods have been proposed in this field, a comprehensive review of existing progress and a thorough analysis of current limitations remain lacking. In this paper, we bridge this gap by presenting the first comprehensive survey on NTL and introducing NTLBench, the first benchmark to evaluate NTL performance and robustness within a unified framework. Specifically, we first introduce the task settings, general framework, and criteria of NTL, followed by a summary of NTL approaches. Furthermore, we emphasize the often-overlooked issue of robustness against various attacks that can destroy the non-transferable mechanism established by NTL. Experiments conducted via NTLBench verify the limitations of existing NTL methods in robustness. Finally, we discuss the practical applications of NTL, along with its future directions and associated challenges.
Chinese Summary: 本文首次对不可迁移学习进行全面综述,提出了首个评估NTL方法性能与鲁棒性的基准NTLBench,同时揭示了现有方法在抗攻击鲁棒性方面的局限性。
English Summary: This paper presents the first comprehensive survey on non-transferable learning (NTL), introducing NTLBench as the inaugural benchmark to evaluate NTL methods' performance and robustness while highlighting their limitations against attacks.
Authors:Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H. Chi, Julian McAuley, Derek Zhiyuan Cheng
Abstract:
Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Our code is available at: https://github.com/google-deepmind/action_piece.
中文摘要:传统生成式推荐模型独立地对动作进行标记化,而提出的ActionPiece方法通过引入项目特征和共现模式来增强上下文感知能力,从而提升标记化准确性。
English Summary: Generative recommendation models traditionally tokenize actions independently, but the proposed ActionPiece method enhances context-awareness by incorporating item features and co-occurrence patterns to improve tokenization accuracy.
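A toy illustration of the vocabulary-construction step: count how often feature pairs co-occur within a single action's feature set and across adjacent sets, then merge the most frequent pair into a new token (the real method repeats this to grow the vocabulary). The feature names and data are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Each action is an unordered set of item features; a sequence is a list of actions.
sequences = [
    [{"brand:A", "color:red"}, {"brand:A", "color:blue"}, {"brand:B", "color:red"}],
    [{"brand:A", "color:red"}, {"brand:B", "color:red"}],
]

def count_pairs(sequences):
    counts = Counter()
    for seq in sequences:
        for feats in seq:                                    # within a single feature set
            counts.update(frozenset(p) for p in combinations(sorted(feats), 2))
        for prev, nxt in zip(seq, seq[1:]):                  # across adjacent sets
            counts.update(frozenset((a, b)) for a in prev for b in nxt if a != b)
    return counts

pair, freq = count_pairs(sequences).most_common(1)[0]
print(f"merge {sorted(pair)} into a new token (co-occurred {freq} times)")
```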
Authors:Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
Abstract:
Test-time compute scaling has emerged as a new axis along which to improve model accuracy, where additional computation is used at inference time to allow the model to think longer for more challenging problems. One promising approach for test-time compute scaling is search against a process reward model, where a model generates multiple potential candidates at each step of the search, and these partial trajectories are then scored by a separate reward model in order to guide the search process. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. However, this diversity comes at a cost, as divergent trajectories have less KV sharing, which means they consume more memory and slow down the search process. Previous search methods either do not perform sufficient exploration, or else explore diverse trajectories but have high latency. We address this challenge by proposing Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories. ETS incorporates a linear programming cost model to promote KV cache sharing by penalizing the number of nodes retained, while incorporating a semantic coverage term into the cost model to ensure that we retain trajectories which are semantically different. We demonstrate how ETS can achieve 1.8$\times$ reduction in average KV cache size during the search process, leading to 1.4$\times$ increased throughput relative to prior state-of-the-art methods, with minimal accuracy degradation and without requiring any custom kernel implementation. Code is available at: https://github.com/SqueezeAILab/ETS.
Chinese: 高效树搜索(ETS)通过剪枝冗余轨迹来优化测试时计算扩展,增强KV缓存共享,从而在保持精度的同时降低内存使用并提升吞吐量。
English: Efficient Tree Search (ETS) is a method that optimizes test-time compute scaling by pruning redundant trajectories to enhance KV cache sharing, thereby reducing memory usage and increasing throughput without significant accuracy loss.
Authors:Yuan Yao, Xiaopu Zhang, Yu Zhang, Jian Jin, Qiang Yang
Abstract:
Semi-supervised heterogeneous domain adaptation (SHDA) addresses learning across domains with distinct feature representations and distributions, where source samples are labeled while most target samples are unlabeled, with only a small fraction labeled. Moreover, there is no one-to-one correspondence between source and target samples. Although various SHDA methods have been developed to tackle this problem, the nature of the knowledge transferred across heterogeneous domains remains unclear. This paper delves into this question from an empirical perspective. We conduct extensive experiments on about 330 SHDA tasks, employing two supervised learning methods and seven representative SHDA methods. Surprisingly, our observations indicate that both the category and feature information of source samples do not significantly impact the performance of the target domain. Additionally, noise drawn from simple distributions, when used as source samples, may contain transferable knowledge. Based on this insight, we perform a series of experiments to uncover the underlying principles of transferable knowledge in SHDA. Specifically, we design a unified Knowledge Transfer Framework (KTF) for SHDA. Based on the KTF, we find that the transferable knowledge in SHDA primarily stems from the transferability and discriminability of the source domain. Consequently, ensuring those properties in source samples, regardless of their origin (e.g., image, text, noise), can enhance the effectiveness of knowledge transfer in SHDA tasks. The codes and datasets are available at https://github.com/yyyaoyuan/SHDA.
中文: 本研究揭示了在半监督异构域自适应中,可迁移知识主要源于源域的可迁移性和可区分性,而非其具体内容,并证明只要确保这些特性,即便是合成噪声也能促进有效的知识迁移。
English: This study reveals that in semi-supervised heterogeneous domain adaptation, transferable knowledge primarily depends on the source domain's transferability and discriminability, rather than its specific content, and demonstrates that even synthetic noise can facilitate effective knowledge transfer when these properties are ensured.
Authors:Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaption (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM implemented by structured pruning combined with 4-bit quantization, for LLaMA-3.1-70B (LLaMA-2-70B), reduces the parameter storage cost that dominates the memory usage in low-rank matrix training by 15.81$\times$ (16.95$\times$), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B). Code is available at https://github.com/junzhang-zj/LoRAM.
中文: 大语言模型得益于LoRA的高效微调,但其内存使用受限于原始参数,因此LoRAM提出在剪枝后的小模型上训练,通过恢复矩阵进行推理,以降低内存需求并保持性能。
English: Large Language Models benefit from LoRA's efficient fine-tuning, but its memory use is constrained by original parameters, so LoRAM introduces training on a pruned model to reduce memory demands while maintaining performance through recovered matrices for inference.
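A toy numpy sketch of the prune-train-recover workflow: fit low-rank factors against a row/column-pruned weight, then scatter them back to the full model's dimensions so the original (large) weight can be used with the recovered adapter at inference. This is a conceptual illustration with random numbers, not the released pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 12, 10, 2
W_full = rng.normal(size=(d_out, d_in))                 # frozen original weight

# Structured pruning: keep a subset of rows/columns for memory-cheap LoRA training.
keep_out = np.sort(rng.choice(d_out, size=8, replace=False))
keep_in = np.sort(rng.choice(d_in, size=6, replace=False))
W_pruned = W_full[np.ix_(keep_out, keep_in)]

# Low-rank adapters trained on the pruned model (training loop omitted here).
B_small = rng.normal(size=(len(keep_out), r)) * 0.01
A_small = rng.normal(size=(r, len(keep_in))) * 0.01

# Recovery: embed the pruned factors back into full-size adapters (zeros elsewhere)
# and combine them with the original weight for inference.
B_rec = np.zeros((d_out, r)); B_rec[keep_out] = B_small
A_rec = np.zeros((r, d_in)); A_rec[:, keep_in] = A_small
W_inference = W_full + B_rec @ A_rec
print(W_inference.shape)   # (12, 10): original dimensions, now carrying the adapter
```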
Authors:Dongki Kim, Wonbin Lee, Sung Ju Hwang
Abstract:
Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in task transfer, they often struggle to accurately analyze molecular features due to limited knowledge and reasoning capabilities. To address this issue, we present Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules and exhibits explainability and reasoning ability. To this end, we design key data types that encompass the fundamental molecular features, taking into account the essential abilities for molecular reasoning. Further, to improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and providing informative responses, implying its potential as a general-purpose assistant for molecular analysis. Our project page is at https://mol-llama.github.io/.
Authors:Jialin Ouyang
Abstract:
Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates an unbounded number of unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.
Chinese Summary: 大型语言模型在面对无解数学题时常会自信地给出错误答案,TreeCut数据集通过系统生成不可解问题,揭示了GPT-4o等模型在零样本条件下最高达64%的幻觉产生率。
English Summary: Large language models frequently provide confident but incorrect answers to unsolvable math problems, as demonstrated by the TreeCut dataset, which reveals hallucination rates up to 64% in models like GPT-4o under specific conditions.
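A toy illustration of the construction: treat a word problem as a tree whose root is the asked-for quantity and whose leaves are the given conditions; deleting a leaf that the root depends on makes the question unanswerable. The quantities below are invented; this is not the released generator.

```python
# Root and intermediate quantities are computed from children; leaves are given conditions.
tree = {
    "total_cost": ("sum", ["apple_cost", "pear_cost"]),
    "apple_cost": ("product", ["apple_count", "apple_price"]),
    "pear_cost": ("product", ["pear_count", "pear_price"]),
}
given = {"apple_count": 3, "apple_price": 2, "pear_count": 4, "pear_price": 5}

def answerable(node, tree, given):
    if node in given:
        return True
    if node not in tree:
        return False
    _, children = tree[node]
    return all(answerable(child, tree, given) for child in children)

print(answerable("total_cost", tree, given))   # True: every necessary condition is present
del given["pear_price"]                        # cut one necessary condition from the tree
print(answerable("total_cost", tree, given))   # False: the question is now unanswerable
```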
Authors:Vishal Dey, Xiao Hu, Xia Ning
Abstract:
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLMOs, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLMOs consistently outperform state-of-the-art baselines. GeLLMOs also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLMOs as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.
中文: 本研究推出了首个多属性分子优化指令数据集MuMOInstruct,并开发了GeLLMOs模型,该模型在多项任务中超越现有方法,展现出卓越的零样本泛化能力,为复杂分子优化提供了无需重复训练的高效解决方案。
English: This study introduces MuMOInstruct, a dataset for multi-property molecule optimization, and develops GeLLMOs, an instruction-tuned LLM that outperforms existing methods and demonstrates strong zero-shot generalization to novel tasks, offering a resource-efficient solution for complex optimization challenges.
Authors:Noel Ngu, Aditya Taparia, Gerardo I. Simari, Mario Leiva, Jack Corcoran, Ransalu Senanayake, Paulo Shakarian, Nathaniel D. Bastian
Abstract:
Machine learning models assume that training and test samples are drawn from the same distribution. As such, significant differences between training and test distributions often lead to degradations in performance. We introduce Multiple Distribution Shift -- Aerial (MDS-A) -- a collection of inter-related datasets of the same aerial domain that are perturbed in different ways to better characterize the effects of out-of-distribution performance. Specifically, MDS-A is a set of simulated aerial datasets collected under different weather conditions. We include six datasets under different simulated weather conditions along with six baseline object-detection models, as well as several test datasets that are a mix of weather conditions that we show have significant differences from the training data. In this paper, we present characterizations of MDS-A, provide performance results for the baseline machine learning models (on both their specific training datasets and the test data), as well as results of the baselines after employing recent knowledge-engineering error-detection techniques (EDR) thought to improve out-of-distribution performance. The dataset is available at https://lab-v2.github.io/mdsa-dataset-website.
Authors:Sangwoong Yoon, Himchan Hwang, Hyeokju Jeong, Dong Kyu Shin, Che-Sang Park, Sehee Kweon, Frank Chongwoo Park
Abstract:
We propose the Value Gradient Sampler (VGS), a trainable sampler based on the interpretation of sampling as discrete-time sequential decision-making. VGS generates samples from a given unnormalized density (i.e., energy) by drifting and diffusing randomly initialized particles. In VGS, finding the optimal drift is equivalent to solving an optimal control problem where the cost is the upper bound of the KL divergence between the target density and the samples. We employ value-based dynamic programming to solve this optimal control problem, which gives the gradient of the value function as the optimal drift vector. The connection to sequential decision making allows VGS to leverage extensively studied techniques in reinforcement learning, making VGS a fast, adaptive, and accurate sampler that achieves competitive results in various sampling benchmarks. Furthermore, VGS can replace MCMC in contrastive divergence training of energy-based models. We demonstrate the effectiveness of VGS in training accurate energy-based models in industrial anomaly detection applications.
Chinese: 价值梯度采样器(VGS)是一种可训练的采样方法,将采样视为序列决策过程,通过基于价值的动态规划优化粒子漂移,实现高效精确的采样,在基准测试中表现优异,并能有效训练基于能量的模型,适用于工业异常检测等场景。
English: The Value Gradient Sampler (VGS) is a trainable sampling method that frames sampling as sequential decision-making, using value-based dynamic programming to optimize particle drift for efficient and accurate sampling, achieving competitive results in benchmarks and enabling effective training of energy-based models in applications like anomaly detection.
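As a rough illustration of the drift-and-diffuse loop described above, the following sketch moves particles along the gradient of a value network and adds Gaussian noise; the value_net, the fixed step sizes, and the absence of a time-dependent schedule are simplifying assumptions, not the released VGS implementation.

import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))  # hypothetical V

def vgs_step(x, step_size=0.05, noise_scale=0.1):
    """One sampling step: drift along -grad V(x), then add diffusion noise."""
    x = x.detach().requires_grad_(True)
    v = value_net(x).sum()
    (grad,) = torch.autograd.grad(v, x)
    drift = -step_size * grad                      # drift given by the value gradient
    diffusion = noise_scale * torch.randn_like(x)  # stochastic exploration term
    return (x + drift + diffusion).detach()

particles = torch.randn(128, 2)                    # randomly initialized particles
for _ in range(100):
    particles = vgs_step(particles)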
Authors:Jake C. Snell, Thomas L. Griffiths
Abstract:
As machine learning-based prediction systems are increasingly used in high-stakes situations, it is important to understand how such predictive models will perform upon deployment. Distribution-free uncertainty quantification techniques such as conformal prediction provide guarantees about the loss black-box models will incur even when the details of the models are hidden. However, such methods are based on frequentist probability, which unduly limits their applicability. We revisit the central aspects of conformal prediction from a Bayesian perspective and thereby illuminate the shortcomings of frequentist guarantees. We propose a practical alternative based on Bayesian quadrature that provides interpretable guarantees and offers a richer representation of the likely range of losses to be observed at test time.
中文: 该摘要提出了一种贝叶斯替代方法,取代了频率主义的共形预测技术,为高风险应用中的机器学习模型提供了可解释的保证和更丰富的潜在损失范围描述。
English: This abstract proposes a Bayesian alternative to frequentist conformal prediction methods, offering interpretable guarantees and a richer representation of potential losses for machine learning models in high-stakes applications.
Authors:Junyi Guan, Abhijith Sharma, Chong Tian, Salem Lahlou
Abstract:
Spiking Neural Networks (SNNs) are increasingly explored for their energy efficiency and robustness in real-world applications, yet their privacy risks remain largely unexamined. In this work, we investigate the susceptibility of SNNs to Membership Inference Attacks (MIAs) -- a major privacy threat where an adversary attempts to determine whether a given sample was part of the training dataset. While prior work suggests that SNNs may offer inherent robustness due to their discrete, event-driven nature, we find that this resilience diminishes as latency (T) increases. Furthermore, we introduce an input dropout strategy under a black-box setting that significantly enhances membership inference in SNNs. Our findings challenge the assumption that SNNs are inherently more secure: although they are expected to be more robust, our results reveal that SNNs exhibit privacy vulnerabilities comparable to those of Artificial Neural Networks (ANNs). Our code is available at https://github.com/sharmaabhijith/MIA_SNN.
中文: 脉冲神经网络(SNN)易受成员推理攻击,其隐私风险随延迟增加而加剧,与人工神经网络(ANN)相当,尽管人们曾假设其具有固有的鲁棒性。
English: Spiking Neural Networks (SNNs) are vulnerable to membership inference attacks, with privacy risks increasing with latency and comparable to those of Artificial Neural Networks (ANNs), despite assumptions of inherent robustness.
Authors:Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
Abstract:
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.
In this work, we propose a solution that adheres to the ``less structure'' principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
中文摘要:提出的混合块注意力(MoBA)使大型语言模型能够自主决定注意力模式,在长上下文任务中表现优异,同时实现完整与稀疏注意力机制的高效切换。
English Summary: The proposed Mixture of Block Attention (MoBA) enables large language models to autonomously determine attention patterns, achieving superior performance on long-context tasks while efficiently switching between full and sparse attention mechanisms.
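A minimal sketch of the block-attention idea, assuming mean-pooled block summaries as routing keys and ignoring causal masking; it illustrates the top-k block selection principle rather than the MoBA kernels.

import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=2):
    """q, k, v: (seq, dim). Each query attends only to its top-k key blocks."""
    seq, dim = k.shape
    n_blocks = seq // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, dim)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, dim)
    block_keys = k_blocks.mean(dim=1)                 # (n_blocks, dim) summary per block
    gate = q @ block_keys.T                           # (seq, n_blocks) routing scores
    chosen = gate.topk(top_k, dim=-1).indices         # (seq, top_k) selected blocks
    out = torch.zeros_like(q)
    for i in range(q.shape[0]):                       # per-query gather (clarity over speed)
        ks = k_blocks[chosen[i]].reshape(-1, dim)
        vs = v_blocks[chosen[i]].reshape(-1, dim)
        attn = F.softmax(q[i] @ ks.T / dim ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out

q = k = v = torch.randn(256, 32)
print(block_sparse_attention(q, k, v).shape)          # torch.Size([256, 32])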
Authors:Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang
Abstract:
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods use a mixed-precision scheme that leverages an unstructured, fine-grained mask to explicitly distinguish salient weights, which introduces an extra 1 bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask, based on input activations and costing a negligible additional 0.0002 bits per weight, that reduces the upper bound of quantization error by allocating the corresponding salient weight channels to 4-bit. To binarize the non-salient channels, we then present an efficient block-wise scaling-factor optimization framework that takes implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty of per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Codes are available at https://github.com/zjq0455/PTQ1.61.
Chinese: 大语言模型在超低位(低于2位)量化时性能严重下降,而提出的PTQ1.61方法通过结构化掩码和量化预处理优化权重分布,首次实现了1.61位量化的最先进性能。
English: Large Language Models experience significant performance loss with sub 2-bit quantization, but the proposed PTQ1.61 method achieves state-of-the-art 1.61-bit quantization by introducing structured masking and quantization preprocessing to optimize weight distribution.
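The salient/non-salient split can be illustrated with a toy per-channel scheme: a few channels chosen by input-activation magnitude stay at 4-bit while the rest are binarized with per-block scales. The selection rule, thresholds, and block size below are illustrative assumptions, not the PTQ1.61 recipe.

import torch

def quantize_weight(w, act_scale, n_salient=8, block=32):
    """w: (out, in); act_scale: (in,) per-channel input-activation magnitude."""
    salient = act_scale.topk(n_salient).indices               # channels kept at 4-bit
    salient_set = set(salient.tolist())
    rest = [c for c in range(w.shape[1]) if c not in salient_set]
    wq = torch.empty_like(w)
    # 4-bit symmetric quantization for salient input channels
    ws = w[:, salient]
    scale4 = ws.abs().amax() / 7
    wq[:, salient] = torch.clamp((ws / scale4).round(), -8, 7) * scale4
    # binarization with block-wise scaling for the remaining channels
    wr = w[:, rest]
    for s in range(0, wr.shape[1], block):
        blk = wr[:, s:s + block]
        alpha = blk.abs().mean()                               # per-block scaling factor
        wr[:, s:s + block] = alpha * blk.sign()
    wq[:, rest] = wr
    return wq

w = torch.randn(16, 64)
act_scale = torch.rand(64)
print(quantize_weight(w, act_scale).shape)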
Authors:Jiaqi Zhao, Ming Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie
Abstract:
Post-training Quantization (PTQ) has been extensively adopted for large language model (LLM) compression owing to its efficiency and low resource requirements. However, current research lacks an in-depth analysis of the strengths and applicable scenarios of each PTQ strategy. In addition, existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. To clear up this confusion, we provide a novel benchmark for LLM PTQ in this paper. Firstly, in order to support our benchmark, we propose a comprehensive taxonomy for existing mainstream methods by scrutinizing their computational strategies (e.g., optimization-based, compensation-based, etc.). Then, we conduct extensive experiments with the baseline within each class, covering models of various sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba), and modalities (LLaVA1.5 and VILA1.5) on a wide range of evaluation metrics. Through comparative analysis of the results, we summarize the strengths of each PTQ strategy and the model-size/bitwidth trade-off with respect to performance. For example, our benchmark reveals that compensation-based techniques demonstrate outstanding cross-architecture robustness, and that extremely low-bit PTQ for ultra-large models should be reexamined. Finally, we argue that a practical combination of compensation-based and other PTQ strategies can achieve state-of-the-art robustness across settings. We believe that our benchmark will provide valuable recommendations for the deployment of LLMs and future research on PTQ approaches. We maintain a repository for our benchmark at https://github.com/zjq0455/PTQ_Benchmark.
中文: 本文为大型语言模型的后训练量化提出了一套全面的基准测试,通过分析不同模型规模、架构和比特宽度的量化策略,揭示了补偿类方法的优越跨架构鲁棒性,并指出了超低比特量化的局限性。
English: This paper introduces a comprehensive benchmark for post-training quantization (PTQ) of large language models, analyzing various strategies across different model sizes, architectures, and bitwidths to identify optimal approaches and trade-offs, while highlighting the robustness of compensation-based techniques.
Authors:Yuze Zhao, Tianyun Ji, Wenjun Feng, Zhenya Huang, Qi Liu, Zhiding Liu, Yixiao Ma, Kai Zhang, Enhong Chen
Abstract:
Reasoning ability is one of the most enigmatic and captivating aspects of large language models (LLMs). Numerous studies are dedicated to exploring and expanding the boundaries of this reasoning capability. However, tasks that embody both reasoning and recall characteristics are often overlooked. In this paper, we introduce such a novel task, code reasoning, to provide a new perspective on the reasoning abilities of LLMs. We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks. Our testing on these benchmarks reveals that LLMs continue to struggle with identifying satisfactory reasoning pathways. Additionally, we present a new pathway exploration pipeline inspired by human intricate problem-solving methods. This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) proposing potential hypotheses based on observations and decomposing them; (2) utilizing tools to validate hypotheses and reflection outcomes; (3) revising hypotheses in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi-step reasoning, resulting in performance gains of up to $3\times$. Finally, we expand this pipeline by applying it to simulate complex household tasks in real-world scenarios, specifically in VirtualHome, enhancing the handling of failure cases. We release our code and all of our results at https://github.com/TnTWoW/code_reasoning.
中文摘要:本文提出了一种新的代码推理任务来评估大型语言模型的推理能力,并设计了一种反思性假设分解与修正流程,通过减少多步推理中的逻辑链断裂问题显著提升了性能。
English Summary: This paper introduces a novel code reasoning task to assess LLMs' reasoning abilities, proposing a Reflective Hypothesis Decomposition and Amendment pipeline that significantly improves performance by mitigating logical chain collapses in multi-step reasoning.
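The propose/validate/revise loop can be sketched as below; llm and run_tests are hypothetical stand-ins for a language-model call and a tool-based validator, not the released RHDA code.

def rhda(task, llm, run_tests, max_rounds=5):
    """Iteratively refine a hypothesis until the validator accepts it."""
    hypothesis = llm(f"Propose and decompose a hypothesis for: {task}")
    for _ in range(max_rounds):
        feedback = run_tests(hypothesis)              # validate with external tools
        if feedback.passed:
            return hypothesis
        hypothesis = llm(                             # revise in light of observations
            f"Task: {task}\nHypothesis: {hypothesis}\n"
            f"Observed failures: {feedback.report}\nRevise the hypothesis."
        )
    return hypothesis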
Authors:Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, Zhengzhong Tu
Abstract:
The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code in https://github.com/taco-group/Re-Align.
中文: Re-Align框架通过图像检索和扩展的rDPO方法,有效减少视觉语言模型中的跨模态幻觉问题,并显著提升视觉问答任务的性能。
English: The Re-Align framework introduces image retrieval and an extended rDPO method to effectively reduce cross-modal hallucinations in Vision Language Models while enhancing performance in visual question-answering tasks.
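A minimal sketch of a DPO-style objective extended with a visual preference term, in the spirit of rDPO; how the image-conditioned log-probabilities are obtained and how the two terms are weighted are assumptions, not the Re-Align implementation.

import torch
import torch.nn.functional as F

def rdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              logp_img_w, logp_img_l, ref_logp_img_w, ref_logp_img_l,
              beta=0.1, lam=1.0):
    """Inputs are summed log-probabilities of responses under policy/reference.
    The *_img_* terms score the same response conditioned on the preferred vs. a
    retrieved (rejected) image, supplying the visual preference signal."""
    text_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    visual_margin = (logp_img_w - ref_logp_img_w) - (logp_img_l - ref_logp_img_l)
    loss_text = -F.logsigmoid(beta * text_margin)
    loss_visual = -F.logsigmoid(beta * visual_margin)
    return (loss_text + lam * loss_visual).mean()

# toy tensors standing in for per-example log-probabilities
args = [torch.randn(4) for _ in range(8)]
print(rdpo_loss(*args))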
Authors:Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
Abstract:
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.
中文: Magma是一个多模态基础模型,不仅具备视觉语言理解能力,还通过空间-时间智能在数字和物理世界中规划执行任务,在界面导航和机器人操控等任务中创造了最新最优性能。
English: Magma is a multimodal foundation model that extends vision-language capabilities by integrating spatial-temporal intelligence for planning and executing agentic tasks in both digital and physical environments, achieving state-of-the-art performance in UI navigation and robotic manipulation.
Authors:Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Abstract:
Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code available at https://github.com/JingbiaoMei/RGCL
Chinese: 本文提出了一种鲁棒适应框架,通过提升领域内准确性和跨领域泛化能力来增强仇恨表情包检测,同时保留大型多模态模型的视觉语言能力,实现了最先进的性能并通过高质量解释提升了模型可解释性。
English: This paper introduces a robust adaptation framework that enhances hateful meme detection by improving in-domain accuracy and cross-domain generalization while preserving LMMs' vision-language capabilities, achieving state-of-the-art performance and superior interpretability through high-quality rationales.
Authors:Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Abstract:
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
中文: Sailor2是面向东南亚语言的先进多语言模型系列,在性能上可与GPT-4o媲美,并附带完整开发指南以促进包容性语言AI发展。
English: Sailor2 is a family of advanced multilingual models for Southeast Asian languages, achieving competitive performance against GPT-4o and including a comprehensive development guide to foster inclusive language AI.
Authors:Steffen Schneider, Rodrigo González Laiz, Anastasiia Filippova, Markus Frey, Mackenzie Weygandt Mathis
Abstract:
Gradient-based attribution methods aim to explain decisions of deep learning models but so far lack identifiability guarantees. Here, we propose a method to generate attribution maps with identifiability guarantees by developing a regularized contrastive learning algorithm trained on time-series data plus a new attribution method called Inverted Neuron Gradient (collectively named xCEBRA). We show theoretically that xCEBRA has favorable properties for identifying the Jacobian matrix of the data generating process. Empirically, we demonstrate robust approximation of zero vs. non-zero entries in the ground-truth attribution map on synthetic datasets, and significant improvements across previous attribution methods based on feature ablation, Shapley values, and other gradient-based methods. Our work constitutes a first example of identifiable inference of time-series attribution maps and opens avenues to a better understanding of time-series data, such as for neural dynamics and decision-processes within neural networks.
中文:提出的xCEBRA方法通过正则化对比学习和反向神经元梯度,为深度学习模型提供可识别的归因图,在时间序列数据分析中展现出理论保证和优于现有方法的实证效果。
English: The proposed xCEBRA method provides identifiable attribution maps for deep learning models through regularized contrastive learning and inverted neuron gradients, demonstrating theoretical guarantees and empirical improvements over existing methods in time-series data analysis.
Authors:Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
Abstract:
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e., the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e., learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
中文: 本综述探讨了针对测试数据分布变化而设计的开放集文本分类方法,按问题约束和应对策略对方法进行分类,强调持续学习作为关键解决方案,并指出了未来的研究方向。
English: This survey examines open-set text classification methods that address distribution shifts in test data, categorizing approaches by problem constraints and mitigation strategies while highlighting continual learning as a key solution and identifying future research directions.
Authors:Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
Abstract:
Masked language modeling has become a widely adopted unsupervised technique to pre-train large language models (LLMs). However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.
中文摘要:本文提出了一种基于任务信息的反课程掩码方法,通过动态调整掩码比例和选择策略,使模型能聚焦于任务关键特征,在多项自然语言处理任务中实现了显著性能提升。
English Summary: The paper introduces a task-informed anti-curriculum masking approach that dynamically adjusts masking ratios and token selection, significantly improving model performance across multiple NLP tasks by focusing on task-relevant features.
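A cyclic, decaying masking-ratio schedule (hard-to-easy) can be sketched as follows; the cycle count, bounds, and cosine shape are illustrative assumptions rather than the TIACBM hyperparameters.

import math

def masking_ratio(step, total_steps, cycles=4, max_ratio=0.3, min_ratio=0.05):
    """Cosine cycles whose peaks decay linearly over training (anti-curriculum)."""
    progress = step / total_steps
    peak = max_ratio - (max_ratio - min_ratio) * progress          # decaying envelope
    cycle_len = total_steps // cycles
    phase = (step % cycle_len) / cycle_len
    return min_ratio + (peak - min_ratio) * 0.5 * (1 + math.cos(math.pi * phase))

ratios = [masking_ratio(s, 1000) for s in range(0, 1000, 100)]
print([round(r, 3) for r in ratios])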
Authors:Lakshmi Nair, Ian Trase, Mark Kim
Abstract:
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). Flow-of-Options enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic framework developed for autonomously solving Machine Learning (ML) tasks. FoO enforces diversity in LLM solutions through compressed and interpretable task representations, resulting in improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks, as compared to state-of-the-art baselines. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Going beyond tabular classification and regression, we show the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our code is open-sourced at: https://github.com/flagshippioneering/Flow-of-Options.
中文摘要:Flow-of-Options(FoO)是一种新颖的推理方法,通过系统探索多样化解决方案来减少大语言模型的内在偏差,在数据科学任务上实现38.2%-69.2%、在治疗化学任务上实现37.4%-47.9%的性能提升,且单任务成本低于1美元。
English Summary: Flow-of-Options (FoO) is a novel reasoning approach that mitigates LLM biases by systematically exploring diverse solution possibilities, achieving performance improvements of 38.2%-69.2% on data science tasks and 37.4%-47.9% on therapeutic chemistry tasks while costing under $1 per task.
Authors:Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
Abstract:
Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0\% to 81.6\%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.
中文: 本文提出S²R框架,通过教导模型在推理过程中自我验证与自我修正来增强大语言模型的推理能力,仅需少量训练数据即可显著提升准确率。
English: This paper introduces S²R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference, achieving significant accuracy improvements with minimal training data.
Authors:Gianluca Guglielmo, Marc Masana
Abstract:
In real-world applications, machine learning models must reliably detect Out-of-Distribution (OoD) samples to prevent unsafe decisions. Current OoD detection methods often rely on analyzing the logits or the embeddings of the penultimate layer of a neural network. However, little work has been conducted on exploiting the rich information encoded in intermediate layers. To address this, we analyze the discriminative power of intermediate layers and show that they can be used productively for OoD detection. We therefore propose to regularize intermediate layers with an energy-based contrastive loss and to group multiple layers into a single aggregated response. Through a comprehensive evaluation across multiple datasets, we demonstrate that intermediate layer activations improve OoD detection performance.
中文: 机器学习模型可通过基于能量的对比损失和层聚合方法,有效利用中间层信息来提升分布外样本检测性能,多数据集验证了其有效性。
English: Machine learning models can enhance Out-of-Distribution detection by utilizing intermediate layers' information through energy-based contrastive loss and layer aggregation, as validated across multiple datasets.
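As a rough illustration of aggregating intermediate-layer evidence, the sketch below attaches small linear probes to two hidden layers, computes an energy score from each, and averages them; the probe placement, training, and averaging rule are assumptions, not the paper's exact method.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 10))
heads = nn.ModuleList([nn.Linear(64, 10), nn.Linear(64, 10)])  # probes on the two hidden blocks

def ood_score(x):
    """Higher (less negative) energy indicates a more likely OoD sample."""
    feats, scores = x, []
    taps = iter(heads)
    for i, layer in enumerate(backbone):
        feats = layer(feats)
        if i in (1, 3):                                        # after each ReLU block
            logits = next(taps)(feats)
            scores.append(-torch.logsumexp(logits, dim=-1))    # energy of that layer
    return torch.stack(scores, dim=0).mean(dim=0)              # aggregate across layers

print(ood_score(torch.randn(8, 32)).shape)                     # torch.Size([8])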
Authors:Yanru Sun, Zongxia Xie, Haoyu Xing, Hualong Yu, Qinghua Hu
Abstract:
Time series forecasting (TSF) is an essential branch of machine learning with various applications. Most methods for TSF focus on constructing different networks to extract better information and improve performance. However, practical application data contain different internal mechanisms, resulting in a mixture of multiple patterns. That is, a model's ability to fit different patterns varies, producing different errors. To solve this problem, we propose an end-to-end framework, namely probability pattern-guided time series forecasting (PPGF). PPGF reformulates the TSF problem as a forecasting task guided by probabilistic pattern classification. Firstly, we propose a grouping strategy that approaches forecasting as classification and alleviates the impact of data imbalance on the classifier. Secondly, we predict within the corresponding class interval to guarantee consistency between classification and forecasting. In addition, True Class Probability (TCP) is introduced to pay more attention to difficult samples and improve classification accuracy. In detail, PPGF classifies the different patterns to determine which one the target value may belong to and estimates it accurately in the corresponding interval. To demonstrate the effectiveness of the proposed framework, we conduct extensive experiments on real-world datasets, and PPGF achieves significant performance improvements over several baseline methods. Furthermore, the effectiveness of TCP and the necessity of consistency between classification and forecasting are demonstrated in the experiments. All data and codes are available online: https://github.com/syrGitHub/PPGF.
Chinese: 本文提出PPGF框架,将时间序列预测重构为概率模式分类任务,通过分组策略和真实类别概率处理混合数据模式,实验证明其有效提升了预测性能。
English: The authors propose PPGF, an end-to-end framework that reformulates time series forecasting as a probabilistic pattern classification task to handle mixed data patterns and improve accuracy through grouping strategies and true class probability.
Authors:Tanqiu Jiang, Changjiang Li, Fenglong Ma, Ting Wang
Abstract:
Differentially private diffusion models (DPDMs) harness the remarkable generative capabilities of diffusion models while enforcing differential privacy (DP) for sensitive data. However, existing DPDM training approaches often suffer from significant utility loss, large memory footprint, and expensive inference cost, impeding their practical uses. To overcome such limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a novel approach that integrates retrieval augmented generation (RAG) into DPDM training. Specifically, RAPID leverages available public data to build a knowledge base of sample trajectories; when training the diffusion model on private data, RAPID computes the early sampling steps as queries, retrieves similar trajectories from the knowledge base as surrogates, and focuses on training the later sampling steps in a differentially private manner. Extensive evaluation using benchmark datasets and models demonstrates that, with the same privacy guarantee, RAPID significantly outperforms state-of-the-art approaches by large margins in generative quality, memory footprint, and inference cost, suggesting that retrieval-augmented DP training represents a promising direction for developing future privacy-preserving generative models. The code is available at: https://github.com/TanqiuJiang/RAPID
中文: RAPID提出了一种检索增强的差分隐私扩散模型方法,利用公共数据提升训练效率,在同等隐私保护下显著提高了生成质量并降低了内存和计算成本。
English: RAPID introduces a retrieval-augmented approach to differentially private diffusion models, leveraging public data to enhance training efficiency and significantly improving generative quality while reducing memory and computational costs under the same privacy guarantees.
Authors:Haoyuan Wu, Haisheng Zheng, Yuan Pu, Bei Yu
Abstract:
Understanding the structure and function of circuits is crucial for electronic design automation (EDA). Circuits can be formulated as And-Inverter graphs (AIGs), enabling efficient implementation of representation learning through graph neural networks (GNNs). Masked modeling paradigms have been proven effective in graph representation learning. However, masking augmentation to original circuits will destroy their logical equivalence, which is unsuitable for circuit representation learning. Moreover, existing masked modeling paradigms often prioritize structural information at the expense of abstract information such as circuit function. To address these limitations, we introduce MGVGA, a novel constrained masked modeling paradigm incorporating masked gate modeling (MGM) and Verilog-AIG alignment (VGA). Specifically, MGM preserves logical equivalence by masking gates in the latent space rather than in the original circuits, subsequently reconstructing the attributes of these masked gates. Meanwhile, large language models (LLMs) have demonstrated an excellent understanding of the Verilog code functionality. Building upon this capability, VGA performs masking operations on original circuits and reconstructs masked gates under the constraints of equivalent Verilog codes, enabling GNNs to learn circuit functions from LLMs. We evaluate MGVGA on various logic synthesis tasks for EDA and show the superior performance of MGVGA compared to previous state-of-the-art methods. Our code is available at https://github.com/wuhy68/MGVGA.
中文: 本文提出MGVGA,一种新颖的约束掩码建模范式,通过在潜在空间进行掩码门建模并结合Verilog-AIG对齐来保持电路表示学习中的逻辑等价性,在EDA任务中展现出优越性能。
English: The paper introduces MGVGA, a novel constrained masked modeling paradigm that preserves logical equivalence in circuit representation learning by combining masked gate modeling in latent space and Verilog-AIG alignment, demonstrating superior performance in EDA tasks.
Authors:Zhiyuan Liu, Yanchen Luo, Han Huang, Enzhi Zhang, Sihang Li, Junfeng Fang, Yaorui Shi, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua
Abstract:
3D molecule generation is crucial for drug discovery and material design. While prior efforts focus on 3D diffusion models for their benefits in modeling continuous 3D conformers, they overlook the advantages of 1D SELFIES-based Language Models (LMs), which can generate 100% valid molecules and leverage the billion-scale 1D molecule datasets. To combine these advantages for 3D molecule generation, we propose a foundation model -- NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation. NExT-Mol uses an extensively pretrained molecule LM for 1D molecule generation, and subsequently predicts the generated molecule's 3D conformers with a 3D diffusion model. We enhance NExT-Mol's performance by scaling up the LM's model size, refining the diffusion neural architecture, and applying 1D to 3D transfer learning. Notably, our 1D molecule LM significantly outperforms baselines in distributional similarity while ensuring validity, and our 3D diffusion model achieves leading performances in conformer prediction. Given these improvements in 1D and 3D modeling, NExT-Mol achieves a 26% relative improvement in 3D FCD for de novo 3D generation on GEOM-DRUGS, and a 13% average relative gain for conditional 3D generation on QM9-2014. Our codes and pretrained checkpoints are available at https://github.com/acharkq/NExT-Mol.
中文:NExT-Mol模型结合预训练的一维语言模型生成有效分子与三维扩散模型预测构象,在基准数据集上实现了从头生成和条件性三维分子生成的显著提升。
English: The NExT-Mol model integrates a pretrained 1D language model for generating valid molecules with a 3D diffusion model for predicting conformers, achieving significant improvements in both de novo and conditional 3D molecule generation on benchmark datasets.
Authors:Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang
Abstract:
Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning. The code will be released at https://github.com/Sunmmyy/OTPR.git.
中文: 本文提出OTPR方法,通过最优传输理论将扩散策略与强化学习相结合,有效提升了模仿学习的鲁棒性和性能,尤其在复杂和稀疏奖励环境中表现优异。
English: This paper introduces OTPR, a novel method that integrates diffusion policies with reinforcement learning using optimal transport theory to enhance robustness and performance in imitation learning, particularly in complex and sparse-reward environments.
Authors:Emily Jin, Joy Hsu, Jiajun Wu
Abstract:
State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desideratum for state classification models to generalize to novel queries with few examples. To this end, we propose PHIER, which leverages predicate hierarchies to generalize effectively in few-shot scenarios. PHIER uses an object-centric scene encoder, self-supervised losses that infer semantic relations between predicates, and a hyperbolic distance metric that captures hierarchical structure; it learns a structured latent space of image-predicate pairs that guides reasoning over state classification queries. We evaluate PHIER in the CALVIN and BEHAVIOR robotic environments and show that PHIER significantly outperforms existing methods in few-shot, out-of-distribution state classification, and demonstrates strong zero- and few-shot generalization from simulated to real-world tasks. Our results demonstrate that leveraging predicate hierarchies improves performance on state classification tasks with limited data.
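The hierarchical structure is captured with a hyperbolic metric; below is the generic Poincaré-ball distance commonly used for such embeddings, shown for illustration rather than as PHIER's full scene encoder or training objective.

import torch

def poincare_distance(u, v, eps=1e-5):
    """u, v: points inside the unit ball, shape (..., dim)."""
    sq_dist = (u - v).pow(2).sum(-1)
    u_conf = torch.clamp(1 - u.pow(2).sum(-1), min=eps)   # conformal factors
    v_conf = torch.clamp(1 - v.pow(2).sum(-1), min=eps)
    x = 1 + 2 * sq_dist / (u_conf * v_conf)
    return torch.acosh(torch.clamp(x, min=1 + eps))

u = torch.randn(4, 16) * 0.1    # keep points well inside the ball
v = torch.randn(4, 16) * 0.1
print(poincare_distance(u, v))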
Authors:Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken
Abstract:
As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
中文: EquiBench作为通过等价性检查评估大语言模型程序语义理解能力的新基准,揭示了模型常依赖语法相似性而非深层语义推理的局限性。
English: EquiBench is a novel benchmark that assesses large language models' understanding of program semantics through equivalence checking, revealing their limited reasoning capabilities as they often rely on syntactic cues rather than deep semantic analysis.
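For intuition, equivalence checking can be contrasted with simple differential testing: random inputs can refute equivalence but never prove it, which is why EquiBench derives its labels from program analysis, compiler scheduling, and superoptimization. The harness below is only an illustration, not the benchmark's labeling pipeline.

import random

def probably_equivalent(prog_a, prog_b, trials=1000, lo=-1000, hi=1000):
    """Return (False, witness) if a counterexample is found, else (True, None)."""
    for _ in range(trials):
        x = random.randint(lo, hi)
        if prog_a(x) != prog_b(x):
            return False, x
    return True, None

prog_a = lambda x: x * 2
prog_b = lambda x: x << 1        # equivalent for Python integers
print(probably_equivalent(prog_a, prog_b))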
Authors:Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
中文: MAT-Steer是一种新颖的引导框架,通过稀疏正交的引导向量对多属性进行选择性干预以减少冲突,在问答和生成任务中均优于现有方法。
English: MAT-Steer is a novel framework that enables selective intervention on multiple attributes in large language models by learning sparse, orthogonal steering vectors to reduce conflicts, outperforming existing methods in both QA and generative tasks.
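A minimal sketch of multi-attribute steering at the representation level: one learned vector per attribute, a token-level gate deciding where to intervene, and an orthogonality penalty to reduce inter-attribute conflict. The module and gating form are assumptions, not the MAT-Steer release.

import torch
import torch.nn as nn

class MultiAttributeSteer(nn.Module):
    def __init__(self, hidden_dim, n_attributes):
        super().__init__()
        self.vectors = nn.Parameter(torch.randn(n_attributes, hidden_dim) * 0.01)
        self.gates = nn.Linear(hidden_dim, n_attributes)       # token-level selectivity

    def forward(self, hidden):                                 # hidden: (batch, seq, dim)
        gate = torch.sigmoid(self.gates(hidden))               # (batch, seq, n_attr)
        return hidden + gate @ self.vectors                    # add gated steering vectors

    def orthogonality_penalty(self):
        v = nn.functional.normalize(self.vectors, dim=-1)
        off_diag = v @ v.T - torch.eye(v.shape[0])
        return off_diag.pow(2).sum()                           # push attribute vectors apart

steer = MultiAttributeSteer(hidden_dim=64, n_attributes=3)
h = torch.randn(2, 10, 64)
print(steer(h).shape, steer.orthogonality_penalty().item())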
Authors:Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah
Abstract:
Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42 \times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation achieving a $1.14 \times$ speedup over the current systems without compromising accuracy. Code: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SparAMX
中文: 本研究利用英特尔CPU的高级矩阵扩展和非结构化稀疏技术,在线性层和注意力计算中均实现了延迟显著降低,同时保持模型精度。
English: This work accelerates large language models on Intel CPUs using Advanced Matrix Extensions and unstructured sparsity, achieving significant latency reductions in both linear and attention layers while maintaining accuracy.
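The "replace every linear layer" workflow can be sketched as below; SparseLinear here is a plain masked-dense stand-in used only for illustration, whereas the repository ships AMX-backed sparse kernels.

import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    def __init__(self, dense: nn.Linear, sparsity=0.5):
        super().__init__()
        w = dense.weight.data
        threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
        # zero out the smallest-magnitude weights to reach the target sparsity
        self.register_buffer("weight", torch.where(w.abs() > threshold, w, torch.zeros_like(w)))
        self.bias = dense.bias

    def forward(self, x):
        return nn.functional.linear(x, self.weight, self.bias)

def sparsify(model: nn.Module, sparsity=0.5):
    """Recursively replace every nn.Linear with the sparse stand-in."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, SparseLinear(child, sparsity))
        else:
            sparsify(child, sparsity)
    return model

mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
print(sparsify(mlp)(torch.randn(4, 64)).shape)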
Authors:Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng
Abstract:
Out-of-distribution generalization is a common problem that expects the model to perform well on distributions that differ, even substantially, from the training data. A popular approach to addressing this issue is invariant learning (IL), in which the model is compelled to focus on invariant features instead of spurious features by adding strong constraints during training. However, strong invariant constraints carry potential pitfalls. Due to the limited number of diverse environments and over-regularization in the feature space, they may cause a loss of important details in the invariant features while alleviating spurious correlations, an issue we call over-invariance, which can also degrade generalization performance. We theoretically define over-invariance and observe that this issue occurs in various classic IL methods. To alleviate this issue, we propose a simple approach, Diverse Invariant Learning (DivIL), which adds unsupervised contrastive learning and a random masking mechanism to compensate for the invariant constraints and can be applied to various IL methods. Furthermore, we conduct experiments across multiple modalities on 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of our DivIL framework. Our code is available at https://github.com/kokolerk/DivIL.
中文: 该研究指出过度不变性是导致重要特征细节丢失并削弱泛化性能的潜在问题,提出了通过对比学习和随机掩码机制来补偿约束的DivIL框架,并在多个数据集和模型上验证了其有效性。
English: The study identifies over-invariance as a pitfall in invariant learning methods that can degrade generalization by causing loss of important feature details, and proposes the DivIL framework to mitigate this issue through contrastive learning and random masking, validated across multiple datasets and models.
Authors:Riting Xia, Huibo Liu, Anchen Li, Xueyan Liu, Yan Zhang, Chunxu Zhang, Bo Yang
Abstract:
Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are not robust to missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve more accurate and representative results. In this paper, we conduct a comprehensive review of the literature on incomplete graph learning. Initially, we categorize incomplete graphs and provide precise definitions of relevant concepts, terminologies, and techniques, thereby establishing a solid understanding for readers. Subsequently, we classify incomplete graph learning methods according to the types of incompleteness: (1) attribute-incomplete graph learning methods, (2) attribute-missing graph learning methods, and (3) hybrid-absent graph learning methods. By systematically classifying and summarizing incomplete graph learning methods, we highlight the commonalities and differences among existing approaches, aiding readers in selecting methods and laying the groundwork for further advancements. In addition, we summarize the datasets, incomplete processing modes, evaluation metrics, and application domains used by current methods. Lastly, we discuss the current challenges and propose future directions for incomplete graph learning, with the aim of stimulating further innovations in this crucial field. To our knowledge, this is the first review dedicated to incomplete graph learning, aiming to offer valuable insights for researchers in related fields. We developed an online resource to follow relevant research based on this review, available at https://github.com/cherry-a11y/Incomplete-graph-learning.git
中文摘要:本文首次系统综述了不完整图学习领域,通过分类方法、总结应用并展望未来方向,为该领域研究提供重要参考。
English Summary: This paper presents the first comprehensive review of incomplete graph learning, categorizing methods by incompleteness types and discussing challenges to advance this emerging field.
Authors:Krishan Rana, Robert Lee, David Pershouse, Niko Suenderhauf
Abstract:
Recent advances in imitation learning, particularly using generative modelling techniques like diffusion, have enabled policies to capture complex multi-modal action distributions. However, these methods often require large datasets and multiple inference steps for action generation, posing challenges in robotics where the cost for data collection is high and computation resources are limited. To address this, we introduce IMLE Policy, a novel behaviour cloning approach based on Implicit Maximum Likelihood Estimation (IMLE). IMLE Policy excels in low-data regimes, effectively learning from minimal demonstrations and requiring 38\% less data on average to match the performance of baseline methods in learning complex multi-modal behaviours. Its simple generator-based architecture enables single-step action generation, improving inference speed by 97.3\% compared to Diffusion Policy, while outperforming single-step Flow Matching. We validate our approach across diverse manipulation tasks in simulated and real-world environments, showcasing its ability to capture complex behaviours under data constraints. Videos and code are provided on our project page: https://imle-policy.github.io/.
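The core IMLE objective can be sketched as follows: for each demonstrated action, several candidates are generated from noise and only the nearest one is pulled toward the demonstration. The generator shape, latent size, and sample count are illustrative assumptions, not the IMLE Policy architecture.

import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(16 + 8, 128), nn.ReLU(), nn.Linear(128, 4))

def imle_loss(obs, actions, n_samples=10):
    """obs: (batch, 16), actions: (batch, 4)."""
    batch = obs.shape[0]
    z = torch.randn(batch, n_samples, 8)                        # latent noise codes
    obs_rep = obs.unsqueeze(1).expand(-1, n_samples, -1)
    candidates = generator(torch.cat([obs_rep, z], dim=-1))     # (batch, n_samples, 4)
    dists = (candidates - actions.unsqueeze(1)).pow(2).sum(-1)  # (batch, n_samples)
    nearest = dists.argmin(dim=1)                               # pick the closest sample
    chosen = candidates[torch.arange(batch), nearest]
    return (chosen - actions).pow(2).sum(-1).mean()

loss = imle_loss(torch.randn(32, 16), torch.randn(32, 4))
loss.backward()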
Authors:Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi
Abstract:
We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: https://github.com/batu-el/understanding-inductive-biases-of-gnns
中文:我们引入了注意力图作为图神经网络和图变换器的机制解释工具,通过聚合注意力矩阵分析信息流动,发现学习到的图结构与输入图相关性弱,且在异配性图中不同变换器变体可通过不同信息流模式实现相近性能。
English: Attention Graphs are introduced as a tool for interpreting Graph Neural Networks and Graph Transformers by analyzing aggregated attention matrices to reveal information flow, with findings showing learned structures often diverge from input graphs and diverse patterns yield similar performance in heterophilous tasks.
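A minimal sketch of building an attention graph by averaging attention matrices over heads and layers; simple averaging is one possible aggregation rule, used here for illustration rather than as the paper's exact procedure.

import torch

def attention_graph(attn_maps):
    """attn_maps: list of (heads, n_nodes, n_nodes) tensors, one per layer.
    Returns an (n_nodes, n_nodes) weighted adjacency describing information flow."""
    per_layer = [a.mean(dim=0) for a in attn_maps]      # average over heads
    graph = torch.stack(per_layer).mean(dim=0)          # average over layers
    return graph / graph.sum(dim=-1, keepdim=True)      # row-normalize like attention

layers, heads, n = 4, 8, 16
maps = [torch.softmax(torch.randn(heads, n, n), dim=-1) for _ in range(layers)]
print(attention_graph(maps).shape)                      # torch.Size([16, 16])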
Authors:Petar Steinberg, Juliusz Ziomek, Matej Jusup, Ilija Bogunovic
Abstract:
We address the problem of optimising the average payoff for a large number of cooperating agents, where the payoff function is unknown and treated as a black box. While standard Bayesian Optimisation (BO) methods struggle with the scalability required for high-dimensional input spaces, we demonstrate how leveraging the mean-field assumption on the black-box function can transform BO into an efficient and scalable solution. Specifically, we introduce MF-GP-UCB, a novel efficient algorithm designed to optimise agent payoffs in this setting. Our theoretical analysis establishes a regret bound for MF-GP-UCB that is independent of the number of agents, contrasting sharply with the exponential dependence observed when naive BO methods are applied. We evaluate our algorithm on a diverse set of tasks, including real-world problems, such as optimising the location of public bikes for a bike-sharing programme, distributing taxi fleets, and selecting refuelling ports for maritime vessels. Empirical results demonstrate that MF-GP-UCB significantly outperforms existing benchmarks, offering substantial improvements in performance and scalability, constituting a promising solution for mean-field, black-box optimisation. The code is available at https://github.com/petarsteinberg/MF-BO.
中文: 我们提出MF-GP-UCB算法,通过利用平均场假设实现高维黑盒优化的可扩展性,在理论层面获得与智能体数量无关的遗憾界,并在共享单车调度等实际应用中显著超越现有基准方法。
English: We introduce MF-GP-UCB, a scalable Bayesian optimization algorithm that leverages mean-field assumptions to efficiently optimize agent payoffs in high-dimensional black-box settings, achieving regret bounds independent of agent count and outperforming benchmarks in real-world applications.
Authors:Jake Vasilakes, Chrysoula Zerva, Sophia Ananiadou
Abstract:
Many existing approaches for learning from labeled data assume the existence of gold-standard labels. According to these approaches, inter-annotator disagreement is seen as noise to be removed, either through refinement of annotation guidelines, label adjudication, or label filtering. However, annotator disagreement can rarely be totally eradicated, especially on more subjective tasks such as sentiment analysis or hate speech detection where disagreement is natural. Therefore, a new approach to learning from labeled data, called data perspectivism, seeks to leverage inter-annotator disagreement to learn models that stay true to the inherent uncertainty of the task by treating annotations as opinions of the annotators, rather than gold-standard facts. Despite this conceptual grounding, existing methods under data perspectivism are limited to using disagreement as the sole source of annotation uncertainty. To expand the possibilities of data perspectivism, we introduce Subjective Logic Encodings (SLEs), a flexible framework for constructing classification targets that explicitly encodes annotations as opinions of the annotators. Based on Subjective Logic Theory, SLEs encode labels as Dirichlet distributions and provide principled methods for encoding and aggregating various types of annotation uncertainty -- annotator confidence, reliability, and disagreement -- into the targets. We show that SLEs are a generalization of other types of label encodings as well as how to estimate models to predict SLEs using a distribution matching objective.
中文: 数据透视主义将标注者分歧视为有价值信息而非噪声,而提出的主观逻辑编码框架通过将多种标注不确定性纳入分类目标,扩展了这一方法。
English: Data perspectivism treats annotator disagreement as valuable information rather than noise, and the proposed Subjective Logic Encodings framework expands this approach by incorporating multiple sources of annotation uncertainty into classification targets.
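A minimal sketch of encoding annotator opinions as Dirichlet concentration parameters, optionally weighted by reliability; the evidence weighting in Subjective Logic Encodings is richer than this illustration.

import torch

def dirichlet_target(annotations, n_classes, reliability=None, prior=1.0):
    """annotations: list of class indices from different annotators.
    Returns Dirichlet concentration parameters alpha of shape (n_classes,)."""
    alpha = torch.full((n_classes,), prior)
    for i, label in enumerate(annotations):
        weight = 1.0 if reliability is None else reliability[i]
        alpha[label] += weight                    # each opinion adds weighted evidence
    return alpha

alpha = dirichlet_target([0, 0, 2], n_classes=3, reliability=[0.9, 0.6, 0.4])
print(alpha, alpha / alpha.sum())                 # concentration and expected label distribution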
Authors:Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Silin Liao, Zhibo Jin, Flora D. Salim, Huaming Chen
Abstract:
Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage their generator and discriminator modules to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the potential of the methods. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN. Moreover, PAR-AdvGAN significantly accelerates adversarial example generation, achieving speeds of up to 335.5 frames per second on the Inception-v3 model and outperforming gradient-based transferable attack algorithms. Our code is available at: https://github.com/LMBTough/PAR
Chinese: 提出的PAR-AdvGAN方法在渐进式生成网络中引入自回归迭代机制,相比现有方法能生成攻击能力更强、速度更快的对抗样本。
English: The proposed PAR-AdvGAN method introduces an auto-regressive iteration mechanism within a progressive generation network to create adversarial examples with superior attack capability and faster generation speed compared to existing approaches.
Authors:Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
Abstract:
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer .
中文:MUDD连接提出了一种动态方法,通过生成位置特定和输入流依赖的权重来增强Transformer中的跨层信息流,以极少的参数和计算量显著提升了语言建模性能。
English: MUDD connections introduce a dynamic method to enhance cross-layer information flow in Transformers by generating position-specific and input-stream-dependent weights, significantly improving performance in language modeling with minimal added parameters and computation.
Authors:Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
Abstract:
This paper presents Model-guidance (MG), a novel objective for training diffusion models that addresses and removes the need for the commonly used Classifier-free guidance (CFG). Our approach goes beyond modeling solely the data distribution to also incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is simple yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieves exceptional quality that parallels and even surpasses concurrent diffusion models with CFG. Extensive experiments demonstrate its effectiveness, efficiency, and scalability across different models and datasets. Finally, we establish state-of-the-art performance on the ImageNet 256 benchmark with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
中文: 本文提出模型引导(MG)这一创新目标,通过引入条件后验概率取代了常用的无分类器引导(CFG),在加速训练和推理的同时实现了更优的图像生成质量,并在ImageNet 256基准测试中取得了当前最佳性能。
English: This paper introduces Model-guidance (MG), a novel training objective for diffusion models that replaces Classifier-free guidance (CFG) by incorporating conditional posterior probabilities, achieving faster training, doubled inference speed, and superior image quality with state-of-the-art results on benchmarks.
Authors:Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
Abstract:
Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .
中文: LaM-SLidE通过引入标识符表示,在潜在空间中实现可追踪的实体建模,弥合了动态系统需求与图像/视频生成技术效率之间的差距,并在多个领域展现出卓越的速度、准确性和泛化能力。
English: LaM-SLidE introduces identifier representations to enable traceable entity modeling in latent space, bridging the gap between dynamical system requirements and the efficiency of image/video generation techniques while demonstrating superior speed, accuracy, and generalizability across domains.
Authors:Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke
Abstract:
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at \$1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from \$50 bug fixes to \$32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
中文: SWE-Lancer是一个包含1400多个自由职业软件工程任务的基准,总价值100万美元,用于评估AI模型在技术和管理方面的能力,目前模型仍难以解决大部分任务,并开源资源以促进未来研究。
English: SWE-Lancer is a benchmark of over 1,400 freelance software engineering tasks valued at $1 million, evaluating both technical and managerial capabilities of AI models, which currently struggle to solve most tasks, with open-source resources provided for future research.
Authors:Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
Abstract:
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.
中文: APB是一种高效的长上下文推理框架,通过多主机近似注意力和优化并行机制显著提升预填充速度,在保持任务性能的同时实现高达9.2倍的加速效果。
English: APB is an efficient long-context inference framework that accelerates prefill speed through multi-host approximate attention and optimized parallelism, achieving up to 9.2x faster inference without performance loss.
Authors:Jiayang Zhang, Xianyuan Liu, Wei Wu, Sina Tabakhi, Wenrui Fan, Shuo Zhou, Kang Lan Tee, Tuck Seng Wong, Haiping Lu
Abstract:
Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits to form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time-consuming and require highly purified proteins. To efficiently classify stoichiometry classes in proteins, we curate a new dataset and propose an interpretable, data-driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
Chinese: 本研究提出了一种基于可解释机器学习的流程,用于高效分类病毒样颗粒(VLPs)的化学计量比,这对疫苗优化至关重要,同时识别出影响VLPs组装的关键蛋白质特征。
English: This study introduces an interpretable, machine learning-based pipeline for efficiently classifying the stoichiometry of virus-like particles (VLPs), which is crucial for vaccine optimization, while also identifying key protein features influencing VLP assembly.
Authors:Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
Abstract:
Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning can be achieved by solving a series of independent and self-contained subquestions. These subquestions are essentially \textit{atomic questions}, exhibiting the memoryless property similar to Markov processes. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a simplified question that maintains answer equivalence with the original problem. This answer preservation enables the iterative \textit{decomposition-contraction} process to naturally form a meaningful Markov reasoning process. Furthermore, these atomic states can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an \textbf{80.6\%} F1 score, surpassing o3-mini by \textbf{3.4\%} and DeepSeek-R1 by \textbf{10.6\%}. The code is available at \href{https://github.com/qixucen/atom}{https://github.com/qixucen/atom}.
Chinese: 提出的Atom of Thoughts (AoT)方法通过将复杂问题分解为原子子问题并压缩为简化形式,有效提升大语言模型的推理能力,在多个基准测试中作为独立框架和插件增强均表现出卓越性能。
English: The proposed Atom of Thoughts (AoT) method enhances reasoning in Large Language Models by decomposing complex questions into atomic subquestions and contracting them into simplified forms, achieving superior performance across benchmarks as both a standalone framework and plug-in enhancement.
Authors:Xuefeng Li, Haoyang Zou, Pengfei Liu
Abstract:
In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find that they significantly underperform at the 7B scale with supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.
中文: 本研究表明,通过"学习影响度量"方法进行策略性样本选择比单纯扩大数据规模更能提升语言模型的推理能力,仅用1,389个样本就超越了完整数据集的性能表现。
English: This study reveals that strategic sample selection through Learning Impact Measurement (LIM) is more crucial than data volume for enhancing language models' reasoning, achieving superior performance with only 1,389 samples compared to full datasets.
Authors:Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
Abstract:
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent can effectively help LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
中文摘要:DPT-Agent是一种新型语言智能体框架,通过整合快速反应的系统1和基于推理的系统2,实现了自主的实时人机协作,克服了当前基于大语言模型的智能体在延迟和策略推断方面的局限。
English Summary: DPT-Agent is a novel language agent framework that integrates fast System 1 and reasoning-based System 2 processes to enable autonomous real-time human-AI collaboration, overcoming latency and strategy inference limitations of current LLM-based agents.
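To make the System 1 "finite-state machine plus code-as-policy" idea concrete, here is a toy sketch of fast, rule-based action selection that a slower System 2 could reconfigure asynchronously; the states, observations, and actions are hypothetical and do not come from the DPT-Agent codebase.

# Toy FSM in the code-as-policy style: a transition table maps
# (state, observation) -> (next_state, action).
TRANSITIONS = {
    ("idle", "order_received"): ("fetch_ingredient", "go_to_pantry"),
    ("fetch_ingredient", "ingredient_in_hand"): ("cook", "use_stove"),
    ("cook", "dish_ready"): ("serve", "deliver_dish"),
    ("serve", "dish_delivered"): ("idle", "wait"),
}

def step(state, observation):
    """Return (next_state, action); unknown observations keep the current state."""
    return TRANSITIONS.get((state, observation), (state, "wait"))

state = "idle"
for obs in ["order_received", "ingredient_in_hand", "dish_ready", "dish_delivered"]:
    state, action = step(state, obs)
    print(state, action)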
Authors:Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
Abstract:
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
中文摘要:Bitnet.cpp为三元大语言模型推出了优化的推理系统,采用创新的混合精度矩阵乘法库,在实现无损边缘部署的同时,相比全精度基线最高可提升6.25倍推理速度。
English Summary: Bitnet.cpp introduces an optimized inference system for ternary large language models, featuring a novel mixed-precision matrix multiplication library that achieves up to 6.25x speed improvement over full-precision baselines while enabling lossless edge deployment.
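As a rough illustration of what a ternary-weights-with-a-scale scheme computes (a reference sketch in the spirit of I2_S, not the actual Bitnet.cpp kernels, which use packed sub-2-bit storage and lookup tables), the snippet below quantizes a weight matrix to {-1, 0, +1} with a per-row scale and applies it with a plain matmul; the 0.5 quantization threshold is an arbitrary choice for the example.

import numpy as np

def ternary_matmul(x, w_ternary, scale):
    """y = x @ (scale * W) with W constrained to {-1, 0, +1} (per-output-row scale).

    Because the weights are ternary, the inner product reduces to additions and
    subtractions, which is what lookup-table kernels exploit; this reference
    version just uses a dense matmul for clarity.
    """
    return (x @ w_ternary.T) * scale

rng = np.random.default_rng(0)
w_full = rng.normal(size=(16, 64))
scale = np.abs(w_full).mean(axis=1)                                     # one scale per output row
w_ternary = np.sign(w_full) * (np.abs(w_full) > 0.5 * scale[:, None])   # quantize to {-1, 0, 1}
x = rng.normal(size=(4, 64))
print(ternary_matmul(x, w_ternary, scale).shape)                        # (4, 16)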
Authors:Sheng-Yu Huang, Zi-Ting Chou, Yu-Chiang Frank Wang
Abstract:
When performing 3D inpainting using novel-view rendering methods like Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS), how to achieve texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.
Authors:Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather
Abstract:
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.
中文: ToolMaker 是一个自主框架,能够将附带代码库的研究论文转化为大语言模型兼容的工具,使大语言模型无需人工干预即可创建专业软件组件,在复杂计算任务中显著优于现有智能体。
English: ToolMaker is an autonomous framework that converts research papers with code repositories into LLM-compatible tools, enabling large language models to create specialized software components without human intervention and significantly outperforming existing agents in complex computational tasks.
Authors:Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang
Abstract:
The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
中文: 高斯策略的软演员-评论家算法被扩散模型取代,形成名为MaxEntDP的新方法,通过捕捉多模态分布增强了复杂环境中的探索能力和性能表现。
English: The Soft Actor-Critic algorithm is enhanced by replacing its Gaussian policy with a diffusion model, yielding MaxEntDP, which improves exploration and performance in complex environments by capturing multimodal distributions.
Authors:Arnaud Bougaham, Benoît Frénay
Abstract:
Anomaly detection is a crucial step for critical applications such as those in the industrial, medical, or cybersecurity domains. These sectors share the same requirement of handling the different types of classification errors differently. Indeed, even if false positives are acceptable, false negatives are not, because they would reflect a missed detection of a quality issue, a disease, or a cyber threat. To fulfill this requirement, we propose a method that dynamically applies a trustworthy approximated partial AUC ROC loss (tapAUC). A binary classifier is trained to optimize the specific range of the AUC ROC curve that prevents the True Positive Rate (TPR) from reaching 100% while minimizing the False Positive Rate (FPR). The optimal threshold that does not trigger any false negative is then kept and used at the test step. The results show an average TPR of 92.52% at a 20.43% FPR across 6 datasets, representing a TPR improvement of 4.3% for an FPR cost of 12.2% against other state-of-the-art methods. The code is available at https://github.com/ArnaudBougaham/tapAUC.
中文: 本文提出了一种可信赖的近似部分AUC ROC损失方法,通过优化分类器在关键应用中避免漏报,在六个数据集上实现了92.52%的真阳性率和20.43%的假阳性率。
English: This paper introduces a trustworthy approximated partial AUC ROC loss method that optimizes classifiers to prevent false negatives in critical applications, achieving a 92.52% true positive rate at a 20.43% false positive rate across six datasets.
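A small sketch of the thresholding step described in the abstract (keeping the threshold that triggers no false negatives and measuring the FPR it costs); this covers only the decision rule, not the tapAUC training loss, and the toy scores below are made up.

import numpy as np

def zero_fn_threshold(scores, labels):
    """Pick the largest decision threshold that still catches every anomaly.

    scores: higher = more anomalous; labels: 1 = anomaly, 0 = normal.
    Returns the threshold and the false positive rate it incurs.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thr = scores[labels == 1].min()      # any higher threshold would miss an anomaly
    preds = scores >= thr
    fpr = preds[labels == 0].mean()      # cost paid for a 100% true positive rate
    return thr, fpr

thr, fpr = zero_fn_threshold([0.1, 0.4, 0.35, 0.9, 0.8], [0, 0, 1, 1, 1])
print(f"threshold={thr:.2f}, FPR={fpr:.2f}")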
Authors:Jaehyeong Jo, Sung Ju Hwang
Abstract:
Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Codes available at \href{https://github.com/harryjo97/RDLM}{https://github.com/harryjo97/RDLM}.
中文摘要:本文提出了一种用于语言建模的连续扩散模型,该模型结合了分类分布的几何特性,通过在统计流形上建立离散扩散与连续流的联系,超越了现有离散扩散模型并接近自回归模型的性能。
English Summary: This paper introduces a continuous diffusion model for language modeling that leverages the geometry of categorical distributions, establishing a connection between discrete diffusion and continuous flow on statistical manifolds to outperform existing discrete diffusion models and approach autoregressive model performance.
Authors:Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, Shie Mannor
Abstract:
World model (WM) agents enable sample-efficient reinforcement learning by learning policies entirely from simulated experience. However, existing token-based world models (TBWMs) are limited to visual inputs and discrete actions, restricting their adoption and applicability. Moreover, although both intrinsic motivation and prioritized WM replay have shown promise in improving WM performance and generalization, they remain underexplored in this setting, particularly in combination. We introduce Simulus, a highly modular TBWM agent that integrates (1) a modular multi-modality tokenization framework, (2) intrinsic motivation, (3) prioritized WM replay, and (4) regression-as-classification for reward and return prediction. Simulus achieves state-of-the-art sample efficiency for planning-free WMs across three diverse benchmarks. Ablation studies reveal the individual contribution of each component while highlighting their synergy. Our code and model weights are publicly available at https://github.com/leor-c/Simulus.
Chinese: Simulus作为一种模块化基于令牌的世界模型智能体,整合了多模态令牌化、内在动机、优先回放和回归分类方法,在三个不同基准测试中实现了最先进的样本效率。
English: Simulus is a modular token-based world model agent that integrates multi-modality tokenization, intrinsic motivation, prioritized replay, and regression-as-classification, achieving state-of-the-art sample efficiency across three benchmarks.
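For the "regression-as-classification for reward and return prediction" component, a common realization is a two-hot target over a discretized support; the sketch below shows that encoding as one plausible interpretation, without claiming it matches Simulus's exact parameterization (the bin range is arbitrary).

import numpy as np

def two_hot(value, bins):
    """Encode a scalar as weights on the two nearest bins (regression as classification)."""
    value = np.clip(value, bins[0], bins[-1])
    upper = np.clip(np.searchsorted(bins, value), 1, len(bins) - 1)  # first bin >= value
    lower = upper - 1
    w_upper = (value - bins[lower]) / (bins[upper] - bins[lower])
    target = np.zeros(len(bins))
    target[lower], target[upper] = 1.0 - w_upper, w_upper
    return target  # train with cross-entropy; decode via the expected bin value

bins = np.linspace(-5.0, 5.0, 11)   # discretized reward/return support
print(two_hot(1.3, bins))           # mass split between the bins at 1.0 and 2.0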
Authors:Haochen Li, Wanjin Feng, Xin Zhou, Zhiqi Shen
Abstract:
Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks. Source code is available at https://github.com/Alex-HaochenLi/GiFT.
中文摘要:本研究提出吉布斯微调(GiFT)方法,通过从描述-代码联合空间的边际分布中采样,解决了代码生成中的代表性不足问题,显著提升了大型语言模型在复杂基准测试中的表现。
English Summary: The study introduces Gibbs Fine-Tuning (GiFT), a self-training method that addresses the underrepresentation in code generation by sampling from the marginal distribution of the joint description-code space, enhancing LLM performance on challenging benchmarks.
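A schematic of the Gibbs-style alternation GiFT builds on, alternating between sampling code given a description and a description given code so that self-generated pairs explore the joint description-code space; generate() below is a hypothetical stub standing in for an LLM sampling call, and the prompts and round count are illustrative only.

def generate(prompt: str) -> str:
    # Stub for an LLM sampling call; replace with a real model to use this sketch.
    return f"<sample conditioned on: {prompt[:40]}...>"

def gibbs_self_training_pairs(seed_description: str, n_rounds: int = 4):
    """Alternate code|description and description|code sampling (Gibbs-style)."""
    pairs, description = [], seed_description
    for _ in range(n_rounds):
        code = generate(f"Write code for: {description}")                  # code | description
        description = generate(f"Describe what this code does:\n{code}")   # description | code
        pairs.append((description, code))
    return pairs  # later rounds approximate draws from the joint space

print(len(gibbs_self_training_pairs("sort a list of integers")))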
Authors:Ivo Gollini Navarrete, Nicolas Mauricio Cuadrado, Jose Renato Restom, Martin TakáÄ, Samuel Horváth
Abstract:
Pruning offers a promising solution to mitigate the associated costs and environmental impact of deploying large deep neural networks (DNNs). Traditional approaches rely on computationally expensive trained models or time-consuming iterative prune-retrain cycles, undermining their utility in resource-constrained settings. To address this issue, we build upon the established principles of saliency (LeCun et al., 1989) and connection sensitivity (Lee et al., 2018) to tackle the challenging problem of one-shot pruning neural networks (NNs) before training (PBT) at initialization. We introduce Fisher-Taylor Sensitivity (FTS), a computationally cheap and efficient pruning criterion based on the empirical Fisher Information Matrix (FIM) diagonal, offering a viable alternative for integrating first- and second-order information to identify a model's structurally important parameters. Although the FIM-Hessian equivalency only holds for convergent models that maximize the likelihood, recent studies (Karakida et al., 2019) suggest that, even at initialization, the FIM captures essential geometric information of parameters in overparameterized NNs, providing the basis for our method. Finally, we demonstrate empirically that layer collapse, a critical limitation of data-dependent pruning methodologies, is easily overcome by pruning within a single training epoch after initialization. We perform experiments on ResNet18 and VGG19 with CIFAR-10 and CIFAR-100, widely used benchmarks in pruning research. Our method achieves competitive performance against state-of-the-art techniques for one-shot PBT, even under extreme sparsity conditions. Our code is made available to the public.
中文摘要:本文提出Fisher-Taylor敏感性(FTS)这一高效剪枝标准,通过利用经验Fisher信息矩阵识别结构重要参数,可在训练前实现一次性神经网络剪枝,避免了昂贵的迭代计算过程。
English Summary: This paper introduces Fisher-Taylor Sensitivity (FTS), an efficient pruning criterion that enables one-shot neural network pruning before training by leveraging the empirical Fisher Information Matrix to identify structurally important parameters without costly iterative cycles.
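The following sketch shows a saliency score in the spirit of the abstract, combining the weight with the empirical Fisher diagonal (squared gradients) and pruning globally by that score at initialization; the exact FTS formula is not reproduced here, so treat the score, the single-batch estimate, and the tiny model below as assumptions for illustration.

import torch

def fisher_taylor_scores(model, loss_fn, batches):
    """Per-parameter saliency: sum over batches of (grad * weight)^2 (illustrative)."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.grad.detach() * p.detach()) ** 2
    return scores

def prune_by_score(model, scores, sparsity=0.9):
    """Zero out the globally lowest-scoring fraction of parameters (one-shot)."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.quantile(flat, sparsity)
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.mul_((scores[n] > threshold).float())

# Tiny example at initialization with random data (hypothetical shapes).
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
batches = [(torch.randn(32, 20), torch.randint(0, 5, (32,)))]
scores = fisher_taylor_scores(model, torch.nn.functional.cross_entropy, batches)
prune_by_score(model, scores, sparsity=0.9)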
Authors:Morgan Byrd, Jackson Crandell, Mili Das, Jessica Inman, Robert Wright, Sehoon Ha
Abstract:
Numerous real-world control problems involve dynamics and objectives affected by unobservable hidden parameters, ranging from autonomous driving to robotic manipulation, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hidden-parameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize transition and reward functions. Existing approaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden parameters as additional variance and often struggle to effectively handle HIP-MDP problems, especially when the rewards are parameterized by hidden variables. We introduce PrivilegedDreamer, a model-based reinforcement learning framework that extends the existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features a novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and domain adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.
Authors:Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
Abstract:
Sparse Autoencoders (SAEs) show potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
中文摘要:稀疏自编码器能够从大语言模型中提取可解释特征,在安全关键任务中超越基线方法,并通过优化配置实现跨模型与跨任务的泛化能力。
English Summary: Sparse Autoencoders effectively extract interpretable features from Large Language Models, surpassing baseline methods in safety-critical tasks and enabling cross-model and cross-task generalization with optimized configurations.
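A tiny end-to-end illustration of the "binarize SAE activations, then fit a linear classifier" recipe; the activations here are random stand-ins (a real pipeline would extract them from an SAE attached to an LLM layer), and the 0.5 threshold, shapes, and labels are arbitrary assumptions for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for SAE activations from an LLM layer: (n_examples, n_sae_features).
acts_train, acts_test = rng.exponential(0.2, (500, 1024)), rng.exponential(0.2, (200, 1024))
y_train, y_test = rng.integers(0, 2, 500), rng.integers(0, 2, 200)

# Binarize continuous SAE activations at a threshold instead of selecting features.
threshold = 0.5
x_train = (acts_train > threshold).astype(float)
x_test = (acts_test > threshold).astype(float)

clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print("accuracy:", clf.score(x_test, y_test))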
Authors:Andrii Krutsylo
Abstract:
Continual learning is the process of training machine learning models on a sequence of tasks where data distributions change over time. A well-known obstacle in this setting is catastrophic forgetting, a phenomenon in which a model drastically loses performance on previously learned tasks when learning new ones. A popular strategy to alleviate this problem is experience replay, in which a subset of old samples is stored in a memory buffer and replayed with new data. Despite continual learning advances focusing on which examples to store and how to incorporate them into the training loss, most approaches assume that sampling from this buffer is uniform by default.
We challenge the assumption that uniform sampling is necessarily optimal. We conduct an experiment in which the memory buffer updates the same way in every trial, but the replay probability of each stored sample changes between trials based on different random weight distributions. Specifically, we generate 50 different non-uniform sampling probability weights for each trial and compare their final accuracy to the uniform sampling baseline. We find that there is always at least one distribution that significantly outperforms the baseline across multiple buffer sizes, models, and datasets. These results suggest that more principled adaptive replay policies could yield further gains. We discuss how exploiting this insight could inspire new research on non-uniform memory sampling in continual learning to better mitigate catastrophic forgetting.
The code supporting this study is available at $\href{https://github.com/DentonJC/memory-sampling}{https://github.com/DentonJC/memory-sampling}$.
中文: 本研究挑战了持续学习中经验回放的均匀采样假设,证明非均匀采样策略在不同设置下始终优于基线,表明自适应回放策略能更有效地缓解灾难性遗忘问题。
English: This study challenges the uniform sampling assumption in continual learning's experience replay, demonstrating that non-uniform sampling strategies consistently outperform the baseline across various settings, suggesting adaptive replay policies could better mitigate catastrophic forgetting.
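The experimental manipulation can be sketched in a few lines: the buffer is fixed across trials while only the per-sample replay probabilities change. The Dirichlet draw below is one plausible way to generate random non-uniform weights and is an assumption, as is the surrounding skeleton (the buffer contents and training step are omitted).

import numpy as np

rng = np.random.default_rng(0)
buffer_size, batch_size, num_trials = 200, 32, 50

# One fixed memory buffer; only the replay probabilities differ between trials.
for trial in range(num_trials):
    # Random non-uniform sampling weights over the stored samples (one draw per trial).
    weights = rng.dirichlet(np.ones(buffer_size))
    # Replay a batch according to these weights instead of uniform probabilities.
    replay_idx = rng.choice(buffer_size, size=batch_size, replace=True, p=weights)
    # ... train on buffer[replay_idx] mixed with the new task's data ...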
Authors:Yanran Wu, Inez Hua, Yi Ding
Abstract:
Large language models (LLMs) offer powerful capabilities but come with significant environmental impact, particularly in carbon emissions. Existing studies benchmark carbon emissions but lack a standardized basis for comparison across different model configurations. To address this, we introduce the concept of functional unit (FU) as a standardized basis and develop FUEL, the first FU-based framework for evaluating LLM serving's environmental impact. Through three case studies, we uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice, paving the way for more sustainable LLM serving. The code is available at https://github.com/jojacola/FUEL.
中文: 研究者提出基于功能单元的FUEL框架,为大型语言模型的环境影响评估建立统一标准,并通过案例研究揭示模型配置优化对降低碳排放的关键作用。
English: The authors propose FUEL, a functional unit-based framework to standardize environmental impact assessments of large language models, demonstrating through case studies how optimizing model configurations can reduce carbon emissions.
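The functional-unit idea can be illustrated with simple arithmetic; the choice of "emissions per one million generated tokens" as the FU and the numbers below are hypothetical, not taken from the paper.

def emissions_per_functional_unit(total_kgco2e, tokens_served, fu_tokens=1_000_000):
    """Normalize serving emissions by a functional unit (here: 1M generated tokens)."""
    return total_kgco2e * fu_tokens / tokens_served

# Two hypothetical serving configurations compared on the same functional unit.
print(emissions_per_functional_unit(12.0, 40_000_000))  # config A: 0.30 kgCO2e per 1M tokens
print(emissions_per_functional_unit(9.0, 25_000_000))   # config B: 0.36 kgCO2e per 1M tokens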
Authors:Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
Abstract:
Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.
Chinese: 本研究通过知识回路演化的视角,揭示了大语言模型获取新知识受其与已有知识相关性影响,遵循从深层到浅层的模式,并经历形成到优化的阶段转变,为改进持续预训练策略提供了理论依据。
English: This study investigates how Large Language Models structurally embed new knowledge through knowledge circuit evolution, revealing that acquisition depends on relevance to existing knowledge, follows a deep-to-shallow pattern, and shifts from formation to optimization, offering insights to improve continual pre-training strategies.
Authors:Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Abstract:
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts subsequent token prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.
中文摘要:提出的ReLearn方法通过数据增强和微调有效实现大语言模型的定向遗忘,同时保持输出质量,优于会破坏语言连贯性的反向优化方法。
English Summary: The proposed ReLearn method effectively achieves targeted unlearning in large language models through data augmentation and fine-tuning while maintaining output quality, outperforming reverse optimization approaches that compromise linguistic coherence.
Authors:Bohan Lyu, Siqiao Huang, Zichen Liang, Qi-An Sun, Jiaming Zhang
Abstract:
Neural surrogate models have emerged as powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks. We investigate a novel application: using LLMs as surrogate models for code execution prediction. Given LLMs' unique ability to understand and process diverse programs, they present a promising direction for building general-purpose surrogate models. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive empirical analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes, with implications for automated software testing, program analysis, and computational resource optimization in data mining applications. Code and dataset are released at https://github.com/Imbernoulli/SURGE.
中文摘要:本研究通过SURGE基准系统评估大语言模型在代码执行预测中作为神经代理模型的可行性,涵盖多语言编程、竞赛题目等八大维度,对21个模型的测试揭示了其在计算过程中替代作用的重要潜力。
English Summary: The study introduces SURGE, a benchmark evaluating whether large language models (LLMs) can effectively serve as neural surrogate models for code execution prediction across diverse programming scenarios, revealing key insights about their feasibility through comprehensive testing of 21 models.
Authors:Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, Yiyan Qi
Abstract:
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single-agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. MasRouter employs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high-performing, achieving a $1.8\%\sim8.2\%$ improvement over the state-of-the-art method on MBPP; (2) economical, reducing overhead by up to $52.07\%$ compared to SOTA methods on HumanEval; and (3) plug-and-play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by $17.21\%\sim28.17\%$ via customized routing. The code is available at https://github.com/yanweiyue/masrouter.
中文摘要:MasRouter提出了一种多智能体系统统一路由框架,通过协作模式决策、角色分配和大语言模型路由的级联控制,在提升系统性能的同时大幅降低了运行开销。
English Summary: MasRouter introduces a unified routing framework for multi-agent systems that optimizes collaboration modes, role allocation, and LLM selection to enhance performance while significantly reducing computational costs.
Authors:Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang
Abstract:
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors has raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employs topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. The code is available at https://github.com/wslong20/G-safeguard.
中文: G-Safeguard是一种基于拓扑引导的安全框架,通过图神经网络检测异常并实施拓扑干预,能有效提升基于大语言模型的多智能体系统对抗各类攻击的鲁棒性,同时保持与不同系统架构的兼容性。
English: G-Safeguard is a topology-guided security framework that uses graph neural networks to detect anomalies and apply topological interventions, effectively enhancing the robustness of LLM-based multi-agent systems against various attacks while maintaining compatibility with diverse system architectures.
Authors:Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, Zaiwen Wen
Abstract:
Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than those of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. Our dataset is publicly available at https://github.com/AuroraLHL/OptMATH.
中文摘要:OptMATH框架通过自动生成可控复杂度的数据并验证其与自然语言的对应关系,解决了高质量优化建模数据集稀缺的问题,使大语言模型在多个基准测试中实现卓越性能。
English Summary: The OptMATH framework addresses the scarcity of high-quality optimization modeling datasets by automatically generating controllable complexity data with verified natural language correspondences, enabling LLMs to achieve superior performance across benchmarks.
Authors:Zhao Wang, Sota Moriyama, Wei-Yao Wang, Briti Gangopadhyay, Shingo Takamatsu
Abstract:
Recent advancements in LLM-based multi-agent (LLM-MA) systems have shown promise, yet significant challenges remain in managing communication and refinement when agents collaborate on complex tasks. In this paper, we propose \textit{Talk Structurally, Act Hierarchically (TalkHier)}, a novel framework that introduces a structured communication protocol for context-rich exchanges and a hierarchical refinement system to address issues such as incorrect outputs, falsehoods, and biases. \textit{TalkHier} surpasses various types of SoTA, including inference scaling model (OpenAI-o1), open-source multi-agent models (e.g., AgentVerse), and majority voting strategies on current LLM and single-agent baselines (e.g., ReAct, GPT4o), across diverse tasks, including open-domain question answering, domain-specific selective questioning, and practical advertisement text generation. These results highlight its potential to set a new standard for LLM-MA systems, paving the way for more effective, adaptable, and collaborative multi-agent frameworks. The code is available https://github.com/sony/talkhier.
中文:提出的TalkHier框架通过结构化通信和分层优化机制解决了多智能体系统协作中的关键问题,在多项任务中超越现有先进模型,为高效协作确立了新标准。
English: The proposed TalkHier framework introduces structured communication and hierarchical refinement to overcome collaboration challenges in LLM-based multi-agent systems, outperforming state-of-the-art models across diverse tasks and setting a new standard for effective multi-agent collaboration.
Authors:Shaoxuan Xu, Menglu Cui, Chengxiang Huang, Hongfa Wang, Di Hu
Abstract:
Multimodal learning has gained attention for its capacity to integrate information from different modalities. However, it is often hindered by the multimodal imbalance problem, where one modality dominates while others remain underutilized. Although recent studies have proposed various methods to alleviate this problem, they lack comprehensive and fair comparisons. In this paper, we systematically categorize various mainstream multimodal imbalance algorithms into four groups based on the strategies they employ to mitigate imbalance. To facilitate a comprehensive evaluation of these methods, we introduce BalanceBenchmark, a benchmark including multiple widely used multidimensional datasets and evaluation metrics from three perspectives: performance, imbalance degree, and complexity. To ensure fair comparisons, we have developed a modular and extensible toolkit that standardizes the experimental workflow across different methods. Based on the experiments using BalanceBenchmark, we have identified several key insights into the characteristics and advantages of different method groups in terms of performance, balance degree and computational complexity. We expect this analysis to inspire more efficient approaches to the imbalance problem in the future, including in foundation models. The code of the toolkit is available at https://github.com/GeWu-Lab/BalanceBenchmark.
中文: 本文提出BalanceBenchmark,一个用于系统评估多模态不平衡缓解方法的工具包和基准测试,揭示了不同方法在性能与效率方面的关键特征。
English: This paper introduces BalanceBenchmark, a comprehensive toolkit and benchmark for systematically evaluating multimodal imbalance mitigation methods, revealing key insights into their performance and efficiency.
Authors:Mingqian Ma, Guoqing Liu, Chuan Cao, Pan Deng, Tri Dao, Albert Gu, Peiran Jin, Zhao Yang, Yingce Xia, Renqian Luo, Pipi Hu, Zun Wang, Yuan-Jyue Chen, Haiguang Liu, Tao Qin
Abstract:
Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".
Authors:Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing. W. Y NG, Jianjun Zhang
Abstract:
Data augmentation is an important technique in training deep neural networks as it enhances their ability to generalize and remain robust. While data augmentation is commonly used to expand the sample size and act as a consistency regularization term, there is a lack of research on the relationship between them. To address this gap, this paper introduces a more comprehensive mathematical framework for data augmentation. Through this framework, we establish that the expected risk of the shifted population is the sum of the original population risk and a gap term, which can be interpreted as a consistency regularization term. The paper also provides a theoretical understanding of this gap, highlighting its negative effects on the early stages of training. We also propose a method to mitigate these effects. To validate our approach, we conducted experiments using the same data augmentation techniques and computing resources under several scenarios, including standard training, out-of-distribution, and imbalanced classification. The results demonstrate that our method surpasses the compared methods in all scenarios in terms of generalization ability and convergence stability. We provide our code implementation at the following link: https://github.com/ydlsfhll/ASPR.
中文摘要:本文提出了一个数学框架,揭示数据增强在扩大样本量和作为一致性正则化项方面的双重作用,并提出了一种在多种场景下超越现有方法的新方法。
English Summary: This paper introduces a mathematical framework revealing data augmentation's dual role in expanding sample size and serving as a consistency regularization term, proposing a method that outperforms existing approaches across multiple scenarios.
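Schematically, the decomposition described in the abstract can be written as follows (an assumption-level paraphrase; the paper's precise definitions of the shifted population and the gap term may differ):

\[
R_{\text{aug}}(f)
= \mathbb{E}_{(x,y)}\,\mathbb{E}_{T}\big[\ell(f(T(x)), y)\big]
= \underbrace{\mathbb{E}_{(x,y)}\big[\ell(f(x), y)\big]}_{\text{original population risk}}
+ \underbrace{\mathbb{E}_{(x,y)}\,\mathbb{E}_{T}\big[\ell(f(T(x)), y) - \ell(f(x), y)\big]}_{\text{gap term (consistency regularization)}}
\]

where $T$ is a random augmentation and $\ell$ the loss. The gap term vanishes for augmentation-invariant predictors, which is why it behaves like a consistency regularizer; the paper analyzes its negative effect early in training and how to mitigate it.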
Authors:Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Jilin Hu, Chenjuan Guo, Bin Yang
Abstract:
Multivariate Time Series Forecasting (MTSF) plays a crucial role across diverse fields, ranging from economics and energy to traffic. In recent years, deep learning has demonstrated outstanding performance in MTSF tasks. In MTSF, modeling the correlations among different channels is critical, as leveraging information from other related channels can significantly improve the prediction accuracy of a specific channel. This study systematically reviews the channel modeling strategies for time series and proposes a taxonomy organized into three hierarchical levels: the strategy perspective, the mechanism perspective, and the characteristic perspective. On this basis, we provide a structured analysis of these methods and conduct an in-depth examination of the advantages and limitations of different channel strategies. Finally, we summarize and discuss some future research directions to provide useful research guidance. Moreover, we maintain an up-to-date Github repository (https://github.com/decisionintelligence/CS4TS) which includes all the papers discussed in the survey.
中文: 本研究系统回顾了多元时间序列预测中的通道建模策略,提出三层分类框架并分析不同方法的优劣,同时总结了未来研究方向并维护了相关GitHub资源库。
English: This study systematically reviews and categorizes channel modeling strategies in multivariate time series forecasting into three hierarchical perspectives, analyzing their advantages and limitations while outlining future research directions and maintaining a relevant GitHub repository.
Authors:Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong
Abstract:
Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first designed the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on these, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available at https://github.com/PKUDigitalHealth/HeartLang.
中文: 本研究提出HeartLang框架,将心搏视为单词、节律视为句子进行心电语言自监督学习,在六个数据集上表现优异,解决了以往方法忽视心电信号形态节律特征的问题。
English: This study introduces HeartLang, a self-supervised learning framework that treats heartbeats as words and rhythms as sentences to learn ECG representations, demonstrating strong performance across six datasets while addressing limitations of prior methods that overlooked ECG-specific characteristics.
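As a rough illustration of the "heartbeats as words" idea (not the released QRS-Tokenizer; the peak-detection rule, sampling rate, and beat width are assumptions), one can cut an ECG trace into per-beat segments around detected R-peaks and treat the ordered beats as a sentence:

import numpy as np
from scipy.signal import find_peaks

def ecg_to_sentence(ecg, fs=500, beat_halfwidth=0.3):
    # crude R-peak detection; a real QRS tokenizer would use QRS morphology
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs),
                          height=np.mean(ecg) + 2 * np.std(ecg))
    half = int(beat_halfwidth * fs)
    words = [ecg[max(0, p - half): p + half] for p in peaks]  # one "word" per beat
    return words  # the ordered beats form the rhythm-level "sentence"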
Authors:Haiquan Qiu, You Wu, Dong Li, Jianmin Guo, Quanming Yao
Abstract:
Model merging enables powerful capabilities in neural networks without requiring additional training. In this paper, we introduce a novel perspective on model merging by leveraging the fundamental mechanisms of neural network representation. Our approach is motivated by the linear representation hypothesis, which states that neural networks encode information through linear combinations of feature vectors. We propose a method that superposes task-specific features from individual models into a merged model. Our approach specifically targets linear transformation matrices, which are crucial for feature activation and extraction in deep networks. By formulating the merging process as a linear system, we can preserve task-specific features from individual models and create merged models that maintain multi-task capabilities more effectively than existing methods. Extensive experiments across diverse benchmarks and models demonstrate that our method outperforms existing techniques. Code is available at https://github.com/LARS-research/STF.
中文: 本文提出一种新颖的模型融合方法,基于线性表示假说,通过针对变换矩阵叠加任务特定特征,在多个基准测试中相比现有技术实现了更优的多任务性能。
English: This paper presents a novel model merging method that leverages the linear representation hypothesis to superpose task-specific features by targeting transformation matrices, achieving superior multi-task performance across benchmarks compared to existing techniques.
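A hedged sketch of the superposition idea for a single linear layer (the construction of the per-task feature matrices is an assumption, not the paper's exact procedure): find one merged weight matrix that reproduces each task model's outputs on that task's representative features by solving a linear least-squares system.

import numpy as np

def merge_linear(W_list, X_list, ridge=1e-6):
    # W_list: task-specific weight matrices, each (d_out, d_in)
    # X_list: representative task features, each (d_in, n_samples)
    d_out, d_in = W_list[0].shape
    A = np.zeros((d_in, d_in))
    B = np.zeros((d_out, d_in))
    for W, X in zip(W_list, X_list):
        A += X @ X.T
        B += W @ X @ X.T
    # the merged W minimizes sum_i ||(W - W_i) X_i||_F^2, i.e. solves W A = B
    return B @ np.linalg.inv(A + ridge * np.eye(d_in))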
Authors:Ahmad Chaddad, Yihang Wu, Yuchen Jiang, Ahmed Bouridane, Christian Desrosiers
Abstract:
Traditional machine learning assumes that training and test sets are derived from the same distribution; however, this assumption does not always hold in practical applications. This distribution disparity can lead to severe performance drops when the trained model is used on new data sets. Domain adaptation (DA) is a machine learning technique that aims to address this problem by reducing the differences between domains. This paper presents simulation-based algorithms of recent DA techniques, mainly related to unsupervised domain adaptation (UDA), where labels are available only in the source domain. Our study compares these techniques with public data sets and diverse characteristics, highlighting their respective strengths and drawbacks. For example, Safe Self-Refinement for Transformer-based DA (SSRT) achieved the highest accuracy (91.6\%) in the Office-31 data set during our simulations; however, the accuracy dropped to 72.4\% in the Office-Home data set when using limited batch sizes. In addition to improving the reader's comprehension of recent techniques in DA, our study also highlights challenges and upcoming directions for research in this domain. The codes are available at https://github.com/AIPMLab/Domain_Adaptation.
中文摘要:本文通过仿真比较了最新的无监督领域自适应技术,揭示了它们在不同数据集上的性能差异,并指出了未来研究面临的挑战。
English Summary: This paper compares recent unsupervised domain adaptation techniques through simulations, highlighting their varying performance across datasets and identifying challenges for future research.
Authors:Md Yousuf Harun, Jhair Gallardo, Christopher Kanan
Abstract:
Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related to these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex Equiangular Tight Frame (ETF) projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers. In experiments, our method excels at both tasks across OOD datasets and DNN architectures. Code for our experiments is available at: https://yousuf907.github.io/ncoodg
Authors:Muhammad Ashad Kabir, Nidita Roy, Md. Ekramul Hossain, Jill Featherston, Sayed Ahmed
Abstract:
Deep learning (DL) techniques have emerged as promising solutions for medical wound tissue segmentation. However, a notable limitation in this field is the lack of publicly available labelled datasets and a standardised performance evaluation of state-of-the-art DL models on such datasets. This study addresses this gap by comprehensively evaluating various DL models for wound tissue segmentation using a novel dataset. We have curated a dataset comprising 147 wound images exhibiting six tissue types: slough, granulation, maceration, necrosis, bone, and tendon. The dataset was meticulously labelled for semantic segmentation employing supervised machine learning techniques. Three distinct labelling formats were developed -- full image, patch, and superpixel. Our investigation encompassed a wide array of DL segmentation and classification methodologies, ranging from conventional approaches like UNet, to generative adversarial networks such as cGAN, and modified techniques like FPN+VGG16. Also, we explored DL-based classification methods (e.g., ResNet50) and machine learning-based classification leveraging DL features (e.g., AlexNet+RF). In total, 82 wound tissue segmentation models were derived across the three labelling formats. Our analysis yielded several notable findings, including identifying optimal DL models for each labelling format based on weighted average Dice or F1 scores. Notably, FPN+VGG16 emerged as the top-performing DL model for wound tissue segmentation, achieving a Dice score of 82.25%. This study provides a valuable benchmark for evaluating wound image segmentation and classification models, offering insights to inform future research and clinical practice in wound care. The labelled dataset created in this study is available at https://github.com/akabircs/WoundTissue.
中文摘要:本研究利用包含六种组织类型的147张伤口图像新数据集,评估了多种深度学习模型在伤口组织分割中的表现,确定FPN+VGG16为最佳模型(Dice得分82.25%),为未来伤口护理研究提供了重要基准。
English Summary: This study evaluates various deep learning models for wound tissue segmentation using a novel dataset of 147 images with six tissue types, identifying FPN+VGG16 as the top-performing model with an 82.25% dice score and providing a benchmark for future wound care research.
Authors:Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, Kobe Guo
Abstract:
In recent years, machine learning algorithms have achieved much success in segmenting lesions across various tissues. However, no single model works well universally across all tissue types. In response to this need, we attempt to train a model that 1) works well on all tissue types, and 2) is still capable of fast inference. To this end, we design our architectures, test multiple existing architectures, compare their results, and settle upon SwinUnet. We document our rationales, successes, and failures. Finally, we propose some further directions that we think are worth exploring. Code: https://github.com/KWFredShi/ULS2023NGKD.git
中文: 研究人员利用SwinUnet架构开发了一种通用病灶分割模型,该模型能在所有组织类型上有效工作并保持快速推理能力,同时公布了研究结果和代码以供进一步探索。
English: Researchers have developed a universal lesion segmentation model using SwinUnet architecture that performs effectively across all tissue types while maintaining fast inference speeds, with findings and code shared for further exploration.
Authors:Aditya Dey, Jonas Kusch, Fadi Al Machot
Abstract:
Long-term time series forecasting is critical in domains such as finance, economics, and energy, where accurate and reliable predictions over extended horizons drive strategic decision-making. Despite the progress in machine learning-based models, the impact of temporal noise in extended lookback windows remains underexplored, often degrading model performance and computational efficiency. In this paper, we propose a novel framework that addresses these challenges by integrating the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) to perform noise reduction and extract robust long-term features. These transformations enable the separation of meaningful temporal patterns from noise in both the time and frequency domains. To complement this, we introduce a lightweight low-rank linear prediction layer that not only reduces the influence of residual noise but also improves memory efficiency. Our approach demonstrates competitive robustness to noisy input, significantly reduces computational complexity, and achieves competitive or state-of-the-art forecasting performance across diverse benchmark datasets. Extensive experiments reveal that the proposed framework is particularly effective in scenarios with high noise levels or irregular patterns, making it well suited for real-world forecasting tasks. The code is available at https://github.com/forgee-master/HADL.
中文: 本文提出了一种结合离散小波变换和离散余弦变换的新框架,用于长期时间序列预测中的噪声消除和特征提取,该框架增强了对噪声输入的鲁棒性和计算效率,并在多个基准数据集上实现了具有竞争力的预测性能。
English: This paper introduces a novel framework combining Discrete Wavelet and Cosine Transforms for noise reduction and feature extraction in long-term time series forecasting, which enhances robustness to noisy inputs and computational efficiency while achieving competitive performance across benchmarks.
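A rough sketch of the denoise-then-predict pipeline described above (the wavelet, threshold, and rank are illustrative choices, not the HADL configuration): wavelet soft-thresholding suppresses noise, a DCT exposes long-term frequency structure, and a rank-r factorized linear map produces the forecast.

import numpy as np
import pywt
from scipy.fft import dct

def denoise_dwt(x, wavelet="db4", level=2, thresh=0.1):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

def low_rank_forecast(x, U, V):
    # U: (horizon, r), V: (r, lookback) -- a rank-r factorization of the predictor
    features = dct(denoise_dwt(x), norm="ortho")
    return U @ (V @ features)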
Authors:Kevin Garcia, Juan Manuel Perez, Yifeng Gao
Abstract:
Recently, there has been a significant advancement in designing Self-Supervised Learning (SSL) frameworks for time series data to reduce the dependency on data labels. Among these works, hierarchical contrastive learning-based SSL frameworks, which learn representations by contrasting data embeddings at multiple resolutions, have gained considerable attention. Due to their ability to gather more information, they exhibit better generalization in various downstream tasks. However, when the time series is significantly long, the computational cost is often much higher than that of other SSL frameworks. In this paper, to address this challenge, we propose an efficient way to train hierarchical contrastive learning models. Inspired by the fact that the data embeddings at different resolutions are highly dependent on one another, we introduce an importance-aware resolution selection based training framework to reduce the computational cost. In the experiment, we demonstrate that the proposed method significantly reduces training time while preserving the original model's integrity in extensive time series classification performance evaluations. Our code can be found at https://github.com/KEEBVIN/IARS
中文: 本文提出了一种高效的层次对比学习训练方法,通过重要性感知的分辨率选择来降低计算成本,同时保持时间序列分类性能。
English: This paper introduces an efficient training method for hierarchical contrastive learning in self-supervised time series analysis, using importance-aware resolution selection to reduce computational costs while maintaining model performance.
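A minimal illustration of importance-aware resolution selection (the importance score and sampling rule are assumptions): instead of contrasting embeddings at every pooling level, sample a few levels with probability proportional to an importance estimate, e.g. a running average of each level's contrastive loss.

import numpy as np

def select_resolutions(importance, k=2, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    p = np.exp(np.asarray(importance, dtype=float) / temperature)
    p /= p.sum()
    return rng.choice(len(importance), size=k, replace=False, p=p)

# usage: compute the hierarchical contrastive loss only at the selected levels
# levels = select_resolutions(running_loss_per_level, k=2)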
Authors:Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
Abstract:
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the $\textbf{LLM decoder}$ shares the same input feature space with $\textbf{diffusion decoders}$ that use the corresponding $\textbf{LLM encoder}$ for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
Authors:Zheng Fang, Lichuan Xiang, Xu Cai, Kaicheng Zhou, Hongkai Wen
Abstract:
ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably across tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. By introducing a computation-aware loss, we encourage control blocks to activate only when they benefit generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, training end-to-end with the computation-aware loss. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design. The code will soon be available at https://github.com/Anonymousuuser/FlexControl.
中文:FlexControl 通过可训练门控机制动态选择去噪过程中的扩散模块,无需人工干预,在提升任务适应性的同时优化了计算效率。
English: FlexControl introduces a trainable gating mechanism to dynamically select diffusion blocks during denoising, eliminating manual selection and improving both adaptability and computational efficiency across tasks.
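A hedged sketch of the block-gating idea (shapes, the straight-through trick, and the gate input are assumptions, not FlexControl's implementation): a tiny gate conditioned on the timestep embedding decides whether a control block's features are added, and the gate probabilities feed a computation-aware penalty.

import torch

class BlockGate(torch.nn.Module):
    def __init__(self, t_dim):
        super().__init__()
        self.gate = torch.nn.Linear(t_dim, 1)

    def forward(self, h, control_feat, t_emb):
        p = torch.sigmoid(self.gate(t_emb))          # activation probability per sample
        hard = (p > 0.5).float()
        g = hard + p - p.detach()                    # straight-through estimator
        return h + g.view(-1, 1, 1, 1) * control_feat, p

# total_loss = diffusion_loss + lambda_c * mean_of_gate_probs  (computation-aware term)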
Authors:Libo Wang
Abstract:
To reduce the cost and consumption of computing resources caused by computational redundancy and delayed reward assignment in long CoT, this research proposes the dynamic chain-of-thought (D-CoT) with adaptive reasoning time and steps. The researcher used a simulation experiment to simulate the integration of D-CoT through Python 3.13 IDLE combined with a Python simulator based on GPTs. At the same time, the researcher used DeepSeek R1 as a control group to test and compare the performance of the D-CoT simulator in processing MIT OpenCourseWare's linear algebra exam questions. Experimental results show that D-CoT outperforms DeepSeek R1's long CoT on three indicators: reasoning time, CoT length (reasoning steps), and token count, achieving a significant reduction in computing resource consumption. In addition, this research has potential value for deep reasoning optimization and can serve as a reference for future dynamic deep reasoning frameworks.
中文: 本研究提出动态思维链(D-CoT),通过自适应推理步骤和时间降低计算成本,在线性代数问题处理中相比DeepSeek R1展现出更优的推理效率与资源节约潜力。
English: This research introduces Dynamic Chain-of-Thought (D-CoT) to reduce computational costs by optimizing reasoning steps and time, demonstrating superior efficiency over DeepSeek R1 in processing linear algebra questions with significant resource savings.
Authors:Wenxuan Guo, Xiuwei Xu, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Abstract:
In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, since in the 3D visual grounding task the 3D scene representation must interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features via cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes over-pruned regions by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves the top inference speed and surpasses the previous fastest method by 100\% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a $+1.13$ lead in Acc@0.5 on ScanRefer, and $+2.6$ and $+3.2$ leads on NR3D and SR3D, respectively. The code is available at \href{https://github.com/GWxuan/TSP3D}{https://github.com/GWxuan/TSP3D}.
中文: 本文提出了一种高效的多级卷积架构用于三维视觉定位,通过文本引导剪枝和基于补全的添加方法,实现了最优的推理速度和精度。
English: This paper introduces an efficient multi-level convolution architecture for 3D visual grounding, utilizing text-guided pruning and completion-based addition to achieve state-of-the-art speed and accuracy.
Authors:Huayi Wang, Zirui Wang, Junli Ren, Qingwei Ben, Tao Huang, Weinan Zhang, Jiangmiao Pang
Abstract:
Traversing risky terrains with sparse footholds poses a significant challenge for humanoid robots, requiring precise foot placements and stable locomotion. Existing learning-based approaches often struggle on such complex terrains due to sparse foothold rewards and inefficient learning processes. To address these challenges, we introduce BeamDojo, a reinforcement learning (RL) framework designed for enabling agile humanoid locomotion on sparse footholds. BeamDojo begins by introducing a sampling-based foothold reward tailored for polygonal feet, along with a double critic to balance the learning process between dense locomotion rewards and sparse foothold rewards. To encourage sufficient trial-and-error exploration, BeamDojo incorporates a two-stage RL approach: the first stage relaxes the terrain dynamics by training the humanoid on flat terrain while providing it with task-terrain perceptive observations, and the second stage fine-tunes the policy on the actual task terrain. Moreover, we implement an onboard LiDAR-based elevation map to enable real-world deployment. Extensive simulation and real-world experiments demonstrate that BeamDojo achieves efficient learning in simulation and enables agile locomotion with precise foot placement on sparse footholds in the real world, maintaining a high success rate even under significant external disturbances.
Authors:Yu-Ang Lee, Ching-Yun Ko, Tejaswini Pedapati, I-Hsin Chung, Mi-Yen Yeh, Pin-Yu Chen
Abstract:
Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). Despite the efficiency, a key challenge in model merging is the seemingly inevitable decrease in task performance as the number of models increases. In this paper, we propose $\mathbf{S}$pectral $\mathbf{T}$runcation $\mathbf{A}$nd $\mathbf{R}$escale (STAR) that aims at mitigating ``merging conflicts'' by truncating small components in the respective spectral spaces, which is followed by an automatic parameter rescaling scheme to retain the nuclear norm of the original matrix. STAR requires no additional inference on original training data and is robust to hyperparameter choice. We demonstrate the effectiveness of STAR through extensive model merging cases on diverse NLP tasks. Specifically, STAR works robustly across varying model sizes, and can outperform baselines by 4.2$\%$ when merging 12 models on Flan-T5. Our code is publicly available at https://github.com/IBM/STAR.
中文: 本文提出的STAR方法通过截断谱空间分量和自动参数重缩放来缓解模型合并中的性能下降问题,在多种自然语言处理任务中无需额外数据即实现稳定性能提升。
English: The paper introduces STAR, a method that reduces performance loss in model merging by truncating spectral components and rescaling parameters, showing robust improvements across NLP tasks without needing extra data or fine-tuning.
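A hedged sketch of spectral truncation and rescaling for one task-specific weight delta (the rank-keeping rule is an illustrative choice): drop small singular values and rescale the rest so that the nuclear norm, i.e. the sum of singular values, is retained.

import numpy as np

def spectral_truncate_rescale(delta, keep_ratio=0.5):
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    k = max(1, int(len(s) * keep_ratio))
    s_trunc = np.zeros_like(s)
    s_trunc[:k] = s[:k]
    s_trunc[:k] *= s.sum() / s_trunc[:k].sum()   # rescale to preserve the nuclear norm
    return (U * s_trunc) @ Vt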
Authors:Sanjiban Choudhury
Abstract:
We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on the ALFWorld benchmark, showing that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.
中文: 我们提出了AgentPRM框架,通过轻量级演员-评论家方法和蒙特卡洛推演来训练LLM智能体持续优化,其变体InversePRM可从演示中学习过程奖励,两者在ALFWorld基准测试中均超越了GPT-4o基线模型。
English: We propose AgentPRM, a scalable framework for training LLM agents to improve continuously via actor-critic methods and Monte Carlo rollouts, requiring minimal RLHF pipeline changes, with its variant InversePRM learning rewards from demonstrations and both outperforming GPT-4o on ALFWorld benchmarks.
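A simplified sketch of Monte Carlo reward targets for a process reward model (the `env.rollout` helper is hypothetical; the actual pipeline differs): from each intermediate state in a trajectory, roll out the current policy several times and use the mean return as that step's value label.

def mc_process_targets(env, policy, trajectory_states, n_rollouts=8):
    targets = []
    for state in trajectory_states:
        returns = [env.rollout(policy, start_state=state) for _ in range(n_rollouts)]
        targets.append(sum(returns) / n_rollouts)   # Monte Carlo value estimate
    return targets  # one reward target per intermediate step, used to train the PRM/critic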
Authors:Lauri Seppäläinen, Mudong Guo, Kai Puolamäki
Abstract:
Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small "proxy set" of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics.
中文:ExplainReduce方法通过优化算法将大量局部解释简化为一个精简的代理简单模型集合,从而为复杂机器学习系统提供可生成的全局解释。
English: ExplainReduce is a method that compiles numerous local explanations into a concise proxy set of simple models, offering a global understanding of complex machine learning systems through efficient optimization.
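An illustrative greedy reduction in the spirit of ExplainReduce (the loss-matrix formulation is an assumption): given L[i, j], the loss of local model j applied to instance i, repeatedly add the model that most reduces the per-instance best loss until k proxy models are chosen.

import numpy as np

def greedy_proxy_set(L, k=5):
    n_instances, n_models = L.shape
    chosen, best = [], np.full(n_instances, np.inf)
    for _ in range(k):
        totals = [np.minimum(best, L[:, j]).sum() for j in range(n_models)]
        j_star = int(np.argmin(totals))
        chosen.append(j_star)
        best = np.minimum(best, L[:, j_star])
    return chosen  # indices of the proxy set of simple local models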
Authors:Aivin V. Solatorio, Rafael Macalaba, James Liounis
Abstract:
Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.
中文: 本文提出一种机器学习框架,利用大语言模型和合成数据自动识别研究论文中的数据集引用,其性能优于现有方法,可提升数据可发现性以支持科学决策。
English: This paper introduces a machine learning framework that automates dataset mention detection in research papers using large language models and synthetic data, outperforming existing methods and enhancing data discoverability for better decision-making.
Authors:Abdelhakim Benechehab, Vasilii Feofanov, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
Abstract:
Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters: feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications. We release the code at https://github.com/abenechehab/AdaPTS.
中文摘要:本研究引入适配器作为特征空间转换方法,使预训练的单变量时间序列基础模型能够有效处理多变量预测任务,显著提升了预测精度和不确定性量化能力。
English Summary: This study introduces adapters as feature-space transformations to enable pre-trained univariate time series foundation models to handle multivariate forecasting tasks effectively, improving both accuracy and uncertainty quantification.
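A hedged sketch of the adapter idea (a plain linear projection is used here; AdaPTS studies richer, partially stochastic variants): project the multivariate series into a latent space, run a frozen univariate foundation model on each latent channel, and map back to feature space. `univariate_fm` is a placeholder for any pretrained univariate forecaster.

import torch

class LinearAdapter(torch.nn.Module):
    def __init__(self, n_features, n_latent, univariate_fm):
        super().__init__()
        self.proj = torch.nn.Linear(n_features, n_latent, bias=False)
        self.inv = torch.nn.Linear(n_latent, n_features, bias=False)
        self.fm = univariate_fm                 # frozen univariate forecaster

    def forward(self, x):                       # x: (batch, time, n_features)
        z = self.proj(x)                        # (batch, time, n_latent)
        z_hat = torch.stack([self.fm(z[..., i]) for i in range(z.shape[-1])], dim=-1)
        return self.inv(z_hat)                  # (batch, horizon, n_features)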
Authors:Laurin Luttmann, Lin Xie
Abstract:
The Mixed-Shelves Picker Routing Problem (MSPRP) is a fundamental challenge in warehouse logistics, where pickers must navigate a mixed-shelves environment to retrieve SKUs efficiently. Traditional heuristics and optimization-based approaches struggle with scalability, while recent machine learning methods often rely on sequential decision-making, leading to high solution latency and suboptimal agent coordination. In this work, we propose a novel hierarchical and parallel decoding approach for solving the min-max variant of the MSPRP via multi-agent reinforcement learning. While our approach generates a joint distribution over agent actions, allowing for fast decoding and effective picker coordination, our method introduces a sequential action selection to avoid conflicts in the multi-dimensional action space. Experiments show state-of-the-art performance in both solution quality and inference speed, particularly for large-scale and out-of-distribution instances. Our code is publicly available at http://github.com/LTluttmann/marl4msprp.
中文摘要:本文提出一种基于多智能体强化学习的层次并行解码方法,通过顺序动作选择避免冲突,在解决最小最大混合货架拣选路径问题时实现了最优解质量和推理速度的突破。
English Summary: This paper introduces a hierarchical parallel decoding method using multi-agent reinforcement learning to efficiently solve the min-max Mixed-Shelves Picker Routing Problem, achieving superior solution quality and inference speed through conflict-free sequential action selection.
Authors:Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Abstract:
Large Language Models (LLMs) have revolutionized natural language processing, setting new standards across a wide range of applications. However, their substantial memory and computational demands make them impractical for deployment on technologically-constrained tiny devices such as wearable devices and Internet-of-Things units. To address this limitation, we introduce EmbBERT-Q, a novel tiny language model specifically designed for tiny devices with stringent memory constraints. EmbBERT-Q achieves state-of-the-art (SotA) accuracy in Natural Language Processing tasks in this scenario, with a total memory footprint (weights and activations) of just 781 kB, representing a 25x reduction in size with respect to SotA models. By combining architectural innovations with hardware-compatible 8-bit quantization, EmbBERT-Q consistently outperforms several baseline models scaled down to a 2 MB memory budget (i.e., the maximum memory typically available in tiny devices), including heavily compressed versions of BERT and MAMBA. Extensive experimental evaluations on both a selected benchmark dataset, TinyNLP, specifically curated to evaluate tiny language models in NLP tasks and real-world scenarios, and the GLUE benchmark demonstrate EmbBERT-Q's ability to deliver competitive accuracy with respect to existing approaches, achieving an unmatched balance between memory and performance. To ensure the complete and immediate reproducibility of all our results, we release all code, scripts, and model checkpoints at https://github.com/RiccardoBravin/tiny-LLM.
中文:EmbBERT-Q是一种专为内存受限微型设备设计的新型微型语言模型,通过架构创新和8位量化技术,仅用781 kB内存占用就实现了最先进的准确率,尺寸缩小了25倍。
English: EmbBERT-Q is a novel tiny language model designed for memory-constrained tiny devices, achieving state-of-the-art accuracy with a 25x size reduction to just 781 kB through architectural innovations and 8-bit quantization.
Authors:Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Abstract:
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Chinese: LLaDA作为一种基于扩散的模型,通过在语言任务中展现竞争力并解决如逆向诅咒等问题,挑战了自回归模型的地位,为大型语言模型提供了可行的替代方案。
English: LLaDA, a diffusion-based model, challenges autoregressive models by demonstrating competitive performance in language tasks and addressing issues like the reversal curse, offering a viable alternative for large language models.
Authors:Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao
Abstract:
Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.
中文摘要:本研究提出X-Boundary方法,通过精确区分安全与有害特征表示并仅消除后者,在保持大语言模型通用能力的同时,将过度拒绝率降低约20%,实现了对多轮越狱攻击的最优防御效果。
English Summary: The study introduces X-Boundary, a novel defense method that enhances LLM robustness against multi-turn jailbreaks by precisely distinguishing and erasing harmful representations while preserving usability and reducing over-refusal by approximately 20%.
Authors:Ishika Agarwal, Dilek Hakkani-Tür
Abstract:
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analysis of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
中文: 本文提出InfluenceNetwork方法,通过小型神经网络高效估算数据影响力值,成本降低高达99%,且性能与传统影响力函数相当。
English: This paper introduces InfluenceNetwork, a method using small neural networks to efficiently estimate data influence values with up to 99% cost reduction while maintaining performance comparable to traditional influence functions.
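A small illustrative InfluenceNetwork (sizes and inputs are assumptions): a tiny MLP regresses the influence of a training example on a validation example from their embeddings, so the expensive influence function only needs to be run on a small seed set to produce training labels.

import torch

class InfluenceNetwork(torch.nn.Module):
    def __init__(self, emb_dim, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * emb_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, train_emb, val_emb):
        # predicted influence of the training example on the validation example
        return self.net(torch.cat([train_emb, val_emb], dim=-1)).squeeze(-1)

# trained with MSE against influence values computed by a standard method on a seed subset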
Authors:Kehan Guo, Yili Shen, Gisela Abigail Gonzalez-Montiel, Yue Huang, Yujun Zhou, Mihir Surve, Zhichun Guo, Prayel Das, Nitesh V Chawla, Olaf Wiest, Xiangliang Zhang
Abstract:
The rapid advent of machine learning (ML) and artificial intelligence (AI) has catalyzed major transformations in chemistry, yet the application of these methods to spectroscopic and spectrometric data, referred to as Spectroscopy Machine Learning (SpectraML), remains relatively underexplored. Modern spectroscopic techniques (MS, NMR, IR, Raman, UV-Vis) generate an ever-growing volume of high-dimensional data, creating a pressing need for automated and intelligent analysis beyond traditional expert-based workflows. In this survey, we provide a unified review of SpectraML, systematically examining state-of-the-art approaches for both forward tasks (molecule-to-spectrum prediction) and inverse tasks (spectrum-to-molecule inference). We trace the historical evolution of ML in spectroscopy, from early pattern recognition to the latest foundation models capable of advanced reasoning, and offer a taxonomy of representative neural architectures, including graph-based and transformer-based methods. Addressing key challenges such as data quality, multimodal integration, and computational scalability, we highlight emerging directions such as synthetic data generation, large-scale pretraining, and few- or zero-shot learning. To foster reproducible research, we also release an open-source repository containing recent papers and their corresponding curated datasets (https://github.com/MINE-Lab-ND/SpectrumML_Survey_Papers). Our survey serves as a roadmap for researchers, guiding progress at the intersection of spectroscopy and AI.
中文摘要:本综述系统梳理了光谱机器学习领域,涵盖从分子到光谱的预测及光谱到分子的推断任务,追溯了该领域从模式识别到先进基础模型的发展历程,并指出了合成数据生成与小样本学习等新兴研究方向。
English Summary: This survey provides a comprehensive review of Spectroscopy Machine Learning (SpectraML), examining forward and inverse tasks while tracing its evolution from pattern recognition to advanced foundation models, and highlights emerging directions like synthetic data generation and few-shot learning.
Authors:Chris Zhuang, Debadyuti Mukherjee, Yingzhou Lu, Tianfan Fu, Ruqi Zhang
Abstract:
Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques are developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm. The code is publicly available at https://github.com/debadyuti23/GradientGA.
Chinese Summary: 梯度遗传算法通过引入目标函数的梯度信息来优化分子设计,相比传统方法显著提升了收敛速度和解决方案质量。
English Summary: The Gradient Genetic Algorithm (Gradient GA) enhances traditional genetic algorithms by incorporating gradient information to guide molecular optimization, significantly improving both convergence speed and solution quality in molecular design.
Authors:Anzo Teh, Mark Jabbour, Yury Polyanskiy
Abstract:
This work applies modern AI tools (transformers) to solving one of the oldest statistical problems: Poisson means under the empirical Bayes (Poisson-EB) setting. In Poisson-EB a high-dimensional mean vector $\theta$ (with iid coordinates sampled from an unknown prior $\pi$) is estimated on the basis of $X=\mathrm{Poisson}(\theta)$. A transformer model is pre-trained on a set of synthetically generated pairs $(X,\theta)$ and learns to do in-context learning (ICL) by adapting to unknown $\pi$. Theoretically, we show that a sufficiently wide transformer can achieve vanishing regret with respect to an oracle estimator who knows $\pi$ as dimension grows to infinity. Practically, we discover that already very small models (100k parameters) are able to outperform the best classical algorithm (non-parametric maximum likelihood, or NPMLE) both in runtime and validation loss, which we compute on out-of-distribution synthetic data as well as real-world datasets (NHL hockey, MLB baseball, BookCorpusOpen). Finally, by using linear probes, we confirm that the transformer's EB estimator appears to internally work differently from either NPMLE or Robbins' estimators.
中文: 本研究应用Transformer人工智能解决泊松经验贝叶斯问题,证明即使小型模型也能在速度和精度上超越传统方法,同时通过独特的内部机制运作。
English: This study employs transformer AI to tackle the Poisson empirical Bayes problem, demonstrating that even small models can surpass traditional methods in speed and accuracy while operating through distinct internal mechanisms.
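For context, Robbins' classical estimator mentioned in the abstract has a simple closed form: $\hat{\theta}(x) = (x+1)\,N(x+1)/N(x)$, where $N(k)$ counts how often the value $k$ appears in the sample. A minimal implementation of this baseline:

import numpy as np

def robbins_estimator(X):
    X = np.asarray(X)
    counts = np.bincount(X, minlength=X.max() + 2)
    N_x = counts[X]                 # N(x) for each observation (always >= 1)
    N_x1 = counts[X + 1]            # N(x + 1)
    return (X + 1) * N_x1 / N_x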
Authors:Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy
Abstract:
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlux models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects -- a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlux generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlux enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research. Project page: https://yuhui-zh15.github.io/CellFlux/.
中文: CellFlux是一种创新的图像生成模型,通过流匹配技术模拟化学和遗传扰动引起的细胞形态变化,在图像质量和生物学预测准确性上相比现有方法实现了显著提升。
English: CellFlux is an innovative image-generative model that simulates cellular morphology changes from chemical and genetic perturbations using flow matching, achieving significant improvements in image quality and biological prediction accuracy over existing methods.
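A minimal conditional flow-matching objective in the spirit of CellFlux (the velocity network and conditioning are placeholders): sample a point on the straight path between an unperturbed image x0 and a perturbed image x1, and regress the predicted velocity onto the constant target x1 - x0.

import torch
import torch.nn.functional as F

def flow_matching_loss(v_net, x0, x1, cond):
    t = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)   # one time per sample
    x_t = (1 - t) * x0 + t * x1                               # point on the linear path
    target = x1 - x0                                          # constant velocity target
    pred = v_net(x_t, t.flatten(), cond)                      # predicted velocity field
    return F.mse_loss(pred, target)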
Authors:Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin
Abstract:
Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks utilize arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack and subsequent initialization gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
Authors:Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan
Abstract:
Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64$\times$ less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3$\times$ and 6$\times$ fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.
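A hedged sketch of a Fourier-style number embedding (the exact periods and scaling used by FoNE may differ; this only illustrates the two-dimensions-per-digit idea): each number is mapped to cos and sin features with periods 10, 100, ..., so every digit is recoverable from a single fixed-size vector.

import numpy as np

def fourier_number_embedding(x, n_digits=6):
    periods = 10.0 ** np.arange(1, n_digits + 1)   # 10, 100, ..., 10^n_digits
    phases = 2 * np.pi * x / periods
    return np.concatenate([np.cos(phases), np.sin(phases)])  # 2 dimensions per digit

# e.g. fourier_number_embedding(123456.0) is a length-12 vector encoding the number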
Authors:Benedikt Alkin, Maurits Bleeker, Richard Kurle, Tobias Kronlachner, Reinhard Sonnleitner, Matthias Dorfer, Johannes Brandstetter
Abstract:
Recent advances in neural surrogate modeling offer the potential for transformative innovations in applications such as automotive aerodynamics. Yet, industrial-scale problems often involve volumetric meshes with cell counts reaching 100 million, presenting major scalability challenges. Complex geometries further complicate modeling through intricate surface-volume interactions, while quantities such as vorticity are highly nonlinear and must satisfy strict divergence-free constraints. To address these requirements, we introduce Anchored-Branched Universal Physics Transformers (AB-UPT) as a novel modeling scheme for building neural surrogates for computational fluid dynamics (CFD) simulations. AB-UPT is designed to: (i) decouple geometry encoding and prediction tasks via multi-branch operators; (ii) enable scalability to high-resolution outputs via neural simulation in a low-dimensional latent space, coupled with anchored neural field decoders to predict high-fidelity outputs; (iii) enforce physics consistency by a novel divergence-free formulation. We show that AB-UPT yields state-of-the-art predictive accuracy of surface and volume fields on automotive CFD simulations ranging from 33 thousand up to 150 million mesh cells. Furthermore, our anchored neural field architecture enables the enforcement of hard physical constraints on the physics predictions without degradation in performance, exemplified by modeling divergence-free vorticity fields. Notably, the proposed models can be trained on a single GPU in less than a day and predict industry-standard surface and volume fields within seconds. Additionally, we show that the flexible design of our method enables neural simulation from a computer-aided design geometry alone, omitting the need for costly CFD meshing procedures.
Chinese: AB-UPT模型提出了一种新颖的计算流体动力学神经代理方法,通过解耦几何编码与预测任务,在单个GPU上实现高效训练的同时强制执行无散度涡量场等物理约束,有效解决了工业级仿真的可扩展性难题。
English: The AB-UPT model introduces a novel neural surrogate approach for computational fluid dynamics that overcomes scalability challenges in industrial-scale simulations by decoupling geometry encoding and prediction, enabling efficient training on a single GPU while enforcing strict physical constraints like divergence-free vorticity fields.
Authors:Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, Zhiqiang Xu
Abstract:
The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: Preference data vary in difficulty, and overly difficult examples hinder alignment by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO.
中文: 本研究挑战了传统上依赖更多数据对齐大语言模型的做法,通过证明过于困难的示例会损害性能,并提出了选择性DPO方法,通过过滤这些示例将对齐效果在胜率上提升了9-16%。
English: The study challenges the conventional approach of using more data for aligning large language models by demonstrating that overly difficult examples hinder performance and introduces Selective DPO, a method that filters such examples to improve alignment outcomes by 9-16% in win rates.
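An illustrative filter in the spirit of Selective DPO (the difficulty proxy used here, a reference-model log-probability margin, is an assumption): score each preference pair by how strongly a reference model already favors the chosen response, drop the hardest fraction, and run standard DPO on the rest.

def filter_by_difficulty(pairs, ref_logprob, drop_frac=0.2):
    # pairs: list of (prompt, chosen, rejected); ref_logprob(prompt, response) -> float
    scored = [(ref_logprob(p, c) - ref_logprob(p, r), (p, c, r)) for p, c, r in pairs]
    scored.sort(key=lambda t: t[0])            # hardest pairs have the smallest margin
    keep_from = int(len(scored) * drop_frac)
    return [pair for _, pair in scored[keep_from:]]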
Authors:Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
Abstract:
We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such a controller is complicated by the intricate contact dynamics of dexterous manipulation and the need for adaptivity, generalizability, and robustness. Current reinforcement learning and trajectory optimization methods often fall short due to their dependence on task-specific rewards or precise system models. We introduce an approach that curates large-scale successful robot tracking demonstrations, comprising pairs of human references and robot actions, to train a neural controller. Utilizing a data flywheel, we iteratively enhance the controller's performance, as well as the number and quality of successful tracking demonstrations. We exploit available tracking demonstrations and carefully integrate reinforcement learning and imitation learning to boost the controller's performance in dynamic environments. At the same time, to obtain high-quality tracking demonstrations, we individually optimize per-trajectory tracking by leveraging the learned tracking controller in a homotopy optimization method. The homotopy optimization, mimicking chain-of-thought, aids in solving challenging trajectory tracking problems to increase demonstration diversity. We showcase our success by training a generalizable neural controller and evaluating it in both simulation and real world. Our method achieves over a 10% improvement in success rates compared to leading baselines. The project website with animated results is available at https://meowuu7.github.io/DexTrack/.
Chinese: 本研究通过整合人类-机器人示范数据并结合模仿学习与同伦优化方法,开发出适用于灵巧操作的通用神经控制器,在真实环境测试中相比主流基线模型成功率提升超过10%。
English: This study develops a generalizable neural controller for dexterous robotic manipulation by curating human-robot demonstration pairs and integrating imitation learning with homotopy optimization, achieving over 10% higher success rates than leading baselines in real-world evaluations.
Authors:Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
Abstract:
We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite
中文: SelfCite是一种自监督方法,通过上下文消融生成奖励信号,引导LLM在推理时优化采样和微调,显著提升句子级引用的质量,在基准测试中F1分数最高提升5.3分。
English: SelfCite is a self-supervised method that enhances LLMs' sentence-level citation accuracy by using context ablation to generate rewards, improving citation quality through sampling and fine-tuning, achieving up to a 5.3-point F1 score increase on benchmarks.
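A hedged sketch of the context-ablation reward (the `logprob` scorer is a placeholder and the exact combination used by SelfCite may differ): necessity is the drop in the response's log-probability when the cited sentences are removed, and sufficiency measures how well the cited sentences alone preserve it.

def citation_reward(logprob, response, full_ctx, cited_only, ctx_without_cited):
    base = logprob(response, full_ctx)
    necessity = base - logprob(response, ctx_without_cited)   # large if the citation is needed
    sufficiency = logprob(response, cited_only) - base        # near zero if the citation suffices
    return necessity + sufficiency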
Authors:Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
Abstract:
Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize and adhere to user preferences in a long-context conversational setting. PrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PrefEval contains user personalization or preference information in both explicit and implicit forms, and evaluates LLM performance using a generation and a classification task. With PrefEval, we evaluated the aforementioned preference following capabilities of 10 open-source and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods. Our benchmarking effort reveals that state-of-the-art LLMs face significant challenges in proactively following users' preferences during conversations. In particular, in zero-shot settings, preference following accuracy falls below 10% at merely 10 turns (~3k tokens) across most evaluated models. Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations. Furthermore, we show that fine-tuning on PrefEval significantly improves performance. We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs' preference following abilities, paving the way for personalized conversational agents. Our code and dataset are available at https://prefeval.github.io/.
Authors:Montgomery Bohde, Mrunali Manjrekar, Runzhong Wang, Shuiwang Ji, Connor W. Coley
Abstract:
Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.
中文:DiffMS是一种先进的生成网络,结合了Transformer编码器和离散图扩散解码器,能够根据质谱精确生成分子结构,在从头分子生成任务中超越了现有模型。
English: DiffMS is a state-of-the-art generative network that uses a transformer encoder and discrete graph diffusion decoder to accurately generate molecular structures from mass spectra, outperforming existing models in de novo molecule generation.
Authors:Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
Abstract:
Generative tasks about molecules, including but not limited to molecule generation, are crucial for drug discovery and material design, and have consistently attracted significant attention. In recent years, diffusion models have emerged as an impressive class of deep generative models, sparking extensive research and leading to numerous studies on their application to molecular generative tasks. Despite the proliferation of related work, there remains a notable lack of up-to-date and systematic surveys in this area. Particularly, due to the diversity of diffusion model formulations, molecular data modalities, and generative task types, the research landscape is challenging to navigate, hindering understanding and limiting the area's growth. To address this, this paper conducts a comprehensive survey of diffusion model-based molecular generative methods. We systematically review the research from the perspectives of methodological formulations, data modalities, and task types, offering a novel taxonomy. This survey aims to facilitate understanding and further flourishing development in this area. The relevant papers are summarized at: https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
中文摘要:本文对基于扩散模型的分子生成方法进行了全面综述,从方法框架、数据模态和任务类型角度系统梳理了相关研究,旨在促进该领域的深入理解和蓬勃发展。
English Summary: This paper provides a comprehensive survey of diffusion model-based molecular generative methods, systematically reviewing them through methodological formulations, data modalities, and task types to facilitate understanding and development in this rapidly growing field.
Authors:Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Abstract:
Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.
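For readers who want a concrete picture of the regularization idea, below is a minimal sketch of an equivariance penalty on a convolutional autoencoder's latent map, using 90-degree rotation and 0.5x scaling as the semantic-preserving transforms. The transform set, loss weighting, and the assumption of square inputs are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def eq_regularizer(encoder, x):
    """Penalize non-equivariance of a conv autoencoder's latent map (sketch).

    Assumes square inputs and a latent feature map of shape (B, C, h, w);
    EQ-VAE's actual transform set and weighting may differ.
    """
    z = encoder(x)                       # (B, C, h, w) latent feature map

    # Rotation: encoding a rotated image should match rotating the encoding.
    z_rot = encoder(torch.rot90(x, k=1, dims=(-2, -1)))
    loss_rot = F.mse_loss(z_rot, torch.rot90(z, k=1, dims=(-2, -1)))

    # Scaling: encoding a downscaled image should match a downscaled encoding.
    x_small = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    z_small = encoder(x_small)
    z_down = F.interpolate(z, size=z_small.shape[-2:], mode="bilinear", align_corners=False)
    loss_scale = F.mse_loss(z_small, z_down)

    return loss_rot + loss_scale

# Usage: total_loss = reconstruction_loss + lambda_eq * eq_regularizer(encoder, images)
```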
Authors:Nicholas Dronen, Randall Balestriero
Abstract:
Catastrophic forgetting -- the phenomenon of a neural network learning a task t1 and losing the ability to perform it after being trained on some other task t2 -- is a long-standing problem for neural networks [McCloskey and Cohen, 1989]. We present a method, Eidetic Learning, that provably solves catastrophic forgetting. A network trained with Eidetic Learning -- here, an EideticNet -- requires no rehearsal or replay. We consider successive discrete tasks and show how at inference time an EideticNet automatically routes new instances without auxiliary task information. An EideticNet bears a family resemblance to the sparsely-gated Mixture-of-Experts layer of Shazeer et al. [2016] in that network capacity is partitioned across tasks and the network itself performs data-conditional routing. An EideticNet is easy to implement and train, is efficient, and has time and space complexity linear in the number of parameters. The guarantee of our method holds for normalization layers of modern neural networks during both pre-training and fine-tuning. We show with a variety of network architectures and sets of tasks that EideticNets are immune to forgetting. While the practical benefits of EideticNets are substantial, we believe they can benefit practitioners and theorists alike. The code for training EideticNets is available at https://github.com/amazon-science/eideticnet-training.
中文: 记忆学习法通过将网络容量按任务划分并实现自动数据条件路由,有效解决了神经网络的灾难性遗忘问题,无需回放或复习机制。
English: Eidetic Learning is a method that effectively prevents catastrophic forgetting in neural networks by partitioning capacity across tasks and enabling automatic data-conditional routing without requiring rehearsal or replay.
Authors:Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
Abstract:
In the rapidly evolving field of Natural Language Processing, Large Language Models (LLMs) are tasked with increasingly complex reasoning challenges. Traditional methods like chain-of-thought prompting have shown promise but often fall short in fully leveraging a model's reasoning capabilities. This paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a novel prompting technique designed to improve reasoning through a self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts models to generate and resolve multiple auxiliary questions before tackling the main query, promoting a more thorough exploration of various aspects of a topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models across multiple question-answering datasets, demonstrate that SQuARE significantly surpasses traditional CoT prompts and existing rephrase-and-respond methods. By systematically decomposing queries, SQuARE advances LLM capabilities in reasoning tasks. The code is publicly available at https://github.com/IntelLabs/RAG-FiT/tree/square.
Chinese: 本文提出SQuARE这一新颖的自询问提示技术,通过让模型在回答主问题前生成并解决多个辅助问题来增强推理能力,在Llama 3和GPT-4o的评估中显著超越了传统思维链等方法的性能表现。
English: This paper introduces SQuARE, a novel self-interrogation prompting technique that enhances LLM reasoning by generating and resolving auxiliary questions before addressing main queries, significantly outperforming traditional methods like chain-of-thought in evaluations with Llama 3 and GPT-4o.
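As a concrete illustration of the self-interrogation paradigm, here is a minimal sketch of a SQuARE-style prompt builder. The template wording and the number of auxiliary questions are illustrative; the paper's exact prompt may differ.

```python
def square_prompt(question: str, num_aux: int = 3) -> str:
    """Build a SQuARE-style self-interrogation prompt (illustrative wording)."""
    return (
        f"Main question: {question}\n\n"
        f"Before answering, write {num_aux} auxiliary questions that explore "
        f"different aspects of the main question. Answer each auxiliary question "
        f"in one or two sentences. Then, using those answers, give a final answer "
        f"to the main question prefixed with 'Final answer:'."
    )

# Usage with any chat LLM client (hypothetical interface):
# response = llm.generate(square_prompt("Why does ice float on water?"))
# final = response.split("Final answer:")[-1].strip()
```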
Authors:Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz ZieliÅski, Christin Seifert
Abstract:
The growing interest in eXplainable Artificial Intelligence (XAI) has prompted research into models with built-in interpretability, the most prominent of which are part-prototype models. Part-Prototype Models (PPMs) make decisions by comparing an input image to a set of learned prototypes, providing human-understandable explanations in the form of ``this looks like that''. Despite their inherent interpretability, PPMS are not yet considered a valuable alternative to post-hoc models. In this survey, we investigate the reasons for this and provide directions for future research. We analyze papers from 2019 to 2024, and derive a taxonomy of the challenges that current PPMS face. Our analysis shows that the open challenges are quite diverse. The main concern is the quality and quantity of prototypes. Other concerns are the lack of generalization to a variety of tasks and contexts, and general methodological issues, including non-standardized evaluation. We provide ideas for future research in five broad directions: improving predictive performance, developing novel architectures grounded in theory, establishing frameworks for human-AI collaboration, aligning models with humans, and establishing metrics and benchmarks for evaluation. We hope that this survey will stimulate research and promote intrinsically interpretable models for application domains. Our list of surveyed papers is available at https://github.com/aix-group/ppm-survey.
中文: 本综述分析了2019至2024年的部件原型模型,揭示了原型质量和方法论等核心挑战,并提出五个研究方向以推进这种本质可解释的AI模型发展。
English: This survey analyzes part-prototype models (PPMs) from 2019-2024, identifying key challenges like prototype quality and methodological issues while proposing five research directions to advance these inherently interpretable AI models.
Authors:Jiayang Wu, Wensheng Gan, Philip S. Yu
Abstract:
Predicting drug-gene associations is crucial for drug development and disease treatment. While graph neural networks (GNN) have shown effectiveness in this task, they face challenges with data sparsity and efficient contrastive learning implementation. We introduce a graph diffusion network for drug-gene prediction (GDNDGP), a framework that addresses these limitations through two key innovations. First, it employs meta-path-based homogeneous graph learning to capture drug-drug and gene-gene relationships, ensuring similar entities share embedding spaces. Second, it incorporates a parallel diffusion network that generates hard negative samples during training, eliminating the need for exhaustive negative sample retrieval. Our model achieves superior performance on the DGIdb 4.0 dataset and demonstrates strong generalization capability on tripartite drug-gene-disease networks. Results show significant improvements over existing methods in drug-gene prediction tasks, particularly in handling complex heterogeneous relationships. The source code is publicly available at https://github.com/csjywu1/GDNDGP.
Chinese Summary: 提出的GDNDGP框架通过整合基于元路径的同构图学习和并行扩散网络生成困难负样本,显著提升了药物-基因关联预测性能,在基准数据集上实现了最优效果。
English Summary: The proposed GDNDGP framework enhances drug-gene association prediction by integrating meta-path-based homogeneous graph learning and a parallel diffusion network for hard negative sampling, achieving state-of-the-art performance on benchmark datasets.
Authors:Chen Xu, Yuxin Li, Wenjie Wang, Liang Pang, Jun Xu, Tat-Seng Chua
Abstract:
Group max-min fairness (MMF) is commonly used in fairness-aware recommender systems (RS) as an optimization objective, as it aims to protect marginalized item groups and ensures a fair competition platform. However, our theoretical analysis indicates that integrating MMF constraint violates the assumption of sample independence during optimization, causing the loss function to deviate from linear additivity. Such nonlinearity property introduces the Jensen gap between the model's convergence point and the optimal point if mini-batch sampling is applied. Both theoretical and empirical studies show that as the mini-batch size decreases and the group size increases, the Jensen gap will widen accordingly. Some methods using heuristic re-weighting or debiasing strategies have the potential to bridge the Jensen gap. However, they either lack theoretical guarantees or suffer from heavy computational costs. To overcome these limitations, we first theoretically demonstrate that the MMF-constrained objective can be essentially reformulated as a group-weighted optimization objective. Then we present an efficient and effective algorithm named FairDual, which utilizes a dual optimization technique to minimize the Jensen gap. Our theoretical analysis demonstrates that FairDual can achieve a sub-linear convergence rate to the globally optimal solution and the Jensen gap can be well bounded under a mini-batch sampling strategy with random shuffle. Extensive experiments conducted using six large-scale RS backbone models on three publicly available datasets demonstrate that FairDual outperforms all baselines in terms of both accuracy and fairness. Our data and codes are shared at https://github.com/XuChen0427/FairDual.
中文: 研究表明,推荐系统中的组最大最小公平性会因非线性引入Jensen间隙,并提出了FairDual双重优化算法,能在保证准确性和公平性的同时有效缩小该间隙。
English: The study reveals that group max-min fairness in recommender systems introduces a Jensen gap due to nonlinearity, and proposes FairDual, a dual optimization algorithm that effectively minimizes this gap while ensuring both accuracy and fairness.
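The reformulation of the MMF-constrained objective as a group-weighted objective can be illustrated with a simple mini-batch loop that reweights per-group losses and updates the weights in a dual-ascent fashion. This is a generic sketch under stated assumptions (a `loss_fn` returning per-sample losses), not the FairDual algorithm itself, whose dual update and shuffling scheme differ in detail.

```python
import torch

def group_weighted_step(model, optimizer, batch, group_ids, loss_fn, dual_w, eta=0.05):
    """One mini-batch step of group-weighted training (sketch).

    dual_w: tensor of per-group weights, nudged upward for groups with
    higher average loss so that disadvantaged item groups get more weight.
    """
    losses = loss_fn(model, batch)              # per-sample losses, shape (B,)
    weights = dual_w[group_ids]                 # look up each sample's group weight
    loss = (weights * losses).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Dual-style update: raise weights of groups with larger average loss.
    with torch.no_grad():
        for g in group_ids.unique():
            dual_w[g] += eta * losses[group_ids == g].mean()
        dual_w /= dual_w.sum()                  # keep weights on the simplex
    return loss.item()
```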
Authors:Daniel Koutas, Daniel Hettegger, Kostas G. Papakonstantinou, Daniel Straub
Abstract:
We present a novel method for Deep Reinforcement Learning (DRL), incorporating the convex property of the value function over the belief space in Partially Observable Markov Decision Processes (POMDPs). We introduce hard- and soft-enforced convexity as two different approaches, and compare their performance against standard DRL on two well-known POMDP environments, namely the Tiger and FieldVisionRockSample problems. Our findings show that including the convexity feature can substantially increase performance of the agents, as well as increase robustness over the hyperparameter space, especially when testing on out-of-distribution domains. The source code for this work can be found at https://github.com/Dakout/Convex_DRL.
中文: 本研究提出了一种新颖的深度强化学习方法,通过在部分可观测马尔可夫决策过程中强化价值函数的凸性,显著提升了智能体性能并增强了超参数和分布外测试的鲁棒性。
English: This study introduces a novel Deep Reinforcement Learning method that enforces convexity of the value function in POMDPs, demonstrating significant performance improvements and enhanced robustness across hyperparameters and out-of-distribution domains.
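A "soft-enforced" convexity constraint can be sketched as a penalty on violations of Jensen's inequality over sampled belief pairs. The penalty form, sampling scheme, and weighting below are illustrative rather than the paper's exact hard/soft enforcement.

```python
import torch

def convexity_penalty(value_net, beliefs, num_pairs=64):
    """Soft convexity penalty (sketch): for a convex value function over the
    belief space, V(0.5*(b1+b2)) should not exceed 0.5*(V(b1)+V(b2))."""
    idx1 = torch.randint(0, beliefs.shape[0], (num_pairs,))
    idx2 = torch.randint(0, beliefs.shape[0], (num_pairs,))
    b1, b2 = beliefs[idx1], beliefs[idx2]
    mid = 0.5 * (b1 + b2)
    v_mid = value_net(mid).squeeze(-1)
    v_avg = 0.5 * (value_net(b1) + value_net(b2)).squeeze(-1)
    return torch.relu(v_mid - v_avg).mean()     # penalize only convexity violations

# Usage: loss = td_loss + lambda_cvx * convexity_penalty(value_net, belief_batch)
```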
Authors:Yuankai Luo, Lei Shi, Xiao-Ming Wu
Abstract:
Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies. Conversely, Graph Transformers (GTs) are regarded as superior due to their employment of global attention mechanisms, which potentially mitigate these challenges. Literature frequently suggests that GTs outperform GNNs in graph-level tasks, especially for graph classification and regression on small molecular graphs. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic re-evaluation of three classic GNNs (GCN, GIN, and GatedGCN) enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results reveal that, contrary to prevailing beliefs, these classic GNNs consistently match or surpass the performance of GTs, securing top-three rankings across all datasets and achieving first place in eight. Furthermore, they demonstrate greater efficiency, running several times faster than GTs on many datasets. This highlights the potential of simple GNN architectures, challenging the notion that complex mechanisms in GTs are essential for superior graph-level performance. Our source code is available at https://github.com/LUOyk1999/GNNPlus.
中文: 本研究通过集成六种技术的GNN+框架增强经典图神经网络,证明其在图级任务中不仅能匹配甚至超越图Transformer的性能,且效率更高,从而挑战了复杂GT架构必然优越的普遍认知。
English: This study demonstrates that enhanced classic GNNs, through the GNN+ framework integrating six techniques, can match or exceed Graph Transformers' performance in graph-level tasks while being more efficient, challenging the prevailing superiority of complex GT architectures.
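A compact sketch of a GNN+-style layer helps make the framework concrete: plain message passing wrapped with normalization, dropout, a residual connection, and a feed-forward block. Edge feature integration and positional encoding are omitted for brevity, and the module choices (LayerNorm, dense adjacency) are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class GNNPlusLayer(nn.Module):
    """GCN-style layer augmented with norm, dropout, residual, and FFN,
    in the spirit of the GNN+ framework (sketch)."""
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, adj_norm):
        # x: (N, dim) node features; adj_norm: (N, N) normalized adjacency.
        h = adj_norm @ self.lin(x)                   # message passing
        x = self.norm1(x + self.drop(h))             # residual + normalization
        x = self.norm2(x + self.drop(self.ffn(x)))   # feed-forward block
        return x
```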
Authors:Shihao Zhang, Yuguang Yan, Angela Yao
Abstract:
For deep regression, preserving the ordinality of the targets with respect to the feature representation improves performance across various tasks. However, a theoretical explanation for the benefits of ordinality is still lacking. This work reveals that preserving ordinality reduces the conditional entropy $H(Z|Y)$ of representation $Z$ conditional on the target $Y$. However, our findings reveal that typical regression losses do little to reduce $H(Z|Y)$, even though it is vital for generalization performance. With this motivation, we introduce an optimal transport-based regularizer to preserve the similarity relationships of targets in the feature space to reduce $H(Z|Y)$. Additionally, we introduce a simple yet efficient strategy of duplicating the regressor targets, also with the aim of reducing $H(Z|Y)$. Experiments on three real-world regression tasks verify the effectiveness of our strategies to improve deep regression. Code: https://github.com/needylove/Regression_tightness.
中文: 在深度回归中保持目标的有序性可降低条件熵H(Z|Y)以提升性能,本研究通过引入最优传输正则化和目标复制策略有效减少了该熵,并在三个实际任务中验证了其有效性。
English: Preserving ordinality in deep regression reduces conditional entropy H(Z|Y) to enhance performance, and this work introduces both an optimal transport regularizer and a target duplication strategy to effectively minimize this entropy, validated across three real-world tasks.
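To make the goal of "preserving target similarity in feature space" tangible, here is a simple surrogate regularizer that aligns pairwise feature distances with pairwise target distances. The paper's regularizer is optimal-transport based, so this correlation-style stand-in only illustrates the objective, not the proposed method.

```python
import torch

def ordinality_regularizer(features, targets):
    """Encourage pairwise feature distances to track pairwise target
    distances (illustrative surrogate, not the OT regularizer)."""
    fd = torch.cdist(features, features)               # (B, B) feature distances
    td = torch.cdist(targets.view(len(targets), -1),
                     targets.view(len(targets), -1))   # (B, B) target distances
    fd = (fd - fd.mean()) / (fd.std() + 1e-8)
    td = (td - td.mean()) / (td.std() + 1e-8)
    return -(fd * td).mean()   # maximize correlation between the two distance matrices
```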
Authors:Rubén Pérez-Jove, Cristian R. Munteanu, Alejandro Pazos, Jose Vázquez-Naya
Abstract:
Operating System (OS) fingerprinting is essential for network management and cybersecurity, enabling accurate device identification based on network traffic analysis. Traditional rule-based tools such as Nmap and p0f face challenges in dynamic environments due to frequent OS updates and obfuscation techniques. While Machine Learning (ML) approaches have been explored, Deep Learning (DL) models, particularly Transformer architectures, remain largely unexplored in this domain. This study investigates the application of Tabular Transformer architectures, specifically TabTransformer and FT-Transformer, for OS fingerprinting, leveraging structured network data from three publicly available datasets. Our experiments demonstrate that FT-Transformer generally outperforms traditional ML models, previous approaches and TabTransformer across multiple classification levels (OS family, major, and minor versions). The results establish a strong foundation for DL-based OS fingerprinting, improving accuracy and adaptability in complex network environments. Furthermore, we ensure the reproducibility of our research by providing an open-source implementation.
中文: 本研究首次将表格Transformer架构应用于操作系统指纹识别,其中FT-Transformer在多个分类层级上显著优于传统方法,为复杂网络环境下的精准设备识别建立了新基准。
English: This study introduces Tabular Transformer architectures, particularly FT-Transformer, for OS fingerprinting, demonstrating superior accuracy over traditional methods and enhancing adaptability in complex network environments.
Authors:Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh
Abstract:
Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to 10 percentage points of accuracy improvement over baselines on challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO.
中文摘要:限制LLM输出至严格语法会削弱推理能力,但通过精心设计的附加规则扩充输出语法可在确保正确性的同时保留模型推理能力,CRANE算法在多项测试中显著优于现有方法。
English Summary: Constraining LLM outputs to strict grammars can hinder reasoning, but augmenting grammars with additional rules preserves reasoning capabilities while ensuring correctness, as demonstrated by the CRANE algorithm which significantly outperforms existing methods.
Authors:Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
Abstract:
Concept bottleneck models (CBMs) are inherently interpretable and intervenable neural network models, which explain their final label prediction by the intermediate prediction of high-level semantic concepts. However, they require target task training to learn input-to-concept and concept-to-label mappings, incurring target dataset collections and training resources. In this paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which predict concepts and labels in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, which is composed of millions of vocabulary extracted from the web, to describe arbitrary input in various domains. For the input-to-concept mapping, we introduce concept retrieval, which dynamically finds input-related concepts by the cross-modal search on the concept bank. In the concept-to-label inference, we apply concept regression to select essential concepts from the retrieved concepts by sparse linear regression. Through extensive experiments, we confirm that our Z-CBMs provide interpretable and intervenable concepts without any additional training. Code will be available at https://github.com/yshinya6/zcbm.
中文: 零样本概念瓶颈模型(Z-CBMs)通过从大规模网络概念库中动态检索相关概念,并利用稀疏回归筛选关键概念,实现了无需训练即可进行可解释和可干预的预测。
English: Zero-shot concept bottleneck models (Z-CBMs) enable interpretable predictions by dynamically retrieving relevant concepts from a large-scale web-based concept bank and selecting essential ones through sparse regression, eliminating the need for target task training.
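The two stages described in the abstract, cross-modal concept retrieval followed by sparse concept regression, can be sketched with pre-computed embeddings and an off-the-shelf Lasso solver. The shared embedding space (e.g., from a CLIP-like model), `top_k`, and `alpha` are assumptions of this sketch; Z-CBM's actual retrieval and sparse solver may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def retrieve_and_select_concepts(image_emb, concept_embs, concept_names,
                                 top_k=128, alpha=0.01):
    """Two-stage Z-CBM-style concept extraction (sketch).

    image_emb:    (d,) normalized image embedding
    concept_embs: (M, d) normalized embeddings of concept-bank phrases
    1) Retrieve candidate concepts by cosine similarity.
    2) Sparsely regress the image embedding onto the retrieved concepts;
       nonzero coefficients mark the essential concepts.
    """
    sims = concept_embs @ image_emb              # cosine sims (embeddings pre-normalized)
    top = np.argsort(-sims)[:top_k]              # stage 1: concept retrieval

    reg = Lasso(alpha=alpha, positive=True, max_iter=5000)
    reg.fit(concept_embs[top].T, image_emb)      # stage 2: sparse reconstruction
    weights = reg.coef_

    selected = [(concept_names[top[i]], float(w))
                for i, w in enumerate(weights) if w > 0]
    return sorted(selected, key=lambda t: -t[1])
```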
Authors:Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang Katie Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
Abstract:
Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia, Qwen and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performances across various tasks and different LLM architectures. Our code is available at https://github.com/OptimAI-Lab/RoSTE.
中文摘要:RoSTE是一种创新算法,结合量化感知微调与自适应旋转策略,有效提升大语言模型的低比特量化效果,相比传统方法在不同模型和任务中均表现更优。
English Summary: RoSTE is a novel algorithm that integrates quantization-aware fine-tuning with adaptive rotation to enhance low-bit quantization of LLMs, achieving superior performance across various models and tasks compared to conventional methods.
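The two ingredients named in the abstract, a straight-through estimator for quantization-aware training and a rotation applied before quantization to tame activation outliers, can be sketched as follows. The rotation here is a fixed random orthogonal matrix, whereas RoSTE adaptively searches for an effective rotation configuration; activations and KV caches are left unquantized in this sketch.

```python
import torch

def random_orthogonal(dim, seed=0):
    """Fixed random rotation (RoSTE instead selects the rotation adaptively)."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=g))
    return q

def ste_quantize(w, bits=4):
    """Uniform symmetric quantization with a straight-through estimator:
    forward uses quantized weights, backward passes gradients through."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()     # STE: identity gradient w.r.t. w

def rotated_quantized_linear(x, w, rot, bits=4):
    """Rotate inputs and weights before quantizing (sketch). Since rot is
    orthogonal, (x @ rot) @ (w @ rot).T == x @ w.T up to quantization error,
    but the rotated tensors have fewer extreme outliers."""
    x_r = x @ rot
    w_r = ste_quantize(w @ rot, bits)            # quantize rotated weights
    return x_r @ w_r.T
```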
Authors:Zihao Li, Xiao Lin, Zhining Liu, Jiaru Zou, Ziwei Wu, Lecheng Zheng, Dongqi Fu, Yada Zhu, Hendrik Hamann, Hanghang Tong, Jingrui He
Abstract:
While many advances in time series models focus exclusively on numerical data, research on multimodal time series, particularly those involving contextual textual information commonly encountered in real-world scenarios, remains in its infancy. With recent progress in large language models and time series learning, we revisit the integration of paired texts with time series through the Platonic Representation Hypothesis, which posits that representations of different modalities converge to shared spaces. In this context, we identify that time-series-paired texts may naturally exhibit periodic properties that closely mirror those of the original time series. Building on this insight, we propose a novel framework, Texts as Time Series (TaTS), which considers the time-series-paired texts to be auxiliary variables of the time series. TaTS can be plugged into any existing numerical-only time series models and enable them to handle time series data with paired texts effectively. Through extensive experiments on both multimodal time series forecasting and imputation tasks across benchmark datasets with various existing time series models, we demonstrate that TaTS can enhance predictive performance without modifying model architectures. Code available at https://github.com/iDEA-iSAIL-Lab-UIUC/TaTS.
中文摘要:提出的“文本即时间序列”(TaTS)框架通过将具有周期特性的配对文本视为时间序列的辅助变量,使现有数值时间序列模型无需改动架构即可有效融合文本数据,从而提升预测和填补任务的性能。
English Summary: The proposed Texts as Time Series (TaTS) framework enables existing numerical time series models to effectively incorporate paired textual data by treating texts as auxiliary variables with periodic properties, enhancing performance in forecasting and imputation tasks without architectural changes.
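The plug-in nature of the idea, treating paired texts as auxiliary variables, can be sketched by projecting per-timestep text embeddings to a few extra channels and concatenating them onto the numerical series before any standard forecaster. The embedding source and projection size are placeholders; the paper's fusion pipeline may differ.

```python
import torch
import torch.nn as nn

class TextAsAuxChannels(nn.Module):
    """Append text-derived channels to a multivariate series (sketch of the
    TaTS idea, not the paper's exact module)."""
    def __init__(self, text_dim, aux_channels=4):
        super().__init__()
        self.proj = nn.Linear(text_dim, aux_channels)

    def forward(self, series, text_embs):
        # series:    (B, T, V)  numerical variables
        # text_embs: (B, T, D)  one precomputed text embedding per time step
        aux = self.proj(text_embs)                 # (B, T, aux_channels)
        return torch.cat([series, aux], dim=-1)    # (B, T, V + aux_channels)

# Usage: augmented = TextAsAuxChannels(text_dim=384)(series, text_embs)
#        forecast  = any_numerical_forecaster(augmented)
```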
Authors:Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota
Abstract:
In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for four large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 5600 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel .
Chinese: 最新研究表明,在非完美信息博弈中,简单的策略梯度方法(如PPO)优于基于虚拟博弈、双预言机和反事实遗憾最小化的复杂算法,这一结论基于超过5600次训练运行的大规模可利用性比较得出。
English: Recent research demonstrates that simpler policy gradient methods, such as PPO, outperform more complex approaches based on fictitious play, double oracle, and counterfactual regret minimization in imperfect-information games, as shown by extensive exploitability comparisons across over 5600 training runs.
Authors:Razvan-Gabriel Dumitru, Minglai Yang, Vikas Yadav, Mihai Surdeanu
Abstract:
We introduce CopySpec, a simple yet effective technique to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs or responses that can be verbatim extracted from context. CopySpec identifies repeated sequences in the model's chat history or context and speculates that the same tokens will follow, enabling seamless copying without compromising output quality and without requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using seven LLMs and five datasets: MT-Bench, CNN/DM, GSM8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM8K's self-correction tasks. Importantly, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context size grows, CopySpec leverages larger contexts to accelerate inference, making it a faster complementary solution. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.
中文: CopySpec 是一种通过识别并复用对话历史中的重复序列来加速大语言模型推理的技术,无需额外显存即可实现显著加速且不损失输出质量。
English: CopySpec is a technique that accelerates LLM inference by identifying and reusing repeated sequences from chat history or context, achieving significant speed-ups without extra GPU memory or quality loss.
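The copying heuristic the abstract describes can be sketched as a suffix match over the token history: find an earlier occurrence of the longest suffix of what has been generated so far, and propose the tokens that followed it as a speculative draft for the target model to verify. The match lengths and thresholds are illustrative; the real system's matching and verification are more refined.

```python
def copy_speculate(context_ids, generated_ids, min_match=8, max_copy=32):
    """Propose draft tokens by copying from the context (sketch of the
    CopySpec idea). Falls back to normal decoding when no repeat is found."""
    history = list(context_ids) + list(generated_ids)
    for match_len in range(min(len(generated_ids), 64), min_match - 1, -1):
        suffix = history[-match_len:]
        # Search for an earlier occurrence of this suffix (excluding the end).
        for start in range(len(history) - match_len - 1, -1, -1):
            if history[start:start + match_len] == suffix:
                cont = history[start + match_len:start + match_len + max_copy]
                if cont:
                    return cont   # draft tokens to verify with the target model
    return []                     # no repeated sequence found
```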
Authors:Jocelyn Dzuong
Abstract:
The recent surge in advanced generative models, such as diffusion models and generative adversarial networks (GANs), has led to an alarming rise in AI-generated images across various domains on the web. While such technologies offer benefits such as democratizing artistic creation, they also pose challenges in misinformation, digital forgery, and authenticity verification. Additionally, the uncredited use of AI-generated images in media and marketing has sparked significant backlash from online communities. In response to this, we introduce DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability while users browse the web. Using an ONNX-optimized deep learning model, DejAIvu automatically analyzes images on websites such as Google Images, identifies AI-generated content using model inference, and overlays a saliency heatmap to highlight AI-related artifacts. Our approach integrates efficient in-browser inference, gradient-based saliency analysis, and a seamless user experience, ensuring that AI detection is both transparent and interpretable. We also evaluate DejAIvu across multiple pretrained architectures and benchmark datasets, demonstrating high accuracy and low latency, making it a practical and deployable tool for enhancing AI image accountability. The code for this system can be found at https://github.com/Noodulz/dejAIvu.
Chinese: DejAIvu 是一款 Chrome 浏览器扩展,通过优化的深度学习模型实时检测 AI 生成图像并提供基于显著性的可解释热力图,为增强 AI 图像可信度提供透明高效的解决方案。
English: The DejAIvu Chrome extension uses an optimized deep learning model to detect AI-generated images in real-time and provides explainable saliency heatmaps, offering a transparent and efficient solution for enhancing AI image accountability.
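The gradient-based saliency overlay can be illustrated with a plain PyTorch classifier standing in for the extension's in-browser ONNX model; the class index and normalization are assumptions of this sketch.

```python
import torch

def saliency_heatmap(model, image):
    """Gradient-based saliency for an AI-vs-real image classifier (sketch).

    image: (1, 3, H, W) float tensor. Returns an (H, W) map highlighting
    pixels that most influence the assumed 'AI-generated' logit (index 1).
    """
    image = image.clone().requires_grad_(True)
    score = model(image)[0, 1]                              # assumed AI-generated logit
    score.backward()
    sal = image.grad.abs().max(dim=1).values.squeeze(0)     # max over color channels
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
```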
Authors:Zifan He, Anderson Truong, Yingqi Cao, Jason Cong
Abstract:
The rise of deep neural networks (DNNs) has driven an increased demand for computing power and memory. Modern DNNs exhibit high data volume variation (HDV) across tasks, which poses challenges for FPGA acceleration: conventional accelerators rely on fixed execution patterns (dataflow or sequential) that can lead to pipeline stalls or necessitate frequent off-chip memory accesses. To address these challenges, we introduce the Inter-Task Auto-Reconfigurable Accelerator (InTAR), a novel accelerator design methodology for HDV applications on FPGAs. InTAR combines the high computational efficiency of sequential execution with the reduced off-chip memory overhead of dataflow execution. It switches execution patterns automatically with a static schedule determined before circuit design based on resource constraints and problem sizes. Unlike previous reconfigurable accelerators, InTAR encodes reconfiguration schedules during circuit design, allowing model-specific optimizations that allocate only the necessary logic and interconnects. Thus, InTAR achieves a high clock frequency with fewer resources and low reconfiguration time. Furthermore, InTAR supports high-level tools such as HLS for fast design generation. We implement a set of multi-task HDV DNN kernels using InTAR. Compared with dataflow and sequential accelerators, InTAR exhibits $\mathbf{1.8\times}$ and $\mathbf{7.1 \times}$ speedups correspondingly. Moreover, we extend InTAR to GPT-2 medium as a more complex example, which is $\mathbf{3.65 \sim 39.14\times}$ faster and a $\mathbf{1.72 \sim 10.44\times}$ more DSP efficient than SoTA accelerators (Allo and DFX) on FPGAs. Additionally, this design demonstrates $\mathbf{1.66 \sim 7.17\times}$ better power efficiency than GPUs. Code: https://github.com/OswaldHe/InTAR
中文:InTAR加速器通过在执行过程中动态切换顺序与数据流模式,有效应对深度神经网络中高数据量变化带来的挑战,相比现有FPGA和GPU方案实现了显著的加速效果与能效提升。
English: The InTAR accelerator addresses the challenges of high data volume variation in deep neural networks by dynamically switching between sequential and dataflow execution patterns, achieving significant speedups and improved efficiency over existing FPGA and GPU solutions.
Authors:Christopher Tosh, Boyuan Zhang, Wesley Tansey
Abstract:
Scientists often need to analyze the samples in a study that responded to treatment in order to refine their hypotheses and find potential causal drivers of response. Natural variation in outcomes makes teasing apart responders from non-responders a statistical inference problem. To handle latent responses, we introduce the causal two-groups (C2G) model, a causal extension of the classical two-groups model. The C2G model posits that treated samples may or may not experience an effect, according to some prior probability. We propose two empirical Bayes procedures for the causal two-groups model, one under semi-parametric conditions and another under fully nonparametric conditions. The semi-parametric model assumes additive treatment effects and is identifiable from observed data. The nonparametric model is unidentifiable, but we show it can still be used to test for response in each treated sample. We show empirically and theoretically that both methods for selecting responders control the false discovery rate at the target level with near-optimal power. We also propose two novel estimands of interest and provide a strategy for deriving estimand intervals in the unidentifiable nonparametric model. On a cancer immunotherapy dataset, the nonparametric C2G model recovers clinically-validated predictive biomarkers of both positive and negative outcomes. Code is available at https://github.com/tansey-lab/causal2groups.
中文: C2G模型通过提出经验贝叶斯方法处理潜在治疗反应,在控制误发现率的同时识别应答者,其有效性经理论分析和癌症免疫治疗数据验证。
English: The C2G model addresses latent treatment responses by proposing empirical Bayes methods that control false discovery rates while identifying responders, validated through theoretical analysis and cancer immunotherapy data.
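The two-groups posterior that underlies responder selection is easy to state: given the prior response probability and the null and response densities, each treated sample's posterior probability of having responded follows from Bayes' rule, and thresholding these posteriors controls a Bayesian FDR. The densities below are illustrative; estimating them from data is the empirical Bayes contribution of the paper.

```python
import numpy as np
from scipy import stats

def responder_posterior(z, pi1, f0, f1):
    """Posterior responder probability under z ~ (1 - pi1)*f0 + pi1*f1 (sketch)."""
    num = pi1 * f1(z)
    return num / ((1 - pi1) * f0(z) + num)

def select_responders(post, fdr=0.1):
    """Select samples while keeping the estimated Bayesian FDR below `fdr`."""
    order = np.argsort(-post)
    lfdr_sorted = 1 - post[order]                          # local false discovery rates
    running_fdr = np.cumsum(lfdr_sorted) / np.arange(1, len(post) + 1)
    k = int(np.sum(running_fdr <= fdr))
    return order[:k]

# Illustrative densities (in practice estimated from data):
f0 = stats.norm(0, 1).pdf     # no treatment effect
f1 = stats.norm(2, 1).pdf     # additive treatment effect
z = np.random.default_rng(0).normal(size=200)
post = responder_posterior(z, pi1=0.3, f0=f0, f1=f1)
responders = select_responders(post, fdr=0.1)
```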
Authors:Joshua Omolegan, Pak Hei Yeung, Madeleine K. Wyburd, Linde Hesse, Monique Haak, Intergrowth-21st Consortium, Ana I. L. Namburete, Nicola K. Dinsdale
Abstract:
Monitoring the growth of subcortical regions of the fetal brain in ultrasound (US) images can help identify the presence of abnormal development. Manually segmenting these regions is a challenging task, but recent work has shown that it can be automated using deep learning. However, applying pretrained models to unseen freehand US volumes often leads to a degradation of performance due to the vast differences in acquisition and alignment. In this work, we first demonstrate that test time adaptation (TTA) can be used to improve model performance in the presence of both real and simulated domain shifts. We further propose a novel TTA method by incorporating a normative atlas as a prior for anatomy. In the presence of various types of domain shifts, we benchmark the performance of different TTA methods and demonstrate the improvements brought by our proposed approach, which may further facilitate automated monitoring of fetal brain development. Our code is available at https://github.com/joshuaomolegan/TTA-for-3D-Fetal-Subcortical-Segmentation.
中文摘要:通过引入规范图谱作为解剖先验的测试时自适应方法,有效提升了深度学习模型在不同域偏移下对胎儿脑部皮层下区域超声图像分割的性能,有助于实现胎儿大脑发育的自动化监测。
English Summary: Test time adaptation, enhanced with a normative atlas prior, improves deep learning model performance for segmenting fetal brain subcortical regions in ultrasound images under domain shifts, facilitating automated monitoring of brain development.
Authors:Randolph W. Linderman, Yiran Chen, Scott W. Linderman
Abstract:
Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
中文: 贝叶斯非参数方法与相对马哈拉诺比斯距离评分在分布外检测中存在形式关联,通过引入分层先验的混合模型,在协方差结构差异大和每类数据点较少的情况下显著优于现有方法。
English: Bayesian nonparametric methods are formally linked to the relative Mahalanobis distance score for OOD detection and, when enhanced with hierarchical priors, outperform existing methods in scenarios with varying covariance structures and limited data per class.
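For reference, a sketch of the relative Mahalanobis distance score (RMDS) that the paper connects to Bayesian nonparametric models: the class-conditional Mahalanobis distance (class means with a shared, class-centered covariance) reported relative to a single background Gaussian fit to all training features. Regularization constants are illustrative.

```python
import numpy as np

def rmds_score(x, class_feats, all_feats):
    """Relative Mahalanobis distance score (sketch; lower = more in-distribution).

    class_feats: dict mapping class label -> (n_c, d) training features
    all_feats:   (N, d) all training features (for the background Gaussian)
    """
    # Background Gaussian over all training features.
    mu0 = all_feats.mean(axis=0)
    cov0 = np.cov(all_feats, rowvar=False) + 1e-6 * np.eye(all_feats.shape[1])
    d0 = (x - mu0) @ np.linalg.inv(cov0) @ (x - mu0)

    # Shared covariance of class-centered features.
    centered = np.vstack([f - f.mean(axis=0) for f in class_feats.values()])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(centered.shape[1])
    prec = np.linalg.inv(cov)

    rel = [(x - f.mean(axis=0)) @ prec @ (x - f.mean(axis=0)) - d0
           for f in class_feats.values()]
    return min(rel)    # relative distance to the closest class
```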
Authors:Renqi Jia, Xiaokun Zhang, Bowei He, Qiannan Zhu, Weitao Xu, Jiehao Chen, Chen Ma
Abstract:
User behavior records serve as the foundation for recommender systems. While the behavior data exhibits ease of acquisition, it often suffers from varying quality. Current methods employ data valuation to discern high-quality data from low-quality data. However, they tend to employ black-box design, lacking transparency and interpretability. Besides, they are typically tailored to specific evaluation metrics, leading to limited generality across various tasks. To overcome these issues, we propose an explainable and versatile framework DVR which can enhance the efficiency of data utilization tailored to any requirements of the model architectures and evaluation metrics. For explainable data valuation, a data valuator is presented to evaluate the data quality via calculating its Shapley value from the game-theoretic perspective, ensuring robust mathematical properties and reliability. In order to accommodate various evaluation metrics, including differentiable and non-differentiable ones, a metric adapter is devised based on reinforcement learning, where a metric is treated as the reinforcement reward that guides model optimization. Extensive experiments conducted on various benchmarks verify that our framework can improve the performance of current recommendation algorithms on various metrics including ranking accuracy, diversity, and fairness. Specifically, our framework achieves up to 34.7\% improvements over existing methods in terms of representative NDCG metric. The code is available at https://github.com/renqii/DVR.
中文:提出的DVR框架通过博弈论视角的Shapley值实现可解释数据评估,并结合基于强化学习的指标适配器兼容各类评估标准,在多个基准测试中显著提升了推荐系统的综合性能。
English: The proposed DVR framework introduces an explainable and versatile approach to data valuation in recommender systems, using Shapley values for transparent quality assessment and a reinforcement learning-based metric adapter to accommodate various evaluation metrics, achieving significant performance improvements across multiple benchmarks.
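Exact Shapley values are exponential in the number of data points, so they are usually approximated; a generic Monte Carlo estimator makes the data-valuation idea concrete. The `utility` function (e.g., a validation metric of a model trained on a subset) is assumed to be supplied by the caller, and DVR's actual valuator and its reinforcement-learning metric adapter go beyond this sketch.

```python
import random

def monte_carlo_shapley(points, utility, num_perms=200, seed=0):
    """Estimate each data point's Shapley value by sampling permutations (sketch)."""
    rng = random.Random(seed)
    n = len(points)
    values = [0.0] * n
    for _ in range(num_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        subset, prev = [], utility([])
        for idx in perm:
            subset.append(points[idx])
            cur = utility(subset)
            values[idx] += (cur - prev) / num_perms   # marginal contribution
            prev = cur
    return values
```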
Authors:Karish Grover, Geoffrey J. Gordon, Christos Faloutsos
Abstract:
Does the intrinsic curvature of complex networks hold the key to unveiling graph anomalies that conventional approaches overlook? Reconstruction-based graph anomaly detection (GAD) methods overlook such geometric outliers, focusing only on structural and attribute-level anomalies. To this end, we propose CurvGAD - a mixed-curvature graph autoencoder that introduces the notion of curvature-based geometric anomalies. CurvGAD introduces two parallel pipelines for enhanced anomaly interpretability: (1) Curvature-equivariant geometry reconstruction, which focuses exclusively on reconstructing the edge curvatures using a mixed-curvature, Riemannian encoder and Gaussian kernel-based decoder; and (2) Curvature-invariant structure and attribute reconstruction, which decouples structural and attribute anomalies from geometric irregularities by regularizing graph curvature under discrete Ollivier-Ricci flow, thereby isolating the non-geometric anomalies. By leveraging curvature, CurvGAD refines the existing anomaly classifications and identifies new curvature-driven anomalies. Extensive experimentation over 10 real-world datasets (both homophilic and heterophilic) demonstrates an improvement of up to 6.5% over state-of-the-art GAD methods. The code is available at: https://github.com/karish-grover/curvgad.
Chinese: CurvGAD提出了一种新的图异常检测方法,通过混合曲率重构识别几何异常,相比现有方法将检测准确率最高提升了6.5%。
English: CurvGAD introduces a novel graph anomaly detection method that identifies geometric outliers through mixed-curvature reconstruction, improving detection accuracy by up to 6.5% over existing approaches.
Authors:Peiyao Xiao, Chaosheng Dong, Shaofeng Zou, Kaiyi Ji
Abstract:
Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of $\mathcal{O}(K)$ in both time and memory, where $K$ is the number of tasks. In this paper, we propose LDC-MTL, a simple and scalable loss discrepancy control approach for MTL, formulated from a bilevel optimization perspective. Our method incorporates three key components: (i) a coarse loss pre-normalization, (ii) a bilevel formulation for fine-grained loss discrepancy control, and (iii) a scalable first-order bilevel algorithm that requires only $\mathcal{O}(1)$ time and memory. Theoretically, we prove that LDC-MTL guarantees convergence not only to a stationary point of the bilevel problem with loss discrepancy control but also to an $\epsilon$-accurate Pareto stationary point for all $K$ loss functions under mild conditions. Extensive experiments on diverse multi-task datasets demonstrate the superior performance of LDC-MTL in both accuracy and efficiency. Code is available at https://github.com/OptMN-Lab/LDC-MTL.
中文: LDC-MTL是一种基于双层优化的可扩展多任务学习方法,能以O(1)计算成本控制损失差异,在实验中展现出优越性能并具有理论收敛保证。
English: LDC-MTL is a scalable multi-task learning method that uses bilevel optimization to control loss discrepancy with only O(1) computational cost, achieving both theoretical convergence guarantees and superior performance in experiments.
Authors:Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He
Abstract:
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.
Chinese: RELAY通过将思维链推理步骤与循环Transformer的迭代对齐,使其能够为超出训练长度的复杂问题生成准确推理链,进而用于微调自回归模型以提升性能。
English: RELAY aligns CoT reasoning steps with loop iterations in Looped Transformers, enabling them to generate accurate reasoning chains for complex problems beyond training length, which are then used to fine-tune auto-regressive models for improved performance.
Authors:Thomas Cass, Francesco Piatti, Jeffrey Pei
Abstract:
Signature kernels have emerged as a powerful tool within kernel methods for sequential data. In the paper "The Signature Kernel is the solution of a Goursat PDE", the authors identify a kernel trick that demonstrates that, for continuously differentiable paths, the signature kernel satisfies a Goursat problem for a hyperbolic partial differential equation (PDE) in two independent time variables. While finite difference methods have been explored for this PDE, they face limitations in accuracy and stability when handling highly oscillatory inputs. In this work, we introduce two advanced numerical schemes that leverage polynomial representations of boundary conditions through either approximation or interpolation techniques, and rigorously establish the theoretical convergence of the polynomial approximation scheme. Experimental evaluations reveal that our approaches yield improvements of several orders of magnitude in mean absolute percentage error (MAPE) compared to traditional finite difference schemes, without increasing computational complexity. Furthermore, like finite difference methods, our algorithms can be GPU-parallelized to reduce computational complexity from quadratic to linear in the length of the input sequences, thereby improving scalability for high-frequency data. We have implemented these algorithms in a dedicated Python library, which is publicly available at: https://github.com/FrancescoPiatti/polysigkernel.
中文: 本文提出了两种利用多项式表示的先进数值方案来解决签名核的Goursat偏微分方程,通过GPU并行化实现了精度数量级的提升和线性计算复杂度。
English: This paper introduces two advanced numerical schemes using polynomial representations to solve the signature kernel's Goursat PDE, achieving orders of magnitude improvement in accuracy and linear computational complexity via GPU parallelization.
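For context, the baseline finite-difference scheme the paper improves upon is short: for piecewise-linear paths x and y, the signature kernel k(s, t) satisfies the Goursat PDE d^2 k / (ds dt) = <x'(s), y'(t)> k with boundary conditions k(0, .) = k(., 0) = 1, and a first-order explicit update fills in the grid. The proposed polynomial approximation/interpolation schemes replace this update with higher-order boundary representations and are far more accurate on oscillatory inputs.

```python
import numpy as np

def signature_kernel_fd(x, y):
    """First-order finite-difference solver for the signature kernel's
    Goursat PDE (baseline sketch, not the paper's polynomial schemes).

    x: (m + 1, d) and y: (n + 1, d) piecewise-linear path samples.
    """
    dx = np.diff(x, axis=0)            # (m, d) increments of x
    dy = np.diff(y, axis=0)            # (n, d) increments of y
    inner = dx @ dy.T                  # (m, n) pairwise increment inner products

    m, n = inner.shape
    k = np.ones((m + 1, n + 1))        # boundary: k(0, .) = k(., 0) = 1
    for i in range(m):
        for j in range(n):
            k[i + 1, j + 1] = (k[i + 1, j] + k[i, j + 1]
                               - k[i, j] + inner[i, j] * k[i, j])
    return k[-1, -1]
```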
Authors:Daeyoung Roh, Donghee Han, Daehee Kim, Keejun Han, Mun Yi
Abstract:
Hypergraphs provide a superior modeling framework for representing complex multidimensional relationships in the context of real-world interactions that often occur in groups, overcoming the limitations of traditional homogeneous graphs. However, there have been few studies on hypergraph-based contrastive learning, and existing graph-based contrastive learning methods have not been able to fully exploit the high-order correlation information in hypergraphs. Here, we propose a Hypergraph Fine-grained contrastive learning (HyFi) method designed to exploit the complex high-dimensional information inherent in hypergraphs. While avoiding traditional graph augmentation methods that corrupt the hypergraph topology, the proposed method provides a simple and efficient learning augmentation function by adding noise to node features. Furthermore, we expand beyond the traditional dichotomous relationship between positive and negative samples in contrastive learning by introducing a new relationship of weak positives. It demonstrates the importance of fine-graining positive samples in contrastive learning. Therefore, HyFi is able to produce high-quality embeddings, and outperforms both supervised and unsupervised baselines in average rank on node classification across 10 datasets. Our approach effectively exploits high-dimensional hypergraph information, shows significant improvement over existing graph-based contrastive learning methods, and is efficient in terms of training speed and GPU memory cost. The source code is available at https://github.com/Noverse0/HyFi.git.
中文:HyFi方法通过引入特征噪声和弱正样本关系,有效利用超图的高维信息,在节点分类任务中展现出优于现有方法的性能。
English: The HyFi method introduces a hypergraph contrastive learning approach that enhances high-dimensional information utilization through feature noise injection and weak positive sample relationships, achieving superior performance in node classification tasks.
Authors:Tao Huang, Junli Ren, Huayi Wang, Zirui Wang, Qingwei Ben, Muning Wen, Xiao Chen, Jianan Li, Jiangmiao Pang
Abstract:
Standing-up control is crucial for humanoid robots, with the potential for integration into current locomotion and loco-manipulation systems, such as fall recovery. Existing approaches are either limited to simulations that overlook hardware constraints or rely on predefined ground-specific motion trajectories, failing to enable standing up across postures in real-world scenes. To bridge this gap, we present HoST (Humanoid Standing-up Control), a reinforcement learning framework that learns standing-up control from scratch, enabling robust sim-to-real transfer across diverse postures. HoST effectively learns posture-adaptive motions by leveraging a multi-critic architecture and curriculum-based training on diverse simulated terrains. To ensure successful real-world deployment, we constrain the motion with smoothness regularization and implicit motion speed bound to alleviate oscillatory and violent motions on physical hardware, respectively. After simulation-based training, the learned control policies are directly deployed on the Unitree G1 humanoid robot. Our experimental results demonstrate that the controllers achieve smooth, stable, and robust standing-up motions across a wide range of laboratory and outdoor environments. Videos and code are available at https://taohuang13.github.io/humanoid-standingup.github.io/.
Authors:Junyi An, Chao Qu, Yun-Fei Shi, XinHao Liu, Qianwei Tang, Fenglei Cao, Yuan Qi
Abstract:
Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns regarding GNNs' ability to effectively capture the fundamental principles of physics and chemistry, which constrains their generalization capabilities. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute masking techniques, EMPP formulates a nuanced position prediction task that is more well-defined and enhances the learning of quantum mechanical features. EMPP also bypasses the approximation of the Gaussian mixture distribution commonly used in denoising methods, allowing for more accurate acquisition of physical properties. Experimental results indicate that EMPP significantly enhances the performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches. Our code is released at https://github.com/ajy112/EMPP.
中文: 本研究提出的等变掩码位置预测方法通过精确的位置预测任务,有效增强了图神经网络对量子力学特征的学习能力,在分子建模中显著超越了现有自监督方法。
English: The proposed Equivariant Masked Position Prediction (EMPP) method enhances graph neural networks' ability to learn quantum mechanical features through precise position prediction, significantly outperforming existing self-supervised approaches in molecular modeling.
Authors:Tingyi Cai, Yunliang Jiang, Yixin Liu, Ming Li, Changqin Huang, Shirui Pan
Abstract:
Graph machine learning has witnessed rapid growth, driving advancements across diverse domains. However, the in-distribution assumption, where training and testing data share the same distribution, often breaks in real-world scenarios, leading to degraded model performance under distribution shifts. This challenge has catalyzed interest in graph out-of-distribution (GOOD) detection, which focuses on identifying graph data that deviates from the distribution seen during training, thereby enhancing model robustness. In this paper, we provide a rigorous definition of GOOD detection and systematically categorize existing methods into four types: enhancement-based, reconstruction-based, information propagation-based, and classification-based approaches. We analyze the principles and mechanisms of each approach and clarify the distinctions between GOOD detection and related fields, such as graph anomaly detection, outlier detection, and GOOD generalization. Beyond methodology, we discuss practical applications and theoretical foundations, highlighting the unique challenges posed by graph data. Finally, we discuss the primary challenges and propose future directions to advance this emerging field. The repository of this survey is available at https://github.com/ca1man-2022/Awesome-GOOD-Detection.
English Summary: This survey rigorously defines graph out-of-distribution (GOOD) detection, which targets the performance degradation caused by distribution shifts in graph machine learning, categorizes existing methods into four approaches, distinguishes GOOD detection from related fields, and outlines future research directions.
Authors:Anthony D. Blaom, Samuel Okon
Abstract:
A new computational tool TumorGrowth.jl for modeling tumor growth is introduced. The tool allows the comparison of standard textbook models, such as General Bertalanffy and Gompertz, with some newer models, including, for the first time, neural ODE models. As an application, we revisit a human meta-study of non-small cell lung cancer and bladder cancer lesions, in patients undergoing two different treatment options, to determine if previously reported performance differences are statistically significant, and if newer, more complex models perform any better. In a population of examples with at least four time-volume measurements available for calibration, and an average of about 6.3, our main conclusion is that the General Bertalanffy model has superior performance, on average. However, where more measurements are available, we argue that more complex models, capable of capturing rebound and relapse behavior, may be better choices.
中文: 新推出的计算工具TumorGrowth.jl用于模拟肿瘤生长,能够比较通用Bertalanffy和Gompertz等标准模型与新型神经ODE模型;研究发现,在数据有限的情况下通用Bertalanffy模型平均表现更优,但当测量数据充足时,能捕捉复发行为的复杂模型可能是更好选择。
English: A new computational tool, TumorGrowth.jl, is introduced for modeling tumor growth, enabling comparison of standard models like General Bertalanffy and Gompertz with newer neural ODE models, with findings showing General Bertalanffy's superior average performance in cases with limited data but suggesting complex models may be preferable when more measurements are available to capture rebound and relapse behavior.
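The tool itself is a Julia package, but the two classical growth laws it compares are easy to state and simulate elsewhere. Below is a hedged Python sketch of the Gompertz and General Bertalanffy ODEs in their common textbook forms; the parameterization may differ cosmetically from TumorGrowth.jl, and the parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

def gompertz_rhs(t, v, v_inf, omega):
    """Gompertz growth: dv/dt = omega * v * log(v_inf / v)."""
    return omega * v * np.log(v_inf / v)

def general_bertalanffy_rhs(t, v, v_inf, omega, lam):
    """General Bertalanffy growth: dv/dt = (omega / lam) * v * ((v_inf / v)**lam - 1).
    Gompertz is recovered in the limit lam -> 0."""
    return (omega / lam) * v * ((v_inf / v) ** lam - 1.0)

# Simulate both models from the same initial volume over 100 days (illustrative values).
t_eval = np.linspace(0, 100, 50)
gom = solve_ivp(gompertz_rhs, (0, 100), [0.1], t_eval=t_eval, args=(2.0, 0.05))
bert = solve_ivp(general_bertalanffy_rhs, (0, 100), [0.1], t_eval=t_eval,
                 args=(2.0, 0.05, 1 / 3))
print(gom.y[0][-1], bert.y[0][-1])
```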
Authors:Ashkan Shahbazi, Elaheh Akbari, Darian Salehi, Xinran Liu, Navid Naderializadeh, Soheil Kolouri
Abstract:
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
Chinese: 本文提出了一种基于切片最优传输的新型全并行双随机注意力机制,无需迭代Sinkhorn归一化即可提高效率并保持可微性,在多种基准任务中均实现了性能提升。
English: This paper introduces a novel, fully parallelizable doubly-stochastic attention mechanism using sliced optimal transport, which eliminates the need for iterative Sinkhorn normalization and improves efficiency while maintaining differentiability, leading to enhanced performance across various benchmark tasks.
Authors:Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, Muhan Zhang
Abstract:
In this paper, we present TransMLA, a framework that seamlessly converts any GQA-based pre-trained model into an MLA-based model. Our approach enables direct compatibility with DeepSeek's codebase, allowing these models to fully leverage DeepSeek-specific optimizations such as vLLM and SGlang. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length while preserving meaningful output quality. Additionally, the model requires only 6 billion tokens for fine-tuning to regain performance on par with the original across multiple benchmarks. TransMLA offers a practical solution for migrating GQA-based models to the MLA structure. When combined with DeepSeek's advanced features, such as FP8 quantization and Multi-Token Prediction, even greater inference acceleration can be realized.
中文:TransMLA框架可将GQA预训练模型无缝转换为MLA架构,在压缩93% KV缓存的同时实现10.6倍推理加速,并通过少量微调即可恢复原有性能表现。
English: TransMLA enables seamless conversion of GQA-based models to MLA architecture, achieving a 10.6x inference speedup with 93% KV cache compression while maintaining performance parity after minimal fine-tuning.
Authors:Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko
Abstract:
Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
中文摘要:本研究提出一种列级量化方法,通过统一权重和部分和的量化粒度,在保持硬件效率的同时,有效提升了存内计算深度神经网络的精度与鲁棒性。
English Summary: This study introduces a column-wise quantization method that aligns weight and partial-sum granularities to enhance accuracy and robustness in compute-in-memory DNNs while maintaining hardware efficiency.
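To make the column-wise idea above concrete, here is a minimal Python sketch of symmetric per-column weight quantization with an independent scale factor per column. It is an illustrative sketch only: the bit-width, tensor shapes, and rounding scheme are assumptions, and the CIM-specific partial-sum handling from the paper is not shown.

```python
import torch

def quantize_columnwise(weight: torch.Tensor, n_bits: int = 4):
    """Symmetric per-column quantization with an independent scale per column.

    weight: (in_features, out_features). Each column j gets its own scale factor,
    so variations affecting one column do not perturb the others.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scales = weight.abs().amax(dim=0).clamp(min=1e-8) / qmax       # (out_features,)
    q = torch.clamp(torch.round(weight / scales), -qmax - 1, qmax)
    return q.to(torch.int8), scales

def dequantize_columnwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return q.float() * scales

w = torch.randn(128, 64)
q, s = quantize_columnwise(w, n_bits=4)
w_hat = dequantize_columnwise(q, s)
print("max abs error:", (w - w_hat).abs().max().item())
```

Keeping one scale per column is what lets weight granularity match column-wise partial-sum quantization in the approach described above.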
Authors:Mark Schöne, Babak Rahmani, Heiner Kremer, Fabian Falck, Hitesh Ballani, Jannes Gladrow
Abstract:
State-space models (SSMs) and transformers dominate the language modeling landscape. However, they are constrained to a lower computational complexity than classical recurrent neural networks (RNNs), limiting their expressivity. In contrast, RNNs lack parallelization during training, raising fundamental questions about the trade-off between parallelization and expressivity. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Theoretically, we show that implicit SSMs implement the non-linear state-transitions of RNNs. Empirically, we find that only approximate fixed-point convergence suffices, enabling the design of a scalable training curriculum that largely retains parallelization, with full convergence required only for a small subset of tokens. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs. We further scale implicit SSMs to natural language reasoning tasks and pretraining of large-scale language models up to 1.3B parameters on 207B tokens, representing, to our knowledge, the largest implicit model trained to date. Notably, our implicit models outperform their explicit counterparts on standard benchmarks. Our code is publicly available at http://github.com/microsoft/implicit_languagemodels.
中文摘要:隐式状态空间模型通过近似定点收敛实现了类似RNN的非线性状态转换,在保持并行化训练的同时解决了传统模型表达能力受限的问题,在合成语言和自然语言任务上均展现出卓越性能。
English Summary: Implicit state-space models overcome the expressivity limitations of transformers and SSMs by implementing RNN-like non-linear transitions while maintaining parallelization through approximate fixed-point convergence, achieving superior performance on both synthetic and natural language tasks.
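The following toy sketch illustrates the fixed-point iteration the abstract describes: a state transition is iterated until approximate convergence, and the loop exits early once updates become small. The specific transition (a contractive tanh map), tolerance, and shapes are illustrative assumptions, not the paper's architecture.

```python
import torch

def implicit_state_update(x_t, h_init, W_h, W_x, b, tol=1e-4, max_iters=50):
    """Iterate h <- f(h, x_t) until approximate fixed-point convergence.

    f is an illustrative contractive map (W_h rescaled to spectral norm < 1),
    standing in for the non-linear state transition the implicit SSM realizes.
    """
    W_h = 0.9 * W_h / torch.linalg.matrix_norm(W_h, ord=2)  # ensure a contraction
    h = h_init
    for i in range(max_iters):
        h_next = torch.tanh(h @ W_h + x_t @ W_x + b)
        if torch.norm(h_next - h) < tol:        # approximate convergence suffices
            return h_next, i + 1
        h = h_next
    return h, max_iters

d_in, d_h = 16, 32
h, iters = implicit_state_update(
    torch.randn(1, d_in), torch.zeros(1, d_h),
    torch.randn(d_h, d_h), torch.randn(d_in, d_h), torch.zeros(d_h),
)
print(f"converged in {iters} iterations")
```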
Authors:Feng Liang, Haoyu Ma, Zecheng He, Tingbo Hou, Ji Hou, Kunpeng Li, Xiaoliang Dai, Felix Juefei-Xu, Samaneh Azadi, Animesh Sinha, Peizhao Zhang, Peter Vajda, Diana Marculescu
Abstract:
Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts, including face, body, and animal images, into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
中文:Movie Weaver通过锚定提示和概念嵌入技术,实现了多概念视频个性化生成,有效避免了身份混合问题,在保持各参考图像独立特征的同时,其生成质量优于现有方法。
English: Movie Weaver introduces anchored prompts and concept embeddings to enable seamless multi-concept video personalization, effectively preventing identity blending while preserving distinct attributes from multiple reference images and outperforming existing methods in quality.
Authors:Leyang Hu, Matteo Gamba, Randall Balestriero
Abstract:
The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions, thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 7.14%/8.46% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
Chinese: 本文提出曲率调优(CT)方法,通过向激活函数引入单一超参数来调整模型决策边界,在多个数据集上显著提升了模型的泛化能力和鲁棒性。
English: This paper introduces Curvature Tuning (CT), an interpretable method that adjusts a model's decision boundary by modifying activation functions with a single hyperparameter, enhancing both generalization and robustness across multiple datasets.
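As a concrete illustration of steering a network through one scalar in its activation functions, here is a minimal sketch that blends ReLU with a smoother Softplus using a single (optionally trainable) parameter and swaps it into a pretrained model. The parameterization is an assumption for illustration, not the paper's exact CT formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleParamActivation(nn.Module):
    """Activation steered by a single scalar beta in [0, 1].

    beta = 0 keeps plain ReLU; larger beta blends in a smoother Softplus,
    changing the curvature the network realizes. Illustrative only.
    """
    def __init__(self, beta: float = 0.1, trainable: bool = True):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta)), requires_grad=trainable)

    def forward(self, x):
        beta = self.beta.clamp(0.0, 1.0)      # keep the mixing weight in [0, 1]
        return (1 - beta) * F.relu(x) + beta * F.softplus(x)

def swap_activations(model: nn.Module, beta: float = 0.1):
    """Replace every ReLU in a (pretrained) model with the steerable activation."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, SingleParamActivation(beta))
        else:
            swap_activations(child, beta)
    return model

model = swap_activations(nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2)))
print(model)
```

Making `beta` trainable while freezing the rest of the network is what turns the steering knob into a very parameter-efficient finetuning scheme, as the abstract describes.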
Authors:Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
Abstract:
Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training. Code is at: https://github.com/IST-DASLab/DarwinLM
Chinese: DarwinLM是一种训练感知的结构化剪枝方法,通过进化搜索和多阶段训练,在降低计算成本的同时高效压缩大语言模型并保持优异性能。
English: DarwinLM is a training-aware structured pruning method that uses evolutionary search and multistep training to efficiently compress large language models while maintaining high performance with reduced computational costs.
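The search loop described above can be illustrated with a toy evolutionary sketch: offspring are produced by mutation, weak candidates are eliminated under progressively larger training budgets, and the fittest survivor becomes the next parent. The candidate encoding, mutation, and fitness function below are placeholders, not the paper's pruning or post-training procedure.

```python
import random

def evolutionary_search(init_candidate, mutate, train_and_score,
                        generations=5, offspring=8, budgets=(100, 500, 2000)):
    """Mutate a parent, then prune the population in stages of increasing budget."""
    parent = init_candidate
    for g in range(generations):
        population = [mutate(parent) for _ in range(offspring)] + [parent]
        for budget in budgets:                          # multistep "post-training"
            scored = [(train_and_score(c, budget), c) for c in population]
            scored.sort(key=lambda t: t[0], reverse=True)
            population = [c for _, c in scored[: max(2, len(scored) // 2)]]
        parent = population[0]                          # fittest survivor
        print(f"generation {g}: best score {train_and_score(parent, budgets[-1])}")
    return parent

# Toy problem: candidates are 0/1 keep-masks over 16 units.
def mutate(candidate):
    c = candidate[:]
    i = random.randrange(len(c))
    c[i] = 1 - c[i]
    return c

def train_and_score(candidate, budget):
    # Toy fitness rewarding keeping exactly half of the units; in the real method
    # the budget would control how much recovery training each candidate receives.
    return -abs(sum(candidate) - len(candidate) // 2)

best = evolutionary_search([1] * 16, mutate, train_and_score)
print("best candidate:", best)
```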
Authors:Song Liu, Leyang Wang, Yakun Wang
Abstract:
Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to efficiently approximate the natural gradient of the KL divergence without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms. The code of the proposed algorithms can be found at https://github.com/anewgithubname/iNGD.
Chinese: 本文提出一种方法,将时变生成模型的演化投影到指数族流形上,建立其与概率模型的联系,并通过自然梯度下降进行训练,无需MCMC即可高效近似KL散度。
English: This paper introduces a method that projects the evolution of time-varying generative models onto an exponential family manifold, linking them to probabilistic models and enabling training via natural gradient descent without MCMC for efficient KL divergence approximation.
Authors:Arvind Pillai, Dimitris Spathis, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell
Abstract:
Large language models (LLMs) show promise for health applications when combined with behavioral sensing data. Traditional approaches convert sensor data into text prompts, but this process is prone to errors, computationally expensive, and requires domain expertise. These challenges are particularly acute when processing extended time series data. While time series foundation models (TFMs) have recently emerged as powerful tools for learning representations from temporal data, bridging TFMs and LLMs remains challenging. Here, we present Time2Lang, a framework that directly maps TFM outputs to LLM representations without intermediate text conversion. Our approach first trains on synthetic data using periodicity prediction as a pretext task, followed by evaluation on mental health classification tasks. We validate Time2Lang on two longitudinal wearable and mobile sensing datasets: daily depression prediction using step count data (17,251 days from 256 participants) and flourishing classification based on conversation duration (46 participants over 10 weeks). Time2Lang maintains near constant inference times regardless of input length, unlike traditional prompting methods. The generated embeddings preserve essential time-series characteristics such as auto-correlation. Our results demonstrate that TFMs and LLMs can be effectively integrated while minimizing information loss and enabling performance transfer across these distinct modeling paradigms. To our knowledge, we are the first to integrate a TFM and an LLM for health, thus establishing a foundation for future research combining general-purpose large models for complex healthcare tasks.
中文摘要:Time2Lang框架通过将时序基础模型的输出直接映射到大型语言模型表示,无需文本转换即可有效整合两种模型,在心理健康分类任务中实现恒定推理时间并保持时序特征,为医疗健康应用建立了新范式。
English Summary: The Time2Lang framework effectively bridges time series foundation models and large language models by directly mapping sensor data representations without text conversion, enabling efficient mental health classification with constant inference time and preserved temporal characteristics.
Authors:Marten Lienen, Marcel Kollovieh, Stephan Günnemann
Abstract:
We derive a novel generative model from iterative Gaussian posterior inference. By treating the generated sample as an unknown variable, we can formulate the sampling process in the language of Bayesian probability. Our model uses a sequence of prediction and posterior update steps to iteratively narrow down the unknown sample starting from a broad initial belief. In addition to a rigorous theoretical analysis, we establish a connection between our model and diffusion models and show that it includes Bayesian Flow Networks (BFNs) as a special case. In our experiments, we demonstrate that our model improves sample quality on ImageNet32 over both BFNs and the closely related Variational Diffusion Models, while achieving equivalent log-likelihoods on ImageNet32 and CIFAR10. Find our code at https://github.com/martenlienen/bsi.
中文: 本文提出了一种基于迭代高斯后验推断的新型生成模型,在ImageNet32上相比贝叶斯流网络和变分扩散模型提升了样本质量,同时保持了相当的似然性能。
English: This paper introduces a novel generative model based on iterative Gaussian posterior inference, which enhances sample quality on ImageNet32 compared to Bayesian Flow Networks and Variational Diffusion Models while maintaining comparable log-likelihood performance.
Authors:Cong Lu, Shengran Hu, Jeff Clune
Abstract:
Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of these abilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers a diverse spectrum of surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically generates thousands of distinct tasks, which are then clustered to reveal dozens of broader capability areas and failure modes that would be challenging for any single team to uncover. We further validate our method's automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models' ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at https://github.com/conglu1997/ACD.
中文摘要:自动化能力发现(ACD)框架将一个基础模型作为科学家,通过开放式探索和自我评估,系统地生成和评估任务,自动揭示目标模型的广泛能力与风险。
English Summary: The Automated Capability Discovery (ACD) framework employs one foundation model as a scientist to systematically generate and evaluate tasks, automatically uncovering a wide range of capabilities and risks in subject models through open-ended exploration and self-assessment.
Authors:Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
Abstract:
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability to longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements of both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.
中文: 本文提出LASP-2这一新型序列并行方法,通过重构通信计算流程仅需单次AllGather操作,显著提升了长序列线性注意力Transformer训练的通信与计算并行效率。
English: This paper introduces LASP-2, a novel sequence parallelism method that enhances communication and computation parallelism for training linear attention transformers with long sequences by minimizing communication overhead through a redesigned workflow requiring only one AllGather operation.
Authors:Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher
Abstract:
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion .
中文: 本文提出了Scion随机优化算法,利用线性最小化预言机适应问题几何结构并适用于无约束问题,实验显示其能加速nanoGPT训练,具有内存高效性和超参数跨模型尺寸可迁移性。
English: This paper introduces Scion, a stochastic optimization algorithm that uses a linear minimization oracle to adapt to problem geometry and applies to unconstrained problems, demonstrating faster nanoGPT training with memory efficiency and hyperparameter transferability across model sizes.
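To illustrate what an update built on a linear minimization oracle looks like, here is a minimal sketch over an l2 norm ball applied to an unconstrained toy quadratic. The norm choice, radius, fixed step size, and convex-combination update are illustrative assumptions; this is not the exact Scion algorithm.

```python
import numpy as np

def lmo_l2_ball(g, radius):
    """Linear minimization oracle over the l2 ball: argmin_{||s|| <= r} <g, s>."""
    norm = np.linalg.norm(g)
    return -radius * g / norm if norm > 0 else np.zeros_like(g)

def lmo_descent(grad_fn, w0, radius=1.0, lr=0.1, steps=200):
    """Descent that steps toward the LMO direction, so step geometry is set
    by the chosen norm ball rather than by the raw gradient."""
    w = w0.copy()
    for _ in range(steps):
        s = lmo_l2_ball(grad_fn(w), radius)
        w = (1 - lr) * w + lr * s          # convex-combination style update
    return w

# Toy objective: f(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad_fn = lambda w: A.T @ (A @ w - b)
w = lmo_descent(grad_fn, np.zeros(5))
print("final loss:", 0.5 * np.linalg.norm(A @ w - b) ** 2)
```

Swapping the l2 ball for a norm tailored to a layer's shape is what the abstract means by choosing an explicit norm for deep architectures.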
Authors:Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
Abstract:
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
中文摘要:LongReD方法通过恢复蒸馏技术解决大语言模型扩展上下文窗口时出现的分布漂移和灾难性遗忘问题,有效缓解了短文本任务性能下降,同时保持长文本处理能力。
English Summary: The LongReD method is introduced to counteract the performance decline of large language models on short-text tasks when their context windows are expanded, by addressing distribution drift and catastrophic forgetting through restoration distillation techniques.
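A schematic sketch of the loss combination described above: a long-text language-modeling loss plus a restoration term that matches the extended model's hidden states to the original model's on short texts at a few selected layers. The tensors below are dummies, and the layer indices and weighting are assumptions, not the paper's exact recipe (the short-to-long distillation term is omitted).

```python
import torch
import torch.nn.functional as F

def restoration_distillation_loss(student_hiddens, teacher_hiddens,
                                  layers=(4, 8, 12), weight=1.0):
    """MSE between the extended (student) model's hidden states and the frozen
    original (teacher) model's hidden states on short texts, at selected layers."""
    loss = 0.0
    for l in layers:
        loss = loss + F.mse_loss(student_hiddens[l], teacher_hiddens[l].detach())
    return weight * loss / len(layers)

# Dummy hidden states: list indexed by layer, each (batch, seq_len, hidden).
num_layers, B, T, H = 16, 2, 128, 64
student = [torch.randn(B, T, H, requires_grad=True) for _ in range(num_layers)]
teacher = [torch.randn(B, T, H) for _ in range(num_layers)]

long_text_lm_loss = torch.tensor(2.3)      # placeholder for the usual LM loss
total = long_text_lm_loss + restoration_distillation_loss(student, teacher)
print("combined training loss:", total.item())
```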
Authors:Zilu Dong, Xiangqing Shen, Rui Xia
Abstract:
As large language models continue to scale up, knowledge editing techniques that modify models' internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover that MEMIT's editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals this stems from MEMIT's key-value modeling framework: identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in update conflicts during editing. Addressing this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that when MEMIT's edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions. The code is available at https://github.com/NUSTM/MEMIT-Merge.
中文: MEMIT算法在处理同主体批量知识编辑时,因相同键值对应不同知识导致更新冲突,性能显著下降至约50%成功率;而提出的MEMIT-Merge方法通过合并同主体事实的值计算过程,将成功率稳定保持在90%以上,有效解决了该问题。
English: MEMIT, a batch knowledge editing method for large language models, suffers performance degradation when handling multiple edits with the same subject due to conflicting key-value updates, but the proposed MEMIT-Merge enhancement resolves this by merging value computations, maintaining over 90% success rate versus MEMIT's 50% drop.
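A toy sketch of the core grouping idea: collect requested edits that share a subject and merge their value targets before writing, so one key is not forced to carry conflicting values. The averaging below is purely illustrative; the actual MEMIT key/value derivation and layer updates are not shown.

```python
from collections import defaultdict
import numpy as np

def merge_edits_by_subject(edits):
    """edits: list of (subject, value_vector). Returns one merged target per subject.

    Naive batching would derive (nearly) identical keys for edits sharing a
    subject, so their separate value targets conflict; merging the value
    computation (here: averaging toy targets) removes that collision.
    """
    grouped = defaultdict(list)
    for subject, value in edits:
        grouped[subject].append(np.asarray(value, dtype=float))
    return {subj: np.mean(vals, axis=0) for subj, vals in grouped.items()}

edits = [
    ("Paris",  [1.0, 0.0, 0.0]),   # a fact about Paris
    ("Paris",  [0.0, 1.0, 0.0]),   # another fact about the same subject
    ("London", [0.0, 0.0, 1.0]),
]
for subj, vec in merge_edits_by_subject(edits).items():
    print(subj, vec)
```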
Authors:Yuechen Xie, Jie Song, Mengqi Xue, Haofei Zhang, Xingen Wang, Bingde Hu, Genlang Chen, Mingli Song
Abstract:
High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised pre-trained models by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a $p$-value markedly below $0.05$, surpassing all previous methodologies. Our code is available at https://github.com/xieyc99/DOV4CL.
中文: 本研究首次提出针对自监督预训练模型的数据集所有权验证方法,通过对比学习分析嵌入空间中的实例关系来准确判断模型是否使用特定数据集训练,并在多个模型中验证了其显著有效性。
English: This study introduces the first dataset ownership verification method for self-supervised pre-trained models using contrastive learning, effectively determining if a model was trained on a specific dataset by analyzing embedding space relationships, with validation across multiple models showing significant results.
Authors:Xuefeng Liu, Songhao Jiang, Siyu Chen, Zhuoran Yang, Yuxin Chen, Ian Foster, Rick Stevens
Abstract:
Finetuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduces a novel reinforcement learning algorithm to finetune a drug optimization LLM-based generative model, enhancing the original drug across target objectives while retaining the beneficial chemical properties of the original drug. This work comprises two primary components: (1) DrugImprover: a framework tailored for improving robustness and efficiency in drug optimization. It includes an LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective for fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from the SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: https://github.com/xuefeng-cs/DrugImproverGPT.
中文摘要:本研究提出了一种新颖的强化学习算法——结构化策略优化(SPO),用于微调药物优化大语言模型,在提升目标特性的同时保持原始药物的有益化学性质。
English Summary: This research introduces a reinforcement learning algorithm called Structured Policy Optimization (SPO) to fine-tune a drug optimization LLM, improving target properties while preserving beneficial chemical characteristics of original drugs.
Authors:Girish A. Koushik, Diptesh Kanojia, Helen Treharne
Abstract:
Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9 percentage-point F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at https://github.com/gak97/Video-vs-Meme-Hate.
Chinese: 本研究系统评估了基于融合的多模态仇恨内容检测方法,发现简单嵌入融合在视频内容上表现优异,但在处理表情包中复杂的图文关系时存在不足,因其难以捕捉细微的跨模态交互特征。
English: This study systematically evaluates fusion-based approaches for multimodal hate detection, revealing that while simple embedding fusion excels with video content, it struggles with complex image-text relationships in memes due to limitations in capturing nuanced cross-modal interactions.
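For reference, here is a minimal sketch of the kind of "simple embedding fusion" baseline discussed above: concatenate pre-computed per-modality embeddings and classify with a small MLP. The embedding dimensions and encoders are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EmbeddingFusionClassifier(nn.Module):
    """Concatenate modality embeddings (video, audio, text) and classify."""
    def __init__(self, dims=(512, 256, 768), hidden=256, num_classes=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, video_emb, audio_emb, text_emb):
        return self.fuse(torch.cat([video_emb, audio_emb, text_emb], dim=-1))

model = EmbeddingFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)   # (4, 2)
```

Late concatenation like this has no explicit mechanism for cross-modal interaction, which is consistent with the failure modes the paper reports on image-text memes.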
Authors:Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David Mortensen
Abstract:
Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
Chinese: MixGoP是一种利用高斯混合模型对音素分布进行多子簇建模的新方法,在多数数据集上实现了最优性能,并证明自监督语音模型特征比传统方法能更有效地捕捉音位变体差异。
English: MixGoP is a novel approach that uses Gaussian mixture models to model phoneme distributions with multiple subclusters, achieving state-of-the-art performance on most datasets and demonstrating that self-supervised speech model features capture allophonic variation more effectively than traditional methods.
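The following sketch illustrates the basic mechanism: fit a Gaussian mixture with several subclusters per phoneme on frozen speech features, then score test frames by log-likelihood, with low likelihood flagging atypical realizations. Synthetic features stand in for S3M outputs, and the number of components is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen S3M frame features grouped by phoneme label.
train_feats = {
    "AA": rng.normal(loc=0.0, scale=1.0, size=(500, 32)),
    "IY": rng.normal(loc=3.0, scale=1.0, size=(500, 32)),
}

# One GMM with several subclusters per phoneme, capturing allophonic variation.
gmms = {ph: GaussianMixture(n_components=4, covariance_type="diag",
                            random_state=0).fit(X)
        for ph, X in train_feats.items()}

def goodness_of_pronunciation(frames, phoneme):
    """Average log-likelihood of the frames under that phoneme's mixture."""
    return gmms[phoneme].score_samples(frames).mean()

typical = rng.normal(loc=0.0, scale=1.0, size=(50, 32))    # resembles "AA"
atypical = rng.normal(loc=5.0, scale=1.0, size=(50, 32))   # far from "AA"
print("typical  :", goodness_of_pronunciation(typical, "AA"))
print("atypical :", goodness_of_pronunciation(atypical, "AA"))
```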
Authors:Tai Hoang, Huy Le, Philipp Becker, Vien Anh Ngo, Gerhard Neumann
Abstract:
Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterogeneous graph that comprises smaller sub-graphs, such as actuators and objects, accompanied by different edge types describing their interactions. This graph representation serves as a unified structure for both rigid and deformable objects tasks, and can be extended further to tasks comprising multiple actuators. To evaluate this setup, we present a novel and challenging reinforcement learning benchmark, including rigid insertion of diverse objects, as well as rope and cloth manipulation with multiple end-effectors. These tasks present a large search space, as both the initial and target configurations are uniformly sampled in 3D space. To address this issue, we propose a novel graph-based policy model, dubbed Heterogeneous Equivariant Policy (HEPi), utilizing $SE(3)$ equivariant message passing networks as the main backbone to exploit the geometric symmetry. In addition, by modeling explicit heterogeneity, HEPi can outperform Transformer-based and non-heterogeneous equivariant policies in terms of average returns, sample efficiency, and generalization to unseen objects. Our project page is available at https://thobotics.github.io/hepi.
Authors:Siddarth Venkatraman, Mohsin Hasan, Minsu Kim, Luca Scimeca, Marcin Sendera, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Abstract:
Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_\theta(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_\theta(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_\theta(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints, the posterior in noise space is smoother than in data space, making it more suitable for amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably with other inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
中文: 本研究提出了一种基于扩散的方法,通过强化学习在噪声空间中训练采样器,将平滑化的后验分布转换至数据空间,从而实现对复杂生成模型的高效条件采样。
English: The study introduces a diffusion-based method that uses reinforcement learning to train samplers in the noise space, enabling efficient conditional sampling from complex generative models by transforming smoothed posterior distributions into the data space.
Authors:Arghadip Das, Arnab Raha, Shamik Kundu, Soumendu Kumar Ghosh, Deepak Mathaikutty, Vijay Raghunathan
Abstract:
State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, making them ideal for long-sequence applications in NLP, vision, and edge AI, including real-time transcription, translation, and contextual search. These applications require lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. Designing specialized accelerators for every emerging neural network is costly and impractical; instead, optimizing models for existing NPUs in AI PCs provides a scalable solution. To this end, we propose XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. XAMBA follows a three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet KPI requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. Additionally, ActiBA enhances performance by approximating expensive activation functions (e.g., Swish, Softplus) using piecewise linear mappings, reducing latency with minimal accuracy loss. Evaluations on an Intel Core Ultra Series 2 AI PC show that XAMBA achieves up to 4.8X speed-up over the baseline. Our implementation is available at https://github.com/arghadippurdue/XAMBA.
Chinese: XAMBA是首个在商用NPU上实现并优化状态空间模型的框架,通过解决计算瓶颈和近似激活函数,在AI PC上实现了高达4.8倍的加速。
English: XAMBA is the first framework that enables and optimizes State-Space Models (SSMs) on commercial NPUs by addressing computational bottlenecks and approximating activation functions, achieving up to 4.8X speed-up on AI PCs.
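The idea of replacing a sequential CumSum with a matrix-based computation can be shown in a few lines: a causal (lower-triangular) matrix multiply produces the same result and maps onto matmul-friendly accelerator hardware. Shapes and the single-matrix formulation below are illustrative, not the exact CumBA kernel.

```python
import torch

def cumsum_as_matmul(x: torch.Tensor) -> torch.Tensor:
    """Cumulative sum along the time axis expressed as one matrix multiply.

    x: (batch, seq_len, dim). L is lower-triangular ones, so
    (L @ x)[t] = sum_{s <= t} x[s].
    """
    seq_len = x.shape[1]
    L = torch.tril(torch.ones(seq_len, seq_len, dtype=x.dtype, device=x.device))
    return L @ x                     # broadcasts over the batch dimension

x = torch.randn(2, 16, 8)
assert torch.allclose(cumsum_as_matmul(x), torch.cumsum(x, dim=1), atol=1e-5)
print("matrix-based cumulative sum matches torch.cumsum")
```

The same trick applies to ReduceSum (a single row of ones), which is the spirit of the CumBA/ReduBA replacements described above.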
Authors:Songtao Huang, Zhen Zhao, Can Li, Lei Bai
Abstract:
Real-world time series often have multiple frequency components that are intertwined with each other, making accurate time series forecasting challenging. Decomposing the mixed frequency components into multiple single frequency components is a natural choice. However, the information density of patterns varies across different frequencies, and employing a uniform modeling approach for different frequency components can lead to inaccurate characterization. To address these challenges, inspired by the flexibility of the recent Kolmogorov-Arnold Network (KAN), we propose a KAN-based Frequency Decomposition Learning architecture (TimeKAN) to handle the complex forecasting challenges caused by multiple frequency mixtures. Specifically, TimeKAN mainly consists of three components: Cascaded Frequency Decomposition (CFD) blocks, Multi-order KAN Representation Learning (M-KAN) blocks, and Frequency Mixing blocks. CFD blocks adopt a bottom-up cascading approach to obtain series representations for each frequency band. Benefiting from the high flexibility of KAN, we design a novel M-KAN block to learn and represent specific temporal patterns within each frequency band. Finally, Frequency Mixing blocks are used to recombine the frequency bands into the original format. Extensive experimental results across multiple real-world time series datasets demonstrate that TimeKAN achieves state-of-the-art performance as an extremely lightweight architecture. Code is available at https://github.com/huangst21/TimeKAN.
中文摘要:TimeKAN是一种基于Kolmogorov-Arnold网络的轻量级架构,通过分解多频时序成分并学习其差异化模式,在多个真实数据集上实现了最优的预测性能。
English Summary: TimeKAN is a lightweight forecasting architecture that uses Kolmogorov-Arnold Networks to decompose mixed-frequency time series components and model their distinct patterns, achieving state-of-the-art performance across multiple real-world datasets.
Authors:Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, Amit Ranjan Trivedi
Abstract:
Large Language and Vision-Language Models (LLMs/VLMs) are increasingly used in safety-critical applications, yet their opaque decision-making complicates risk assessment and reliability. Uncertainty quantification (UQ) helps assess prediction confidence and enables abstention when uncertainty is high. Conformal prediction (CP), a leading UQ method, provides statistical guarantees but relies on static thresholds, which fail to adapt to task complexity and evolving data distributions, leading to suboptimal trade-offs in accuracy, coverage, and informativeness. To address this, we propose learnable conformal abstention, integrating reinforcement learning (RL) with CP to optimize abstention thresholds dynamically. By treating CP thresholds as adaptive actions, our approach balances multiple objectives, minimizing prediction set size while maintaining reliable coverage. Extensive evaluations across diverse LLM/VLM benchmarks show our method outperforms Least Ambiguous Classifiers (LAC) and Adaptive Prediction Sets (APS), improving accuracy by up to 3.2%, boosting AUROC for hallucination detection by 22.19%, enhancing uncertainty-guided selective generation (AUARC) by 21.17%, and reducing calibration error by 70%-85%. These improvements hold across multiple models and datasets while consistently meeting the 90% coverage target, establishing our approach as a more effective and flexible solution for reliable decision-making in safety-critical applications. The code is available at: {https://github.com/sinatayebati/vlm-uncertainty}.
Chinese: 本文提出可学习的不确定性弃权方法,通过强化学习动态优化共形预测的弃权阈值,显著提升了大语言与视觉语言模型的不确定性量化性能,在多个基准测试中实现了精度提升、幻觉检测改进和校准误差降低,同时保持可靠的覆盖范围。
English: This paper introduces learnable conformal abstention, a reinforcement learning-based method that dynamically optimizes abstention thresholds in conformal prediction to improve uncertainty quantification for large language and vision-language models, achieving significant performance gains across multiple benchmarks while maintaining reliable coverage.
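For context, here is a minimal sketch of the static-threshold split conformal baseline the method above improves on: calibrate a score quantile at the target coverage, build prediction sets, and abstain when the set is not a single label. The RL-learned adaptive threshold itself is not shown, and the nonconformity score choice is an assumption.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal: nonconformity = 1 - prob of the true label; the static
    threshold is the ceil((n+1)(1-alpha))/n empirical quantile of these scores."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

def predict_or_abstain(probs, threshold):
    """Return the label if the prediction set is a singleton, else abstain (None)."""
    pred_set = np.where(1.0 - probs <= threshold)[0]
    return int(pred_set[0]) if len(pred_set) == 1 else None

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)     # calibration softmax outputs
cal_labels = rng.integers(0, 5, size=200)
thr = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print("threshold:", thr, "decision:", predict_or_abstain(rng.dirichlet(np.ones(5)), thr))
```

Treating `thr` as an action chosen by an RL policy, rather than a fixed quantile, is the step the paper takes to adapt the abstention behavior to task difficulty.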
Authors:Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Xinbing Liang, Fengwei Teng, Jinhao Tu, Fashen Ren, Xiangru Tang, Sirui Hong, Chenglin Wu, Yuyu Luo
Abstract:
Well-designed prompts are crucial for enhancing large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or human feedback, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external references. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
Chinese Summary: 本文提出自监督提示优化框架(SPO),通过利用纯输出比较而无需外部参考,自主发现适用于封闭式和开放式任务的有效提示,以极低成本实现了优于现有方法的性能。
English Summary: The paper introduces Self-Supervised Prompt Optimization (SPO), a cost-effective framework that autonomously discovers effective prompts for both closed and open-ended tasks by leveraging pairwise output comparisons without requiring external references, achieving superior performance at significantly reduced costs.
Authors:Muhammed Öz, Nicholas Kiefer, Charlotte Debus, Jasmin Hörter, Achim Streit, Markus Götz
Abstract:
Ensemble learning is a widespread technique to improve the prediction performance of neural networks. However, it comes at the price of increased memory and inference time. In this work we propose a novel model fusion technique called \emph{Neuron Transplantation (NT)} in which we fuse an ensemble of models by transplanting important neurons from all ensemble members into the vacant space obtained by pruning insignificant neurons. An initial loss in performance after transplantation is quickly recovered via fine-tuning, after which the fused model consistently outperforms individual ensemble members of the same model capacity and architecture. Furthermore, NT enables all the ensemble members to be jointly pruned and jointly trained in a combined model. Compared to alignment-based averaging (such as Optimal Transport fusion), NT requires less fine-tuning than the corresponding OT-fused model, the fusion itself is faster and requires less memory, and the resulting model performance is comparable or better. The code is available under the following link: https://github.com/masterbaer/neuron-transplantation.
Chinese: 本研究提出了一种名为“神经元移植”的模型融合技术,通过将集成模型中重要神经元移植到修剪后的空缺位置,在减少内存占用和加速推理的同时,获得了与传统方法相当或更优的性能。
English: The study introduces Neuron Transplantation, a model fusion technique that integrates key neurons from ensemble members into a pruned model, achieving comparable or superior performance with less memory and faster inference than traditional methods.
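A toy sketch of the transplantation step between two MLP ensemble members with the same architecture: rank hidden units by a weight-norm importance score, prune the least important units of one member, and copy in the most important units from the other (both incoming and outgoing weights). The importance measure and transplant fraction are assumptions for illustration, not the paper's exact criteria.

```python
import torch
import torch.nn as nn

def make_mlp():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

@torch.no_grad()
def transplant(model_a, model_b, frac=0.25):
    """Replace the least important hidden neurons of model_a with the most
    important ones from model_b. Importance = norm of incoming + outgoing weights."""
    fc1_a, fc2_a = model_a[0], model_a[2]
    fc1_b, fc2_b = model_b[0], model_b[2]
    k = int(frac * fc1_a.out_features)

    def importance(fc1, fc2):
        return fc1.weight.norm(dim=1) + fc2.weight.norm(dim=0)

    weak_a = torch.argsort(importance(fc1_a, fc2_a))[:k]       # neurons to prune
    strong_b = torch.argsort(importance(fc1_b, fc2_b))[-k:]    # neurons to transplant

    fc1_a.weight[weak_a] = fc1_b.weight[strong_b]              # incoming weights
    fc1_a.bias[weak_a] = fc1_b.bias[strong_b]
    fc2_a.weight[:, weak_a] = fc2_b.weight[:, strong_b]        # outgoing weights
    return model_a  # fine-tune afterwards to recover the initial performance drop

fused = transplant(make_mlp(), make_mlp())
print(fused(torch.randn(3, 10)).shape)
```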
Authors:Xu Zhang, Kaidi Xu, Ziqing Hu, Ren Wang
Abstract:
Mixture of Experts (MoE) models have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy JTDMoE for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods. The code is publicly available at https://github.com/TIML-Group/Robust-MoE-Dual-Model.
Chinese: 本文针对专家混合模型提出了一种鲁棒训练技术和双模型策略,在保持高精度的同时增强了对抗鲁棒性,并在CIFAR-10和TinyImageNet数据集上进行了实验验证。
English: This paper introduces a robust training technique and a dual-model strategy for Mixture of Experts (MoE) to enhance adversarial robustness while maintaining high accuracy, with experimental validation on CIFAR-10 and TinyImageNet datasets.
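The dual-model idea above is a simple linear blend at the output level, sketched below: a smoothing parameter interpolates between a standard model's and a robustified model's predictions. The models are placeholders and blending at the logit level is an assumption for illustration.

```python
import torch
import torch.nn as nn

class DualModel(nn.Module):
    """Linear combination of a standard and a robustified model's outputs.

    alpha = 0 gives the standard model (best clean accuracy), alpha = 1 gives
    the robust model; intermediate values trade the two objectives off.
    """
    def __init__(self, standard: nn.Module, robust: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.standard, self.robust, self.alpha = standard, robust, alpha

    def forward(self, x):
        return (1 - self.alpha) * self.standard(x) + self.alpha * self.robust(x)

standard = nn.Linear(16, 10)   # placeholders for the standard / robustified MoE models
robust = nn.Linear(16, 10)
logits = DualModel(standard, robust, alpha=0.3)(torch.randn(4, 16))
print(logits.shape)
```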
Authors:Xingye Chen, Wei Feng, Zhenbang Du, Weizhen Wang, Yanyin Chen, Haohan Wang, Linkai Liu, Yaoyu Li, Jinyuan Zhao, Yu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, Jingping Shao, Yuanjie Shao, Xinge You, Changxin Gao, Nong Sang
Abstract:
In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods for generating product backgrounds focus primarily on aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. Firstly, we build targeted pre-training tasks, and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which can jointly utilize multimodal features and accurately reflect user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments have demonstrated that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.
中文: 本研究提出了一种利用多模态大语言模型生成广告图像的新方法,通过针对性预训练、结合奖励模型的强化学习以及以产品为中心的优化策略,显著提升了点击率,在线上线下指标中均实现了最优性能。
English: This study introduces a novel approach using Multimodal Large Language Models (MLLMs) to generate advertising images optimized for Click-Through Rate (CTR), employing targeted pre-training, reinforcement learning with a reward model, and product-centric optimization to achieve state-of-the-art performance in both online and offline metrics.
Authors:Siyeol Jung, Taehwan Kim
Abstract:
The listener head generation (LHG) task aims to generate natural nonverbal listener responses based on the speaker's multimodal cues. Prior work either relies on limited modalities (e.g., audio and facial information) or employs autoregressive approaches, which suffer from limitations such as accumulated prediction errors. To address these limitations, we propose DiffListener, a discrete diffusion based approach for non-autoregressive listener head generation. Our model takes the speaker's facial information, audio, and text as inputs, additionally incorporating facial differential information to represent the temporal dynamics of expressions and movements. With this explicit modeling of facial dynamics, DiffListener can generate coherent reaction sequences in a non-autoregressive manner. Through comprehensive experiments, DiffListener demonstrates state-of-the-art performance in both quantitative and qualitative evaluations. The user study shows that DiffListener generates natural context-aware listener reactions that are well synchronized with the speaker. The code and demo videos are available in https://siyeoljung.github.io/DiffListener
Authors:Peng Huang, Shu Hu, Bo Peng, Xun Gong, Penghang Yin, Hongtu Zhu, Xi Wu, Xin Wang
Abstract:
MedSAM, a medical foundation model derived from the SAM architecture, has demonstrated notable success across diverse medical domains. However, its clinical application faces two major challenges: the dependency on labor-intensive manual prompt generation, which imposes a significant burden on clinicians, and the absence of semantic labeling in the generated segmentation masks for organs or lesions, limiting its practicality for non-expert users. To address these limitations, we propose AutoMedSAM, an end-to-end framework derived from SAM, designed to enhance usability and segmentation performance. AutoMedSAM retains MedSAM's image encoder and mask decoder structure while introducing a novel diffusion-based class prompt encoder. The diffusion-based encoder employs a dual-decoder structure to collaboratively generate prompt embeddings guided by sparse and dense prompt definitions. These embeddings enhance the model's ability to understand and process clinical imagery autonomously. With this encoder, AutoMedSAM leverages class prompts to embed semantic information into the model's predictions, transforming MedSAM's semi-automated pipeline into a fully automated workflow. Furthermore, AutoMedSAM employs an uncertainty-aware joint optimization strategy during training to effectively inherit MedSAM's pre-trained knowledge while improving generalization by integrating multiple loss functions. Experimental results across diverse datasets demonstrate that AutoMedSAM achieves superior performance while broadening its applicability to both clinical settings and non-expert users. Code is available at https://github.com/HP-ML/AutoPromptMedSAM.git.
中文摘要:AutoMedSAM是一种自动化框架,通过消除手动提示并为分割掩码添加语义标签,增强了MedSAM的性能和可用性,使其更适用于临床场景和非专业用户。
English Summary: AutoMedSAM is an automated framework that enhances MedSAM by eliminating manual prompts and adding semantic labels to segmentation masks, improving both performance and usability for clinical and non-expert applications.
Authors:Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
Abstract:
We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.
Authors:Hui Shen, Jingxuan Zhang, Boning Xiong, Rui Hu, Shoufa Chen, Zhongwei Wan, Xin Wang, Yu Zhang, Zixuan Gong, Guangyin Bao, Chaofan Tao, Yongfeng Huang, Ye Yuan, Mi Zhang
Abstract:
Diffusion models have emerged as powerful generative models capable of producing high-quality contents such as images, videos, and audio, demonstrating their potential to revolutionize digital content creation. However, these capabilities come at the cost of their significant computational resources and lengthy generation time, underscoring the critical need to develop efficient techniques for practical deployment. In this survey, we provide a systematic and comprehensive review of research on efficient diffusion models. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient diffusion model topics from algorithm-level, system-level, and framework perspective, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-Diffusion-Model-Survey. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient diffusion model research and inspire them to contribute to this important and exciting field.
Chinese: 扩散模型作为能生成高质量内容的强大工具,虽计算资源需求高且生成耗时,但本综述系统梳理了从算法、系统和框架角度提升其效率的研究,旨在推动实际应用并为该领域提供参考。
English: Diffusion models are powerful generative tools for creating high-quality content but require significant computational resources, prompting a systematic survey to review efficient techniques from algorithm, system, and framework perspectives to aid practical deployment.
Authors:Tianlang Chen, Charilaos Kanatsoulis, Jure Leskovec
Abstract:
Predictive tasks on relational databases are critical in real-world applications spanning e-commerce, healthcare, and social media. To address these tasks effectively, Relational Deep Learning (RDL) encodes relational data as graphs, enabling Graph Neural Networks (GNNs) to exploit relational structures for improved predictions. However, existing RDL methods often overlook the intrinsic structural properties of the graphs built from relational databases, leading to modeling inefficiencies, particularly in handling many-to-many relationships. Here we introduce RelGNN, a novel GNN framework specifically designed to leverage the unique structural characteristics of the graphs built from relational databases. At the core of our approach is the introduction of atomic routes, which are simple paths that enable direct single-hop interactions between the source and destination nodes. Building upon these atomic routes, RelGNN designs new composite message passing and graph attention mechanisms that reduce redundancy, highlight key signals, and enhance predictive accuracy. RelGNN is evaluated on 30 diverse real-world tasks from Relbench (Fey et al., 2024), and achieves state-of-the-art performance on the vast majority of tasks, with improvements of up to 25%. Code is available at https://github.com/snap-stanford/RelGNN.
中文: RelGNN通过引入原子路径和复合消息传递机制,优化了图神经网络在关系数据库中的结构利用,在多数实际任务中实现了最高性能,提升幅度高达25%。
English: RelGNN introduces atomic routes and composite message passing to enhance GNNs for relational databases, achieving state-of-the-art performance with up to 25% improvement on real-world tasks.
Authors:Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
Abstract:
Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain undisclosed, and the only techniques widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through \textbf{O}utcome \textbf{RE}w\textbf{A}rd-based reinforcement \textbf{L}earning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure the gradient consistency between positive and negative samples. To alleviate the long-existing difficulties brought by sparse rewards in RL, which are even exacerbated by the partial correctness of the long chain of thought for reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, being on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research\footnote{https://github.com/InternLM/OREAL}.
中文:本文提出了OREAL这一新型强化学习框架,利用二元结果奖励显著提升语言模型的数学推理能力,使小规模模型首次达到与大型模型相媲美的准确率。
English: This paper introduces OREAL, a novel reinforcement learning framework that uses binary outcome rewards to significantly enhance mathematical reasoning in language models, achieving state-of-the-art accuracy with smaller model sizes.
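A minimal sketch (not the authors' implementation) of the best-of-N filtering step that the abstract describes: sample N candidate solutions, keep those that receive a positive binary outcome reward, and use them as supervised (behavior-cloning) data. The `sample_fn` and `reward_fn` callables below are hypothetical stand-ins for a policy's generation routine and an answer verifier.

```python
import random
from typing import Callable, List, Tuple

def best_of_n_filter(
    prompts: List[str],
    sample_fn: Callable[[str], str],       # hypothetical: draws one solution from the policy
    reward_fn: Callable[[str, str], int],  # hypothetical: binary outcome reward (1 correct, 0 wrong)
    n: int = 16,
) -> List[Tuple[str, str]]:
    """Collect (prompt, solution) pairs with positive outcome reward.

    Behavior cloning on these pairs with plain cross-entropy is the
    positive-sample part of the training scheme sketched in the abstract.
    """
    sft_pairs = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n)]
        positives = [c for c in candidates if reward_fn(prompt, c) == 1]
        if positives:
            # keep one verified solution per prompt for supervised fine-tuning
            sft_pairs.append((prompt, random.choice(positives)))
    return sft_pairs

# toy usage with stand-in functions
if __name__ == "__main__":
    toy_sample = lambda p: f"{p} -> answer {random.randint(0, 3)}"
    toy_reward = lambda p, s: int(s.endswith("answer 0"))
    print(best_of_n_filter(["problem A", "problem B"], toy_sample, toy_reward, n=8))
```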
Authors:Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
Abstract:
We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand-new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing more explainable reasoning structures than DeepSeek-R1 and o3-mini, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux
中文: ReasonFlux-32B模型通过分层推理和可扩展思维模板,在MATH基准测试中达到91.2%准确率,在AIME中解题率达56.7%,显著超越了OpenAI o1-preview和DeepSeek V3等先进模型。
English: The ReasonFlux-32B model introduces hierarchical reasoning with scalable thought templates, achieving state-of-the-art math performance by surpassing leading models like OpenAI o1-preview and DeepSeek V3 on benchmarks including MATH (91.2% accuracy) and AIME (56.7% problem-solving rate).
Authors:Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu, Shiqiang Wang, Hans-Arno Jacobsen, Yingbin Liang
Abstract:
Pretraining large language models (LLMs) on vast and heterogeneous datasets is crucial for achieving state-of-the-art performance across diverse downstream tasks. However, current training paradigms treat all samples equally, overlooking the importance or relevance of individual samples throughout the training process. Existing reweighting strategies, which primarily focus on group-level data importance, fail to leverage fine-grained instance-level information and do not adapt dynamically to individual sample importance as training progresses. In this paper, we introduce novel algorithms for dynamic, instance-level data reweighting aimed at improving both the efficiency and effectiveness of LLM pretraining. Our methods adjust the weight of each training sample based on its loss value in an online fashion, allowing the model to dynamically focus on more informative or important samples at the current training stage. In particular, our framework allows us to systematically devise reweighting strategies deprioritizing redundant or uninformative data, which we find tend to work best. Furthermore, we develop a new theoretical framework for analyzing the impact of loss-based reweighting on the convergence of gradient-based optimization, providing the first formal characterization of how these strategies affect convergence bounds. We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance.
中文: 本文提出基于损失值的动态实例级数据重加权算法,通过在线调整样本权重使模型聚焦于关键训练数据,理论分析和多任务实验表明该方法能加速收敛并显著提升性能。
English: This paper introduces dynamic, instance-level data reweighting algorithms that adjust sample weights based on loss values to enhance LLM pretraining efficiency and effectiveness, supported by theoretical analysis and empirical validation across various tasks.
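One way the online, loss-based reweighting idea can be realized inside a single training step is sketched below, assuming PyTorch and a HuggingFace-style causal LM that returns `.logits`. The concrete weighting rule (a softmax over per-sample losses with a temperature) is an illustrative choice that deprioritizes low-loss, likely redundant samples, not necessarily the paper's exact strategy.

```python
import torch
import torch.nn.functional as F

def reweighted_lm_step(model, optimizer, input_ids, labels, temperature=1.0):
    """One training step whose per-sample weights depend on the current loss.

    Higher-loss (more informative) samples receive larger weights via a softmax,
    which deprioritizes redundant data; the paper's rule may differ.
    """
    logits = model(input_ids).logits  # (B, T, V); assumes a HF-style causal LM interface
    per_token = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        reduction="none",
    ).view(labels.size(0), -1)
    per_sample = per_token.mean(dim=1)                                  # (B,) loss per sample
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)   # online instance weights
    loss = (weights * per_sample).sum()                                 # weighted objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```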
Authors:Bessie Dominguez-Dager, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
Abstract:
Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across cameras, locations, and time. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust systems that handle long-term variations caused by clothing and physical changes. We present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset designed for video-based long-term person Re-ID. CHIRLA was recorded over seven months in four connected indoor environments using seven strategically placed cameras, capturing realistic movements with substantial clothing and appearance variability. The dataset includes 22 individuals, more than five hours of video, and about 1M bounding boxes with identity annotations obtained through semi-automatic labeling. We also define benchmark protocols for person tracking and Re-ID, covering diverse and challenging scenarios such as occlusion, reappearance, and multi-camera conditions. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios. The benchmark code is publicly available at: https://github.com/bdager/CHIRLA.
中文摘要:CHIRLA数据集通过提供七个月的多摄像头视频数据,包含显著的服装变化,解决了长期行人重识别难题,并建立了基准测试以提升实际应用中的Re-ID算法性能。
English Summary: The CHIRLA dataset addresses long-term person re-identification challenges by providing seven months of multi-camera video data with significant clothing variations, establishing benchmarks to improve Re-ID algorithms for real-world applications.
Authors:Xingrun Xing, Zheng Liu, Shitao Xiao, Boyan Gao, Yiming Liang, Wanpeng Zhang, Haokun Lin, Guoqi Li, Jiajun Zhang
Abstract:
Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Recently, increasing concerns about cloud costs, latency, and privacy have made the development of compact edge language models an urgent requirement. Unlike direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models. It features the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in the LLM and continuously optimize structural pruning, extending post-training pruning methods like LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed via saliency-driven pruning and, for the first time, exceeds SoTA human-designed LLMs in modern pretraining. We show that this approach yields top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary. EfficientLLM significantly outperforms SoTA baselines with $100M \sim 1B$ parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common-sense benchmarks. As the first attempt, EfficientLLM bridges the performance gap between traditional LLM compression and direct pretraining methods, and we will fully open-source it at https://github.com/Xingrun-Xing2/EfficientLLM.
Chinese: 本研究提出剪枝感知预训练方法,开发出高效边缘语言模型EfficientLLM,通过在预训练阶段融入结构化剪枝技术,显著超越现有最优基准模型,成功弥合了传统压缩方法与直接预训练之间的性能鸿沟。
English: This work introduces pruning-aware pretraining to develop EfficientLLM, a compact edge language model that surpasses state-of-the-art baselines by integrating structural pruning during pretraining, bridging the performance gap between traditional compression and direct pretraining methods.
Authors:Soobin Um, Beomsu Kim, Jong Chul Ye
Abstract:
Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation, creative content generation, etc. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated for minority generation. To address this, here we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly-trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations. Code is available at https://github.com/soobin-um/BnS.
中文摘要:本文提出的Boost-and-Skip方法通过方差增强初始化和时间步跳跃这两个简单修改,无需引导机制即可有效生成数据流形中的少数样本,在保持与先进引导方法相当性能的同时大幅降低了计算开销。
English Summary: The paper introduces Boost-and-Skip, a computationally efficient guidance-free method that enhances minority sample generation in diffusion models through variance-boosted initialization and timestep skipping, achieving performance comparable to state-of-the-art guidance-based approaches with significantly reduced computational costs.
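The two modifications named in the abstract are concrete enough to sketch: start the reverse diffusion from a variance-boosted Gaussian and skip the earliest (noisiest) timesteps. The sampler interface below is a simplified stand-in rather than the authors' code, and the `boost`/`skip` values are illustrative.

```python
import torch

@torch.no_grad()
def boost_and_skip_sample(denoise_step, shape, num_steps=1000,
                          boost=1.25, skip=100, device="cpu"):
    """Guidance-free minority sampling sketch: (i) variance-boosted init,
    (ii) timestep skipping. `denoise_step(x, t)` is a stand-in for one reverse
    diffusion update of any pretrained model.
    """
    # (i) initialize with inflated variance instead of a standard Gaussian
    x = boost * torch.randn(shape, device=device)
    # (ii) skip the earliest (most noisy) reverse steps
    for t in reversed(range(num_steps - skip)):
        x = denoise_step(x, t)
    return x
```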
Authors:Filip Ekström Kelvinius, Oskar B. Andersson, Abhijith S. Parackal, Dong Qian, Rickard Armiento, Fredrik Lindsten
Abstract:
Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation. As a proof-of-concept study, we use WyckoffDiff to find new materials below the convex hull of thermodynamical stability.
Chinese: WyckoffDiff提出了一种基于对称性的晶体生成模型,通过离散框架确保结构对称性并实现快速生成,同时引入新的对称性评估指标,并验证了其在发现热力学稳定新材料方面的潜力。
English: WyckoffDiff introduces a symmetry-aware generative model for crystals that uses a discrete framework to ensure structural symmetry and enable rapid generation, while also proposing a new metric for evaluating symmetry in generated materials and demonstrating its potential in discovering thermodynamically stable compounds.
Authors:Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang
Abstract:
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
中文: 本文提出了一个公平评估数据集蒸馏与剪枝方法的基准,发现随机子集在主流设定中可与依赖软标签的复杂方法相媲美,并提出了仅使用硬标签的新框架PCA,通过聚焦图像数据实现了最优性能。
English: This paper introduces a benchmark that fairly compares dataset distillation and pruning methods, revealing that random subsets can match the performance of complex soft-label approaches, and proposes a new hard-label-only framework called PCA that achieves state-of-the-art results by focusing on image data.
Authors:Qian Chen, Xingjian Dong, Kui Hu, Kangkang Chen, Zhike Peng, Guang Meng
Abstract:
Neural networks (NNs), with their powerful nonlinear mapping and end-to-end capabilities, are widely applied in mechanical intelligent fault diagnosis (IFD). However, as typical black-box models, they pose challenges in understanding their decision basis and logic, limiting their deployment in high-reliability scenarios. Hence, various methods have been proposed to enhance the interpretability of IFD. Among these, post-hoc approaches can provide explanations without changing model architecture, preserving its flexibility and scalability. However, existing post-hoc methods often suffer from limitations in explanation forms. They either require preprocessing that disrupts the end-to-end nature or overlook fault mechanisms, leading to suboptimal explanations. To address these issues, we derived the cyclic-spectral (CS) transform and proposed the CS-SHAP by extending Shapley additive explanations (SHAP) to the CS domain. CS-SHAP can evaluate contributions from both carrier and modulation frequencies, aligning more closely with fault mechanisms and delivering clearer and more accurate explanations. Three datasets are utilized to validate the superior interpretability of CS-SHAP, ensuring its correctness, reproducibility, and practical performance. With open-source code and outstanding interpretability, CS-SHAP has the potential to be widely adopted and become the post-hoc interpretability benchmark in IFD, even in other classification tasks. The code is available on https://github.com/ChenQian0618/CS-SHAP.
中文: 针对神经网络在机械智能故障诊断中可解释性不足的问题,本研究提出CS-SHAP方法,通过将SHAP扩展至循环谱域实现与故障机理契合的清晰解释,经三个数据集验证具备卓越性能。
English: Neural networks are widely used in mechanical intelligent fault diagnosis but face interpretability challenges, which the proposed CS-SHAP method addresses by extending SHAP to the cyclic-spectral domain for clearer, mechanism-aligned explanations validated across three datasets.
Authors:Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
Abstract:
Outliers have been widely observed in Large Language Models (LLMs), significantly impacting model performance and posing challenges for model compression. Understanding the functionality and formation mechanisms of these outliers is critically important. Existing works, however, largely focus on reducing the impact of outliers from an algorithmic perspective, lacking an in-depth investigation into their causes and roles. In this work, we provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs. We define and categorize three types of outliers (activation outliers, weight outliers, and attention outliers) and analyze their distributions across different dimensions, uncovering inherent connections between their occurrences and their ultimate influence on the attention mechanism. Based on these observations, we hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism's softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. As these outliers stem from systematic influences, we term them systematic outliers. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is available at https://github.com/an-yongqi/systematic-outliers.
中文摘要:本研究分析了大型语言模型中异常值的形成机制、成因及功能,揭示其源于自注意力机制的softmax操作并作为隐式缩放因子发挥作用,实验表明结构性地消除这些异常值可加速模型收敛并提升压缩效果。
English Summary: This study analyzes the formation, causes, and functions of outliers in Large Language Models, revealing they emerge from the softmax operation in self-attention and serve as implicit scaling factors, with their structural elimination shown to accelerate convergence and improve model compression.
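A small sketch of how activation outliers are commonly flagged in practice: entries whose magnitude is far larger than the tensor's overall scale. The k-times-standard-deviation threshold is an assumption for illustration; the paper's exact criterion may differ.

```python
import torch

def find_activation_outliers(activations: torch.Tensor, k: float = 6.0):
    """Flag entries whose magnitude exceeds k times the tensor's std.

    This mirrors the common working definition of activation outliers in LLM
    quantization work; it is not necessarily the paper's exact criterion.
    """
    scale = activations.float().std()
    mask = activations.abs() > k * scale
    rows, cols = mask.nonzero(as_tuple=True)  # (token, hidden-dim) positions for a 2D map
    return mask, list(zip(rows.tolist(), cols.tolist()))

# toy usage: a 2D activation map with two injected outliers
acts = torch.randn(8, 16)
acts[3, 5] = 40.0
acts[6, 11] = -35.0
_, positions = find_activation_outliers(acts)
print(positions)  # -> [(3, 5), (6, 11)]
```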
Authors:Yiru Jiao, Sander van Cranenburgh, Simeon Calvert, Hans van Lint
Abstract:
The effectiveness of neural network models largely relies on learning meaningful latent patterns from data, where self-supervised learning of informative representations can enhance model performance and generalisability. However, self-supervised representation learning for spatially characterised time series, which are ubiquitous in the transportation domain, poses unique challenges due to the necessity of maintaining fine-grained spatio-temporal similarities in the latent space. In this study, we introduce two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance the contrastive learning objective and the need for structure preservation, we propose a dynamic weighting mechanism that adaptively manages this trade-off and stabilises training. We validate the proposed method through extensive experiments, including multivariate time series classification to demonstrate its general applicability, as well as macroscopic and microscopic traffic prediction to highlight its particular usefulness in encoding traffic interactions. Across all tasks, our method preserves the similarity structures more effectively and improves on state-of-the-art task performance. This method can be integrated with an arbitrary neural network model and is particularly beneficial for time series data with spatial or geographical features. Furthermore, our findings suggest that well-preserved similarity structures in the latent space indicate more informative and useful representations. This provides insights to design more effective neural networks for data-driven transportation research. Our code is made openly accessible with all resulting data at https://github.com/yiru-jiao/spclt
中文: 本研究针对空间时间序列提出了两个结构保持正则化器和一个动态权重机制,通过对比学习有效保持时空相似性结构,在多项任务中提升了模型性能,尤其适用于交通领域的预测应用。
English: This study introduces two structure-preserving regularizers and a dynamic weighting mechanism for contrastive learning of spatial time series, which effectively maintains spatio-temporal similarities and enhances model performance across various tasks, particularly in transportation applications.
Authors:Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
Abstract:
A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.
中文摘要:本文针对线性高斯逆问题提出了一种基于解耦扩散的序列蒙特卡洛方法,该方法允许更大的样本更新且具有渐近精确性,并在合成数据、蛋白质和图像数据上验证了有效性,同时展示了向离散数据的扩展能力。
English Summary: This paper introduces a sequential Monte Carlo method for linear-Gaussian inverse problems using decoupled diffusion, which enables larger sample updates and is proven asymptotically exact, with validation on synthetic, protein, and image data, plus an extension to discrete data.
Authors:Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei
Abstract:
Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration. Code is available at https://github.com/sandylaker/ib-edl.
中文: 针对微调后大语言模型常出现的过度自信和校准不佳问题,本文提出IB-EDL方法,通过信息瓶颈正则化证据深度学习,有效提升不确定性估计能力并增强模型可信度。
English: Fine-tuned large language models often suffer from overconfidence and poor calibration, which is addressed by the proposed IB-EDL method that regularizes Evidential Deep Learning with an information bottleneck to enhance uncertainty estimation and model trustworthiness.
Authors:Qi Wang, Tianfei Zhou, Ye Yuan, Rui Mao
Abstract:
Continual Graph Learning (CGL), which aims to accommodate new tasks over evolving graph data without forgetting prior knowledge, is garnering significant research interest. Mainstream solutions adopt the memory replay-based idea, i.e., caching representative data from earlier tasks for retraining the graph model. However, this strategy struggles with scalability issues for constantly evolving graphs and raises concerns regarding data privacy. Inspired by recent advancements in the prompt-based learning paradigm, this paper introduces a novel prompt-driven continual graph learning (PROMPTCGL) framework, which learns a separate prompt for each incoming task and maintains the underlying graph neural network model fixed. In this way, PROMPTCGL naturally avoids catastrophic forgetting of knowledge from previous tasks. More specifically, we propose hierarchical prompting to instruct the model from both feature- and topology-level to fully address the variability of task graphs in dynamic continual learning. Additionally, we develop a personalized prompt generator to generate tailored prompts for each graph node while minimizing the number of prompts needed, leading to constant memory consumption regardless of the graph scale. Extensive experiments on four benchmarks show that PROMPTCGL achieves superior performance against existing CGL approaches while significantly reducing memory consumption. Our code is available at https://github.com/QiWang98/PromptCGL.
中文:本文提出了PROMPTCGL框架,通过分层提示和个性化提示生成器实现持续图学习,既能防止灾难性遗忘又保持恒定内存消耗,在性能与效率上均优于现有方法。
English: This paper introduces PROMPTCGL, a prompt-driven framework for continual graph learning that uses hierarchical prompting and a personalized prompt generator to prevent catastrophic forgetting while maintaining constant memory consumption, demonstrating superior performance and efficiency over existing methods.
Authors:Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
Abstract:
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with contrastive mechanism in features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our codes are available at https://github.com/haiduo/Jakiro.
中文: Jakiro通过采用专家混合模型生成多样化的令牌预测并结合混合推理策略,显著提高了推测解码的准确性和推理速度,在不同模型上均验证了其有效性。
English: Jakiro enhances speculative decoding by employing Mixture of Experts to generate diverse token predictions and a hybrid inference strategy, significantly improving both accuracy and inference speed across various models.
Authors:Zhixun Li, Dingshuo Chen, Tong Zhao, Daixin Wang, Hongrui Liu, Zhiqiang Zhang, Jun Zhou, Jeffrey Xu Yu
Abstract:
Graph Neural Networks (GNNs) have achieved great success in dealing with non-Euclidean graph-structured data and have been widely deployed in many real-world applications. However, their effectiveness is often jeopardized under class-imbalanced training sets. Most existing studies have analyzed class-imbalanced node classification from a supervised learning perspective, but they do not fully utilize the large number of unlabeled nodes in semi-supervised scenarios. We claim that the supervised signal is just the tip of the iceberg and a large number of unlabeled nodes have not yet been effectively utilized. In this work, we propose IceBerg, a debiased self-training framework to address the class-imbalanced and few-shot challenges for GNNs at the same time. Specifically, to figure out the Matthew effect and label distribution shift in self-training, we propose Double Balancing, which can largely improve the performance of existing baselines with just a few lines of code as a simple plug-and-play module. Secondly, to enhance the long-range propagation capability of GNNs, we disentangle the propagation and transformation operations of GNNs. Therefore, the weak supervision signals can propagate more effectively to address the few-shot issue. In summary, we find that leveraging unlabeled nodes can significantly enhance the performance of GNNs in class-imbalanced and few-shot scenarios, and even small, surgical modifications can lead to substantial performance improvements. Systematic experiments on benchmark datasets show that our method can deliver considerable performance gain over existing class-imbalanced node classification baselines. Additionally, due to IceBerg's outstanding ability to leverage unsupervised signals, it also achieves state-of-the-art results in few-shot node classification scenarios. The code of IceBerg is available at: https://github.com/ZhixunLEE/IceBerg.
中文: 本文提出IceBerg框架,通过利用未标记节点和双重平衡技术,有效提升图神经网络在类别不平衡和少样本场景下的性能表现。
English: The paper introduces IceBerg, a debiased self-training framework that enhances Graph Neural Networks' performance in class-imbalanced and few-shot scenarios by effectively leveraging unlabeled nodes and implementing double balancing techniques.
Authors:Zhaoying Wang, Yingdan Shi, Xiang Liu, Can Chen, Jun Wen, Ren Wang
Abstract:
Drug-side effect research is vital for understanding adverse reactions arising in complex multi-drug therapies. However, the scarcity of higher-order datasets that capture the combinatorial effects of multiple drugs severely limits progress in this field. Existing resources such as TWOSIDES primarily focus on pairwise interactions. To fill this critical gap, we introduce HODDI, the first Higher-Order Drug-Drug Interaction Dataset, constructed from U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) records spanning the past decade, to advance computational pharmacovigilance. HODDI contains 109,744 records involving 2,506 unique drugs and 4,569 unique side effects, specifically curated to capture multi-drug interactions and their collective impact on adverse effects. Comprehensive statistical analyses demonstrate HODDI's extensive coverage and robust analytical metrics, making it a valuable resource for studying higher-order drug relationships. Evaluating HODDI with multiple models, we found that simple Multi-Layer Perceptron (MLP) can outperform graph models, while hypergraph models demonstrate superior performance in capturing complex multi-drug interactions, further validating HODDI's effectiveness. Our findings highlight the inherent value of higher-order information in drug-side effect prediction and position HODDI as a benchmark dataset for advancing research in pharmacovigilance, drug safety, and personalized medicine. The dataset and codes are available at https://github.com/TIML-Group/HODDI.
中文: HODDI是首个高阶药物相互作用数据集,基于FDA记录构建,填补了多药治疗副作用数据的空白,评估显示超图模型在捕捉复杂相互作用方面表现优异,成为药物警戒研究的基准资源。
English: HODDI is the first higher-order drug-drug interaction dataset, created from FDA records to address the scarcity of data on multi-drug therapy side effects, demonstrating through evaluations that hypergraph models excel in capturing complex interactions and establishing it as a benchmark for pharmacovigilance research.
Authors:Zhichen Dong, Zhanhui Zhou, Zhixuan Liu, Chao Yang, Chaochao Lu
Abstract:
In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: $\textbf{their hidden representations encode future outputs beyond the next token}$. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including $\textit{structure attributes}$ (e.g., response length, reasoning steps), $\textit{content attributes}$ (e.g., character choices in storywriting, multiple-choice answers at the end of response), and $\textit{behavior attributes}$ (e.g., answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The findings that LLMs plan ahead for the future in their hidden representations suggest potential applications for improving transparency and generation control.
中文: 本研究发现大型语言模型通过隐藏表征编码未来输出的结构、内容和行为等全局属性,展现出涌现的规划能力,这种能力随模型规模扩展并在生成过程中演变,为提升透明度和控制力提供了可能。
English: This study reveals that large language models exhibit emergent planning capabilities by encoding future response attributes—such as structure, content, and behavior—in their hidden representations, which scales with model size and evolves during generation, offering potential for enhanced transparency and control.
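The probing setup the abstract describes reduces to fitting a simple classifier on prompt representations: if it predicts a global attribute of the future response above chance, that attribute is already encoded. The sketch below uses a scikit-learn logistic-regression probe on pre-extracted hidden states; the array contents are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: hidden_states[i] is the last-token prompt representation
# from a frozen LLM, and attribute[i] is a global property of the response that
# was later generated (e.g. "long vs. short answer").
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 256))          # stand-in for real hidden states
attribute = hidden_states[:, :8].sum(axis=1) > 0     # stand-in binary attribute

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, attribute, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Above-chance held-out accuracy is the evidence that the prompt representation
# already encodes the future-response attribute.
print("probe accuracy:", probe.score(X_te, y_te))
```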
Authors:Yeho Gwon, Sehyun Hwang, Hoyoung Kim, Jungseul Ok, Suha Kwak
Abstract:
This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64. The project page can be found at https://yehogwon.github.io/csq-al.
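A hedged sketch of how conformal prediction can produce the small candidate class sets described above, using standard split conformal prediction with the usual 1-minus-true-class-probability score; the coverage level and score function are textbook choices, not necessarily the paper's.

```python
import numpy as np

def conformal_candidate_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for candidate-set queries.

    cal_probs:  (n_cal, C) softmax outputs on a labeled calibration split
    cal_labels: (n_cal,)   true classes for the calibration split
    test_probs: (n_test, C) softmax outputs for points to be labeled
    Returns, for each test point, the candidate classes shown to the oracle.
    """
    n = len(cal_labels)
    # nonconformity score: 1 - probability assigned to the true class
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # include every class whose score does not exceed the calibrated threshold
    return [np.where(1.0 - p <= qhat)[0].tolist() for p in test_probs]

# toy usage with random softmax vectors
rng = np.random.default_rng(0)
cal = rng.dirichlet(np.ones(10), size=200)
test = rng.dirichlet(np.ones(10), size=3)
print(conformal_candidate_sets(cal, rng.integers(0, 10, 200), test))
```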
Authors:Guanglong Sun, Hongwei Yan, Liyuan Wang, Qian Li, Bo Lei, Yi Zhong
Abstract:
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named spacing effect in biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively). Our codes have been released on github https://github.com/SunGL001/Spaced-KD.
Chinese: 作者提出Spaced KD策略,受生物学习中的间隔效应启发,通过让学生模型从提前训练的教师模型中提取知识,有效提升了在线和自知识蒸馏的性能,实现了更平坦的损失景观,在Tiny-ImageNet上分别获得最高2.31%和3.34%的性能提升。
English: The authors propose Spaced KD, a strategy inspired by the spacing effect in biological learning that enhances online and self-knowledge distillation by having the student learn from a teacher trained with a time interval, leading to improved generalization through a flatter loss landscape and performance gains of up to 2.31% and 3.34% on Tiny-ImageNet.
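One minimal reading of the spacing idea is sketched below: the teacher is a copy of the same architecture kept a fixed number of optimization steps ahead of the student, and the student adds a KL distillation term against the teacher's logits. The loss weights, temperature, and scheduling here are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def spaced_kd_train(model_fn, loader, steps=1000, interval=50,
                    lr=1e-3, temperature=2.0, alpha=0.5):
    """Online KD where the teacher stays `interval` optimization steps ahead
    of the student (a minimal sketch of spaced distillation)."""
    student, teacher = model_fn(), model_fn()
    opt_s = torch.optim.SGD(student.parameters(), lr=lr)
    opt_t = torch.optim.SGD(teacher.parameters(), lr=lr)
    it = iter(loader)

    def next_batch():
        nonlocal it
        try:
            return next(it)
        except StopIteration:
            it = iter(loader)
            return next(it)

    # give the teacher a head start of `interval` plain supervised steps
    for _ in range(interval):
        x, y = next_batch()
        opt_t.zero_grad()
        F.cross_entropy(teacher(x), y).backward()
        opt_t.step()

    for _ in range(steps):
        x, y = next_batch()
        # teacher keeps training, remaining `interval` steps ahead
        opt_t.zero_grad()
        F.cross_entropy(teacher(x), y).backward()
        opt_t.step()
        # student: supervised loss + KL to the (detached) teacher
        opt_s.zero_grad()
        s_logits = student(x)
        with torch.no_grad():
            t_logits = teacher(x)
        kd = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        loss = alpha * F.cross_entropy(s_logits, y) + (1 - alpha) * kd
        loss.backward()
        opt_s.step()
    return student

# toy usage: two small MLPs on random data
if __name__ == "__main__":
    import torch.utils.data as td
    data = td.TensorDataset(torch.randn(256, 20), torch.randint(0, 5, (256,)))
    loader = td.DataLoader(data, batch_size=32, shuffle=True)
    make = lambda: torch.nn.Sequential(
        torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
    spaced_kd_train(make, loader, steps=100, interval=20)
```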
Authors:Dongyuan Li, Satoshi Kosugi, Ying Zhang, Manabu Okumura, Feng Xia, Renhe Jiang
Abstract:
Dynamic graph clustering aims to detect and track time-varying clusters in dynamic graphs, revealing the evolutionary mechanisms of complex real-world dynamic systems. Matrix factorization-based methods are promising approaches for this task; however, these methods often struggle with scalability and can be time-consuming when applied to large-scale dynamic graphs. Moreover, they tend to lack robustness and are vulnerable to real-world noisy data. To address these issues, we make three key contributions. First, to improve scalability, we propose temporal separated matrix factorization, where a single matrix is divided into multiple smaller matrices for independent factorization, resulting in faster computation. Second, to improve robustness, we introduce bi-clustering regularization, which jointly optimizes graph embedding and clustering, thereby filtering out noisy features from the graph embeddings. Third, to further enhance effectiveness and efficiency, we propose selective embedding updating, where we update only the embeddings of dynamic nodes while the embeddings of static nodes are fixed among different timestamps. Experimental results on six synthetic and five real-world benchmarks demonstrate the scalability, robustness and effectiveness of our proposed method. Source code is available at https://github.com/Clearloveyuan/DyG-MF.
中文: 本研究提出了一种可扩展且鲁棒的动态图聚类方法,通过时序分离矩阵分解、双聚类正则化和选择性嵌入更新,有效处理大规模含噪声图数据。
English: This study introduces a scalable and robust dynamic graph clustering method using temporal separated matrix factorization, bi-clustering regularization, and selective embedding updating to efficiently handle large-scale noisy graphs.
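The scalability trick, temporal separated matrix factorization, amounts to factorizing the dynamic graph block by block instead of as one large matrix. A hedged sketch using scikit-learn's NMF is given below; the bi-clustering regularizer and selective embedding updating from the abstract are omitted.

```python
import numpy as np
from sklearn.decomposition import NMF

def temporal_separated_factorization(snapshots, rank=8, block_size=4):
    """Factorize a dynamic graph block by block instead of as one big matrix.

    snapshots: list of (n, n) non-negative adjacency matrices over time.
    Each temporal block is stacked and factorized independently, which is the
    separation idea sketched in the abstract (regularizers omitted here).
    """
    embeddings = []
    for start in range(0, len(snapshots), block_size):
        block = np.hstack(snapshots[start:start + block_size])  # (n, n * block)
        model = NMF(n_components=rank, init="nndsvda", max_iter=300)
        W = model.fit_transform(block)   # (n, rank) node embeddings for this block
        embeddings.append(W)
    return embeddings

# toy usage: 8 random non-negative snapshots of a 30-node graph
snaps = [np.abs(np.random.default_rng(i).normal(size=(30, 30))) for i in range(8)]
print([W.shape for W in temporal_separated_factorization(snaps)])
```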
Authors:Jusheng Zhang, Yijia Fan, Kaitong Cai, Keze Wang
Abstract:
Although Kolmogorov-Arnold based interpretable networks (KAN) have strong theoretical expressiveness, they face significant parameter explosion and high-frequency feature capture challenges in high-dimensional tasks. To address this issue, we propose the Kolmogorov-Arnold-Fourier Network (KAF), which effectively integrates trainable Random Fourier Features (RFF) and a novel hybrid GELU-Fourier activation mechanism to balance parameter efficiency and spectral representation capabilities. Our key technical contributions include: (1) merging KAN's dual-matrix structure through matrix association properties to substantially reduce parameters; (2) introducing learnable RFF initialization strategies to eliminate spectral distortion in high-dimensional approximation tasks; (3) implementing an adaptive hybrid activation function that progressively enhances frequency representation during the training process. Comprehensive experiments demonstrate the superiority of our KAF across various domains including vision, NLP, audio processing, and differential equation-solving tasks, effectively combining theoretical interpretability with practical utility and computational efficiency.
中文摘要:提出的Kolmogorov-Arnold-Fourier网络(KAF)通过融合可训练傅里叶特征和混合激活机制,有效解决了可解释网络在高维任务中的参数爆炸与高频特征捕获难题,在保持理论可解释性的同时实现了跨领域应用的卓越性能。
English Summary: The proposed Kolmogorov-Arnold-Fourier Network (KAF) overcomes parameter explosion and high-frequency limitations in interpretable networks by integrating trainable Fourier features and hybrid activation, achieving superior performance across multiple domains with enhanced efficiency.
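The two ingredients named in the abstract, trainable random Fourier features and a GELU-Fourier blend, can be sketched directly; the exact parameterization below (a sigmoid-gated mix and a cos/sin feature map) is an illustrative assumption rather than the paper's architecture.

```python
import math
import torch
import torch.nn as nn

class TrainableRFF(nn.Module):
    """Random Fourier features with trainable frequencies: phi(x) = [cos(xW+b), sin(xW+b)]."""
    def __init__(self, in_dim, num_features):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, num_features))      # learnable frequencies
        self.b = nn.Parameter(2 * math.pi * torch.rand(num_features))

    def forward(self, x):
        z = x @ self.W + self.b
        return torch.cat([torch.cos(z), torch.sin(z)], dim=-1) / math.sqrt(self.W.shape[1])

class GELUFourierActivation(nn.Module):
    """Hybrid activation: a learned blend of GELU and a sinusoid (illustrative form)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))  # blend weight, learned during training

    def forward(self, x):
        w = torch.sigmoid(self.alpha)
        return (1 - w) * nn.functional.gelu(x) + w * torch.sin(x)

# toy forward pass: RFF doubles the feature count (cos and sin), hence Linear(128, 32)
layer = nn.Sequential(TrainableRFF(16, 64), nn.Linear(128, 32), GELUFourierActivation())
print(layer(torch.randn(4, 16)).shape)  # -> torch.Size([4, 32])
```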
Authors:Tenglong Liu, Jianxiong Li, Yinan Zheng, Haoyi Niu, Yixing Lan, Xin Xu, Xianyuan Zhan
Abstract:
Humans excel at reusing prior knowledge to address new challenges and developing skills while solving problems. This paradigm becomes increasingly popular in the development of autonomous agents, as it develops systems that can self-evolve in response to new challenges like human beings. However, previous methods suffer from limited training efficiency when expanding new skills and fail to fully leverage prior knowledge to facilitate new task learning. In this paper, we propose Parametric Skill Expansion and Composition (PSEC), a new framework designed to iteratively evolve the agents' capabilities and efficiently address new challenges by maintaining a manageable skill library. This library can progressively integrate skill primitives as plug-and-play Low-Rank Adaptation (LoRA) modules in parameter-efficient finetuning, facilitating efficient and flexible skill expansion. This structure also enables the direct skill compositions in parameter space by merging LoRA modules that encode different skills, leveraging shared information across skills to effectively program new skills. Based on this, we propose a context-aware module to dynamically activate different skills to collaboratively handle new tasks. Empowering diverse applications including multi-objective composition, dynamics shift, and continual policy shift, the results on D4RL, DSRL benchmarks, and the DeepMind Control Suite show that PSEC exhibits superior capacity to leverage prior knowledge to efficiently tackle new challenges, as well as expand its skill libraries to evolve the capabilities. Project website: https://ltlhuuu.github.io/PSEC/.
中文:PSEC框架通过可插拔的LoRA模块库使自主智能体能够高效扩展和组合技能,动态激活这些技能以应对新挑战,同时利用先验知识实现卓越性能。
English: The PSEC framework enables autonomous agents to efficiently expand and compose skills using a library of plug-and-play LoRA modules, dynamically activating them to tackle new challenges while leveraging prior knowledge for superior performance.
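Composing skills in parameter space by merging LoRA modules reduces to summing each skill's low-rank delta onto the frozen base weight. The sketch below uses a static convex combination of skills, whereas the abstract describes a context-aware module that would produce the mixing weights dynamically.

```python
import torch

def merge_lora_skills(base_weight, lora_pairs, mix_weights):
    """Compose skills by merging their LoRA updates in parameter space.

    base_weight: (out, in) frozen base matrix
    lora_pairs:  list of (B, A) with B: (out, r) and A: (r, in), one pair per skill
    mix_weights: list of scalars, one per skill (static here; a context-aware
                 router could produce these dynamically)
    """
    merged = base_weight.clone()
    for w, (B, A) in zip(mix_weights, lora_pairs):
        merged += w * (B @ A)   # each skill's low-rank delta
    return merged

# toy usage: two skills with rank-4 adapters on a 64x64 layer
base = torch.randn(64, 64)
skills = [(torch.randn(64, 4) * 0.01, torch.randn(4, 64) * 0.01) for _ in range(2)]
print(merge_lora_skills(base, skills, [0.7, 0.3]).shape)
```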
Authors:Zhifei Yang, Keyang Lu, Chao Zhang, Jiaxing Qi, Hanqi Jiang, Ruifei Ma, Shenglin Yin, Yifan Xu, Mingzhe Xing, Zhen Xiao, Jieyi Long, Guangyao Zhai
Abstract:
Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graph-based methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance. Project page: https://yangzhifeio.github.io/project/MMGDreamer.
Authors:Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng
Abstract:
Learning an agent model that behaves like humans-capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective-is a fundamental challenge in computer vision. Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding-action-prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method. The code and trained models will be publicly available at https://github.com/zju3dv/EgoAgent.
Chinese: EgoAgent是一种基于Transformer的统一智能体模型,通过单一架构同时学习感知、预测和行动,并通过显式建模它们之间的因果与时间依赖关系,在多项任务中展现出卓越性能。
English: EgoAgent is a unified transformer-based model that jointly learns perception, prediction, and action through a single architecture, demonstrating superior performance across multiple tasks by explicitly modeling their causal and temporal dependencies.
Authors:Rafał Karczewski, Markus Heinonen, Vikas Garg
Abstract:
Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality images by mapping noise to a data distribution. However, recent findings suggest that image likelihood does not align with perceptual quality: high-likelihood samples tend to be smooth, while lower-likelihood ones are more detailed. Controlling sample density is thus crucial for balancing realism and detail. In this paper, we analyze an existing technique, Prior Guidance, which scales the latent code to influence image detail. We introduce score alignment, a condition that explains why this method works and show that it can be tractably checked for any continuous normalizing flow model. We then propose Density Guidance, a principled modification of the generative ODE that enables exact log-density control during sampling. Finally, we extend Density Guidance to stochastic sampling, ensuring precise log-density control while allowing controlled variation in structure or fine details. Our experiments demonstrate that these techniques provide fine-grained control over image detail without compromising sample quality. Code is available at https://github.com/Aalto-QuML/density-guidance.
中文: 本文提出密度引导方法,通过调整生成过程精确控制样本密度,从而在不牺牲图像质量的前提下实现对图像细节的精细化调控。
English: This paper introduces Density Guidance, a method that enables precise control over image detail by modifying the generative process to regulate sample density, ensuring high-quality outputs without sacrificing perceptual quality.
Authors:Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
Abstract:
In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{https://github.com/lmsdss/LayerNorm-Scaling}{LayerNorm-Scaling}.
中文: 本文揭示了大型语言模型中的“深度诅咒”现象,指出其源于预层归一化导致的输出方差爆炸,并提出层归一化缩放方法,有效缓解该问题并提升不同规模模型的训练效果。
English: This paper identifies the Curse of Depth in LLMs, attributing it to Pre-Layer Normalization's output variance explosion, and proposes LayerNorm Scaling to effectively mitigate this issue and enhance training performance across various model sizes.
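The proposed fix is concrete: divide each block's LayerNorm output by the square root of its depth. A minimal PyTorch wrapper is sketched below, with 1-based layer indexing as an assumption.

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_depth) (LayerNorm Scaling).

    `layer_depth` is assumed to be the 1-based index of the Transformer block;
    deeper blocks receive smaller scales, curbing the variance growth of Pre-LN.
    """
    def __init__(self, hidden_size, layer_depth):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / math.sqrt(layer_depth)

    def forward(self, x):
        return self.scale * self.ln(x)

# e.g. the Pre-LN in block 12 of a model with hidden size 768
norm = ScaledLayerNorm(768, layer_depth=12)
print(norm(torch.randn(2, 16, 768)).std())
```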
Authors:Miroslav Štrupl, Oleg Szehr, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber
Abstract:
This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theoretical foundation for algorithms that build on the broad paradigm of approaching reinforcement learning through supervised learning or sequence modeling. At the core of this investigation lies the analysis of conditions on the underlying environment, under which the algorithms can identify optimal solutions. We also assess whether emerging solutions remain stable in situations where the environment is subject to tiny levels of noise. Specifically, we study the continuity and asymptotic convergence of command-conditioned policies, values and the goal-reaching objective depending on the transition kernel of the underlying Markov Decision Process. We demonstrate that near-optimal behavior is achieved if the transition kernel is located in a sufficiently small neighborhood of a deterministic kernel. The mentioned quantities are continuous (with respect to a specific topology) at deterministic kernels, both asymptotically and after a finite number of learning cycles. The developed methods allow us to present the first explicit estimates on the convergence and stability of policies and values in terms of the underlying transition kernels. On the theoretical side we introduce a number of new concepts to reinforcement learning, like working in segment spaces, studying continuity in quotient topologies and the application of the fixed-point theory of dynamical systems. The theoretical study is accompanied by a detailed investigation of example environments and numerical experiments.
中文: 本文为基于监督学习的强化学习算法建立了理论基础,证明了在接近确定性转移核的环境中,这些算法能够实现近似最优性能并保持稳定性。
English: This article establishes a theoretical foundation for reinforcement learning algorithms based on supervised learning, demonstrating their near-optimal performance and stability in environments with minimal noise near deterministic transition kernels.
Authors:Diego Calanzone, Pierluca D'Oro, Pierre-Luc Bacon
Abstract:
Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.
Chinese: Mol-MoE采用专家混合架构,无需重新训练即可在测试时灵活引导分子生成,通过将专家组合与用户指定的权衡对齐,实现了卓越的样本质量和可操控性。
English: Mol-MoE introduces a mixture-of-experts architecture that enables flexible, test-time steering of molecule generation without retraining, achieving superior sample quality and steerability by aligning expert combinations with user-specified trade-offs.
Authors:Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
Abstract:
In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 106 studies across 75 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. We aim to provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
中文摘要:本综述系统分析了将大型多模态模型适配低资源语言的技术,发现视觉增强是提升性能的关键桥梁,但幻觉缓解和计算效率仍是主要挑战。
English Summary: This survey systematically examines techniques for adapting large multimodal models to low-resource languages, identifying visual enhancement as a key strategy while highlighting persistent challenges in hallucination mitigation and computational efficiency.
Authors:Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
Abstract:
The long-standing dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While TabPFNv2 foundation model excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model for classification, pretrained on synthetic datasets with up to 60K samples and capable of handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On 53 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data. Pretraining code, inference code, and pre-trained models are available at https://github.com/soda-inria/tabicl.
中文: TabICL通过创新的两阶段架构(先列后行的注意力机制)实现了表格数据中上下文学习的高效扩展,在保持与TabPFNv2相当性能的同时速度显著提升,并在大规模数据集上超越所有现有方法。
English: TabICL introduces a novel two-stage architecture with column-then-row attention to efficiently scale In-Context Learning for tabular data, matching TabPFNv2's performance while being significantly faster and outperforming other methods on large datasets.
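A simplified sketch of the two-stage idea: attention across a row's cells builds a fixed-dimensional row embedding, and a transformer over rows then performs in-context learning, with unlabeled rows reading off predictions from the labeled context rows. Cell featurization, label injection, and masking below are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ColumnThenRowICL(nn.Module):
    """Column-then-row tabular ICL sketch: (1) attend across a row's cells to get
    a fixed-dimensional row embedding; (2) run a transformer over rows so test
    rows can condition on the labeled (context) rows."""

    def __init__(self, d_model=64, n_heads=4, n_classes=2):
        super().__init__()
        self.cell_proj = nn.Linear(1, d_model)                 # embed each scalar cell
        self.col_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.row_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        self.label_emb = nn.Embedding(n_classes + 1, d_model)  # last index = "unlabeled"
        self.unknown_idx = n_classes
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, X, y_context):
        # X: (n_rows, n_cols) numeric table; y_context: (n_rows,) labels,
        # with -1 marking test rows to be predicted in-context.
        cells = self.cell_proj(X.unsqueeze(-1))                # (rows, cols, d)
        attended, _ = self.col_attn(cells, cells, cells)       # stage 1: across columns
        rows = attended.mean(dim=1)                            # fixed-dim row embeddings
        labels = torch.where(y_context >= 0, y_context,
                             torch.full_like(y_context, self.unknown_idx))
        rows = rows + self.label_emb(labels)                   # inject context labels
        ctx = self.row_encoder(rows.unsqueeze(0))              # stage 2: ICL over rows
        return self.head(ctx.squeeze(0))                       # (rows, n_classes) logits

# toy usage: 7 labeled context rows, 3 unlabeled test rows
model = ColumnThenRowICL()
X = torch.randn(10, 6)
y = torch.tensor([0, 1, 0, 1, 0, 1, 0, -1, -1, -1])
print(model(X, y)[y == -1].shape)   # logits for the 3 test rows
```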
Authors:Zinan Lin, Tadas Baltrusaitis, Wenyu Wang, Sergey Yekhanin
Abstract:
Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art (SoTA) models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow APIs beyond foundation models. In particular, we demonstrate that many SoTA data synthesizers that do not rely on neural networks--such as computer graphics-based image generators, which we refer to as simulators--can be effectively integrated into PE. This insight significantly broadens PE's applicability and unlocks the potential of powerful simulators for DP data synthesis. We explore this approach, named Sim-PE, in the context of image synthesis. Across four diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x, reducing FID by up to 80%, and offering much greater efficiency. We also show that simulators and foundation models can be easily leveraged together within PE to achieve further improvements. The code is open-sourced in the Private Evolution Python library: https://github.com/microsoft/DPSDA.
中文: 本研究提出了Sim-PE,作为私有进化框架的扩展,通过整合非神经网络的模拟器来生成差分隐私合成数据,显著提升了图像合成中的性能、效率和适用性。
English: The study introduces Sim-PE, an extension of the Private Evolution framework that integrates non-neural network simulators for differentially private synthetic data generation, significantly enhancing performance, efficiency, and applicability in image synthesis.
Authors:Qianteng Zhu, Gert Aarts, Wei Wang, Kai Zhou, Lingxiao Wang
Abstract:
We develop diffusion models for simulating lattice gauge theories, where stochastic quantization is explicitly incorporated as a physical condition for sampling. We demonstrate the applicability of this novel sampler to U(1) gauge theory in two spacetime dimensions and find that a model trained at a small inverse coupling constant can be extrapolated to larger inverse coupling regions without encountering the topological freezing problem. Additionally, the trained model can be employed to sample configurations on different lattice sizes without requiring further training. The exactness of the generated samples is ensured by incorporating Metropolis-adjusted Langevin dynamics into the generation process. Furthermore, we demonstrate that this approach enables more efficient sampling of topological quantities compared to traditional algorithms such as Hybrid Monte Carlo and Langevin simulations.
中文: 本研究开发了用于模拟晶格规范理论的扩散模型,通过随机量化对二维时空中的U(1)规范理论进行采样,无需重新训练即可外推至更大耦合参数和不同晶格尺寸,并利用Metropolis调整的Langevin动力学保证样本精确性,在拓扑量采样效率上优于传统算法。
English: This study introduces diffusion models for simulating lattice gauge theories, incorporating stochastic quantization to sample U(1) gauge theory in 2D spacetime, enabling extrapolation to larger couplings and different lattice sizes without retraining while ensuring sample exactness through Metropolis-adjusted Langevin dynamics and outperforming traditional algorithms in topological quantity sampling.
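The exactness guarantee mentioned above comes from Metropolis-adjusted Langevin dynamics (MALA), which corrects a Langevin proposal with an accept/reject test. A minimal NumPy sketch on a toy log-density (standing in for the learned score) is shown below; it illustrates the MALA correction only, not the paper's diffusion sampler or gauge-theory setup.

```python
import numpy as np

def mala_step(x, logp, grad_logp, step=1e-2, rng=None):
    """One Metropolis-adjusted Langevin update: a Langevin proposal followed by
    an accept/reject test that keeps the target distribution exact."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(size=x.shape)
    prop = x + step * grad_logp(x) + np.sqrt(2 * step) * noise

    def log_q(a, b):
        # log density (up to a constant) of proposing b when starting from a
        diff = b - a - step * grad_logp(a)
        return -np.sum(diff ** 2) / (4 * step)

    log_alpha = logp(prop) - logp(x) + log_q(prop, x) - log_q(x, prop)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return x, False

# toy target: a standard Gaussian stands in for the learned model's log-density
logp = lambda x: -0.5 * np.sum(x ** 2)
grad_logp = lambda x: -x
rng = np.random.default_rng(0)
x = np.zeros(4)
for _ in range(1000):
    x, _ = mala_step(x, logp, grad_logp, rng=rng)
print(x)
```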
Authors:Shadab Ahamed, Simon Ghyselincks, Pablo Chang Huang Arias, Julian Kloiber, Yasin Ranjbar, Jingrong Tang, Niloufar Zakariaei, Eldad Haber
Abstract:
Magnetic data inversion is an important tool in geophysics, used to infer subsurface magnetic susceptibility distributions from surface magnetic field measurements. This inverse problem is inherently ill-posed, characterized by non-unique solutions, depth ambiguity, and sensitivity to noise. Traditional inversion approaches rely on predefined regularization techniques to stabilize solutions, limiting their adaptability to complex or diverse geological scenarios. In this study, we propose an approach that integrates variable dictionary learning and scale-space methods to address these challenges. Our method employs learned dictionaries, allowing for adaptive representation of complex subsurface features that are difficult to capture with predefined bases. Additionally, we extend classical variational inversion by incorporating multi-scale representations through a scale-space framework, enabling the progressive introduction of structural detail while mitigating overfitting. We implement both fixed and dynamic dictionary learning techniques, with the latter introducing iteration-dependent dictionaries for enhanced flexibility. Using a synthetic dataset to simulate geological scenarios, we demonstrate significant improvements in reconstruction accuracy and robustness compared to conventional variational and dictionary-based methods. Our results highlight the potential of learned dictionaries, especially when coupled with scale-space dynamics, to improve model recovery and noise handling. These findings underscore the promise of our data-driven approach for advancing magnetic data inversion and its applications in geophysical exploration, environmental assessment, and mineral prospecting. The code is publicly available at: https://github.com/ahxmeds/magnetic-inversion-dictionary.git.
中文: 本研究提出了一种结合自适应字典学习和多尺度框架的新型磁数据反演方法,通过优化特征表征和噪声抑制能力显著提升了地下结构重建效果,在精度与鲁棒性上均优于传统方法。
English: This study introduces a novel magnetic data inversion method that combines adaptive dictionary learning with a multi-scale framework to enhance subsurface reconstruction by improving feature representation and noise resilience, outperforming traditional techniques in accuracy and robustness.
Authors:Dylan Waldner, Risto Miikkulainen
Abstract:
As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This study introduces the Odyssey, a lightweight, adaptive text-based adventure game, providing a scalable framework for exploring AI ethics and safety. The Odyssey examines the ethical implications of implementing biological drives, specifically self-preservation, into three different agents: a Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT-4o agent. The agents select actions in each scenario to survive, adapting to increasingly challenging scenarios. Post-simulation analysis evaluates the ethical scores of the agents' decisions, uncovering the tradeoffs they navigate to survive. Specifically, the analysis finds that when danger increases, agents' ethical behavior becomes unpredictable. Surprisingly, the GPT-4o agent outperformed the Bayesian models in both survival and ethical consistency, challenging assumptions about traditional probabilistic methods and raising a new challenge to understand the mechanisms of LLMs' probabilistic reasoning.
中文摘要:本研究通过引入轻量级文本冒险游戏《奥德赛》,探索了三种具有生物驱动力的智能体的伦理行为,发现在危险增加时伦理行为变得不可预测,且GPT-4o智能体在生存能力和伦理一致性上均意外优于贝叶斯模型。
English Summary: This study introduces the Odyssey, a text-based game framework, to explore AI ethics by testing three agents with biological drives, finding that ethical behavior becomes unpredictable under danger and that the GPT-4o agent surprisingly outperformed Bayesian models in both survival and ethical consistency.
Authors:William Huey, Huaxiaoyue Wang, Anne Wu, Yoav Artzi, Sanjiban Choudhury
Abstract:
We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. Our code is available at https://github.com/portal-cornell/orca/
中文: 本文提出ORCA序列对齐方法,通过确保智能体按顺序覆盖演示子目标来解决视觉模仿学习中的时序错位问题,相比帧级匹配方法实现了显著性能提升。
English: This paper introduces ORCA, a sequence-level alignment method that addresses temporal misalignment in visual imitation learning by ensuring ordered coverage of demonstration subgoals, achieving significant performance improvements over frame-level matching approaches.
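To make the sequence-level idea concrete, the sketch below computes a toy ordered-coverage score from a matrix of per-frame match probabilities: a dynamic program tracks the probability that the first k demonstration subgoals have been covered in order, and the per-timestep reward is the gain in expected coverage. This is an illustration of the ordered-coverage principle under an independence assumption, not the published ORCA reward.

```python
import numpy as np

def ordered_coverage_rewards(match_prob):
    """Toy ordered-coverage rewards. match_prob[t, k] is the probability that the
    agent frame at time t matches demonstration subgoal k. cov[k] tracks the
    probability that subgoals 0..k-1 have been covered in order so far (treating
    matches as independent); the reward at t is the increase in expected coverage."""
    T, K = match_prob.shape
    cov = np.zeros(K + 1)
    cov[0] = 1.0                          # the empty prefix is always covered
    rewards = np.zeros(T)
    for t in range(T):
        prev_total = cov[1:].sum()
        for k in range(K, 0, -1):         # back-to-front so updates use values from t-1
            newly = (cov[k - 1] - cov[k]) * match_prob[t, k - 1]
            cov[k] += newly
        rewards[t] = cov[1:].sum() - prev_total
    return rewards

probs = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.1],
                  [0.0, 0.2, 0.9]])
print(ordered_coverage_rewards(probs))
```

Because coverage can only progress along the demonstration order, rewards are earned for consistent progress rather than for matching individual frames in isolation.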
Authors:Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu
Abstract:
The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to ``relearning'' the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning. Codes are available at https://github.com/OPTML-Group/Unlearn-Smooth.
中文: 本研究提出了一种鲁棒遗忘框架,通过锐度感知最小化和平滑性优化来防御大语言模型中的再学习和越狱攻击,并在基准数据集上进行了实验验证。
English: The study introduces a robust unlearning framework for LLMs that leverages sharpness-aware minimization and smoothness optimization to defend against relearning and jailbreaking attacks, with experimental validation on benchmark datasets.
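For reference, the standard sharpness-aware minimization (SAM) update that the paper connects to robust unlearning looks roughly as follows; the sketch uses a toy regression loss rather than an unlearning objective, and the rho value is an arbitrary choice.

```python
import torch

def sam_step(model, loss_fn, data, optimizer, rho=0.05):
    """One SAM update: ascend to a nearby worst-case weight perturbation,
    take the gradient there, then descend from the original weights."""
    loss = loss_fn(model, data)
    loss.backward()                                       # gradient at current weights
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        eps = [rho * g / norm for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)                                     # sharpness ascent step
    optimizer.zero_grad()
    loss_fn(model, data).backward()                       # gradient at perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                                     # restore original weights
    optimizer.step()                                      # descend with perturbed-point gradient
    optimizer.zero_grad()

# toy usage: a linear model on random data
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = lambda m, d: torch.nn.functional.mse_loss(m(d[0]), d[1])
sam_step(model, loss_fn, (x, y), opt)
```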
Authors:Weihua Du, Yiming Yang, Sean Welleck
Abstract:
Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
中文摘要:本文提出一种无需标注验证数据的自动化方法,通过新颖的基于熵的指标和随机过程模型优化大语言模型中的温度选择,在多样本聚合策略下显著提升模型性能与可解释性。
English Summary: This paper introduces an automated method for optimizing temperature selection in large language models using multi-sample aggregation, eliminating the need for labeled validation data through a novel entropy-based metric and stochastic process model for improved performance and interpretability.
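A generic scaffold for entropy-guided temperature selection is sketched below. It averages the entropy of sampled answers per candidate temperature; `sample_answers` is a hypothetical stand-in for the model's decoding call, and the final selection heuristic is ours, not the metric proposed in the paper.

```python
import math
import random
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = sum(counts.values())
    return -sum(c / n * math.log(c / n) for c in counts.values())

def temperature_entropy_profile(prompts, sample_answers, temperatures, n=16):
    """Average answer entropy per candidate temperature. sample_answers(prompt,
    temperature, n) is a hypothetical helper wrapping the model's decoding call."""
    profile = {}
    for tau in temperatures:
        ents = [answer_entropy(sample_answers(p, tau, n)) for p in prompts]
        profile[tau] = sum(ents) / len(ents)
    return profile

# toy stand-in for an LLM whose answers get more diverse as temperature rises
def fake_sampler(prompt, tau, n):
    return [random.choice("AB") if random.random() < tau / 2 else "A" for _ in range(n)]

profile = temperature_entropy_profile(["q1", "q2"], fake_sampler, [0.2, 0.7, 1.2])
# one simple (non-paper) heuristic: take the largest temperature whose entropy
# stays under a budget, trading sample diversity against vote consistency
budget = 0.4
tau_star = max([t for t, h in profile.items() if h <= budget], default=min(profile))
print(profile, tau_star)
```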
Authors:Pierre Fernandez
Abstract:
Watermarking embeds information into digital content like images, audio, or text, imperceptible to humans but robustly detectable by specific algorithms. This technology has important applications in many industry challenges, such as content moderation, tracing AI-generated content, and monitoring the usage of AI models. The contributions of this thesis include the development of new watermarking techniques for images, audio, and text. We first introduce methods for active moderation of images on social platforms. We then develop specific techniques for AI-generated content. We specifically demonstrate methods to adapt latent generative models to embed watermarks in all generated content, identify watermarked sections in speech, and improve watermarking in large language models with tests that ensure low false positive rates. Furthermore, we explore the use of digital watermarking to detect model misuse, including the detection of watermarks in language models fine-tuned on watermarked text, and introduce training-free watermarks for the weights of large transformers. Through these contributions, the thesis provides effective solutions for the challenges posed by the increasing use of generative AI models and the need for model monitoring and content moderation. It finally examines the challenges and limitations of watermarking techniques and discusses potential future directions for research in this area.
Authors:Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein
Abstract:
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
中文摘要:本研究提出了一种新颖的语言模型架构,通过隐式潜在空间推理扩展测试时计算,无需专门训练数据,在增加计算负载的情况下显著提升了推理基准测试性能。
English Summary: This study introduces a novel language model architecture that scales test-time computation through latent reasoning, requiring no specialized training data and outperforming traditional reasoning models on benchmarks with increased computational load.
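The depth-recurrence idea can be illustrated with a toy module that iterates a single transformer block k times at inference, so compute scales with k rather than with the number of generated tokens; the vocabulary size, width, and single-block design below are illustrative assumptions, not the 3.5B-parameter architecture.

```python
import torch
import torch.nn as nn

class LatentRecurrentReasoner(nn.Module):
    """Toy depth-recurrent model: one transformer block is iterated k times on a
    latent state, so test-time compute grows with k at fixed parameter count."""
    def __init__(self, d=64, heads=4, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.block = nn.TransformerEncoderLayer(d, heads, dim_feedforward=4 * d,
                                                batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens, k=4):
        h = self.embed(tokens)
        for _ in range(k):                 # unroll the same block to arbitrary depth
            h = self.block(h)
        return self.head(h)

model = LatentRecurrentReasoner()
tokens = torch.randint(0, 1000, (2, 16))
print(model(tokens, k=2).shape, model(tokens, k=16).shape)  # same weights, more compute
```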
Authors:Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
Abstract:
The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model, DuoGuard, outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.
中文: 本研究提出了一种新颖的双玩家强化学习框架,通过生成高质量合成数据来增强多语言护栏模型,在性能和效率上均优于现有方法。
English: This study introduces a novel two-player reinforcement learning framework that generates high-quality synthetic data to enhance multilingual guardrail models, achieving superior performance and efficiency over existing methods.
Authors:Shiqin Tang, Shujian Yu, Yining Dong, S. Joe Qin
Abstract:
This paper presents Deep Dynamic Probabilistic Canonical Correlation Analysis (D2PCCA), a model that integrates deep learning with probabilistic modeling to analyze nonlinear dynamical systems. Building on the probabilistic extensions of Canonical Correlation Analysis (CCA), D2PCCA captures nonlinear latent dynamics and supports enhancements such as KL annealing for improved convergence and normalizing flows for a more flexible posterior approximation. D2PCCA naturally extends to multiple observed variables, making it a versatile tool for encoding prior knowledge about sequential datasets and providing a probabilistic understanding of the system's dynamics. Experimental validation on real financial datasets demonstrates the effectiveness of D2PCCA and its extensions in capturing latent dynamics.
中文: 本文提出D2PCCA模型,将深度学习与概率建模相结合,用于分析非线性动态系统,捕捉潜在动态并支持KL退火和标准化流等增强功能以提升性能。
English: This paper introduces D2PCCA, a model that combines deep learning with probabilistic modeling to analyze nonlinear dynamical systems, capturing latent dynamics and supporting enhancements like KL annealing and normalizing flows for better performance.
Authors:Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu
Abstract:
The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics
中文: 本文提出了一种新颖的音频美学自动评估方法,通过将人类听觉视角分解为四个维度并训练无参考预测模型,在性能上达到或超越了人工评分和现有方法,同时提供了开源资源以供后续研究。
English: This paper introduces a novel automated approach for evaluating audio aesthetics by decomposing human listening perspectives into four axes and training no-reference prediction models, which demonstrate performance comparable or superior to human ratings and existing methods while providing open-source resources for future research.
Authors:Xiuyuan Hu, Guoqing Liu, Can Chen, Yang Zhao, Hao Zhang, Xue Liu
Abstract:
Structure-based drug discovery, encompassing the tasks of protein-ligand docking and pocket-aware 3D drug design, represents a core challenge in drug discovery. However, no existing work can deal with both tasks to effectively leverage the duality between them, and current methods for each task are hindered by challenges in modeling 3D information and the limitations of available data. To address these issues, we propose 3DMolFormer, a unified dual-channel transformer-based framework applicable to both docking and 3D drug design tasks, which exploits their duality by utilizing docking functionalities within the drug design process. Specifically, we represent 3D pocket-ligand complexes using parallel sequences of discrete tokens and continuous numbers, and we design a corresponding dual-channel transformer model to handle this format, thereby overcoming the challenges of 3D information modeling. Additionally, we alleviate data limitations through large-scale pre-training on a mixed dataset, followed by supervised and reinforcement learning fine-tuning techniques respectively tailored for the two tasks. Experimental results demonstrate that 3DMolFormer outperforms previous approaches in both protein-ligand docking and pocket-aware 3D drug design, highlighting its promising application in structure-based drug discovery. The code is available at: https://github.com/HXYfighter/3DMolFormer .
中文: 提出的3DMolFormer是一个统一的双通道Transformer框架,通过创新的令牌表示和大规模预训练同时解决蛋白质-配体对接和口袋感知3D药物设计任务,在两项任务中均优于现有方法。
English: The proposed 3DMolFormer is a unified dual-channel transformer framework that addresses both protein-ligand docking and pocket-aware 3D drug design by leveraging their duality through innovative token representation and large-scale pre-training, outperforming existing methods in both tasks.
Authors:Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
Abstract:
One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state-of-the-art via a new method called QuEST, for which we demonstrate optimality at 4 bits and stable convergence as low as 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
中文摘要:QuEST方法通过改进量化技术和采用新型梯度估计器,将量化感知训练的最优位宽推进至4比特,并能在低至1比特的权重和激活下实现稳定收敛,同时保持模型精度。
English Summary: The QuEST method advances quantization-aware training by enabling stable convergence at extremely low bit-widths down to 1-bit, while maintaining accuracy through improved quantization techniques and a novel gradient estimator.
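A rough sketch of the first ingredient (Hadamard normalization plus MSE-aware uniform fitting) is shown below; the scale search and grid are simplifications of what the paper describes, and the trust gradient estimator is omitted entirely.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize_hadamard(w, bits=4):
    """Toy quantizer in the spirit of QuEST's first ingredient: rotate with an
    orthonormal Hadamard matrix so the distribution looks more Gaussian, then fit
    a symmetric uniform grid by a simple MSE search over the scale."""
    n = w.shape[-1]
    H = hadamard(n) / np.sqrt(n)           # orthonormal; n must be a power of two
    z = w @ H
    levels = 2 ** (bits - 1) - 1
    best_err, best_scale = None, None
    for s in np.linspace(z.std() / levels, 4 * z.std() / levels, 64):
        q = np.clip(np.round(z / s), -levels, levels) * s
        err = np.mean((q - z) ** 2)
        if best_err is None or err < best_err:
            best_err, best_scale = err, s
    q = np.clip(np.round(z / best_scale), -levels, levels) * best_scale
    return q @ H.T                          # rotate back to the original basis

w = np.random.randn(8, 64)
w_q = quantize_hadamard(w, bits=4)
print(np.mean((w - w_q) ** 2))              # reconstruction error of the 4-bit fit
```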
Authors:Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Abstract:
Model merging integrates the weights of multiple task-specific models into a single multi-task model. Despite recent interest in the problem, a significant performance gap between the combined and single-task models remains. In this paper, we investigate the key characteristics of task matrices -- weight update matrices applied to a pre-trained model -- that enable effective merging. We show that alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement over the pre-trained model. Based on this, we propose an isotropic merging framework that flattens the singular value spectrum of task matrices, enhances alignment, and reduces the performance gap. Additionally, we incorporate both common and task-specific subspaces to further improve alignment and performance. Our proposed approach achieves state-of-the-art performance on vision and language tasks across various sets of tasks and model scales. This work advances the understanding of model merging dynamics, offering an effective methodology to merge models without requiring additional training. Code is available at https://github.com/danielm1405/iso-merging .
Chinese: 本文提出了一种各向同性的模型融合框架,通过平坦化奇异值谱并结合公共与任务特定子空间,无需额外训练即可实现最先进的性能。
English: This paper introduces an isotropic merging framework that enhances model merging by flattening singular value spectra and incorporating common and task-specific subspaces, achieving state-of-the-art performance without additional training.
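The core spectrum-flattening step can be sketched in a few lines of NumPy: sum the task updates, replace the singular values of the sum with their mean, and add the result back to the pre-trained weights. This mirrors only the "isotropic" idea under our reading of the abstract; the task-specific subspace component is not shown.

```python
import numpy as np

def isotropic_merge(pretrained, finetuned_list):
    """Toy isotropic merging of task matrices: flatten the singular value spectrum
    of the summed task update (keep the singular vectors, set every singular value
    to the mean) before adding it back to the pre-trained weight matrix."""
    delta = sum(w - pretrained for w in finetuned_list)
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    s_flat = np.full_like(s, s.mean())        # isotropic (flat) spectrum
    return pretrained + U @ np.diag(s_flat) @ Vt

w0 = np.random.randn(16, 16)
tasks = [w0 + 0.1 * np.random.randn(16, 16) for _ in range(3)]
merged = isotropic_merge(w0, tasks)
print(merged.shape)
```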
Authors:Juan Miguel Lopez Alcaraz, Ebenezer Oloyede, David Taylor, Wilhelm Haverkamp, Nils Strodthoff
Abstract:
Electrocardiogram (ECG) analysis has emerged as a promising tool for identifying physiological changes associated with neuropsychiatric conditions. The relationship between cardiovascular health and neuropsychiatric disorders suggests that ECG abnormalities could serve as valuable biomarkers for more efficient detection, therapy monitoring, and risk stratification. However, the potential of the ECG to accurately distinguish neuropsychiatric conditions, particularly among diverse patient populations, remains underexplored. This study utilized ECG markers and basic demographic data to predict neuropsychiatric conditions using machine learning models, with targets defined through ICD-10 codes. Both internal and external validation were performed using the MIMIC-IV and ECG-View datasets respectively. Performance was assessed using AUROC scores. To enhance model interpretability, Shapley values were applied to provide insights into the contributions of individual ECG features to the predictions. Significant predictive performance was observed for conditions within the neurological and psychiatric groups. For the neurological group, Alzheimer's disease (G30) achieved an internal AUROC of 0.813 (0.812-0.814) and an external AUROC of 0.868 (0.867-0.868). In the psychiatric group, unspecified dementia (F03) showed an internal AUROC of 0.849 (0.848-0.849) and an external AUROC of 0.862 (0.861-0.863). Discriminative features align with known ECG markers but also provide hints on potentially new markers. ECG offers significant promise for diagnosing and monitoring neuropsychiatric conditions, with robust predictive performance across internal and external cohorts. Future work should focus on addressing potential confounders, such as therapy-related cardiotoxicity, and expanding the scope of ECG applications, including personalized care and early intervention strategies.
中文: 本研究证明通过机器学习分析心电图标记可有效预测神经精神疾病,在内部和外部验证中均表现稳健,同时揭示了潜在的新型诊断生物标志物。
English: This study demonstrates that electrocardiogram (ECG) markers analyzed through machine learning can effectively predict neuropsychiatric conditions, achieving robust performance in both internal and external validations while revealing potential new diagnostic biomarkers.
Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Abstract:
As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.
中文: 群体可通过协调提交篡改数据来影响平台,这需要事先评估影响并制定可实施的算法,我们的框架在理论及产品评估实验中对此进行了研究。
English: Collectives can influence platforms by coordinating altered data submissions, requiring a priori impact assessments and implementable algorithms, which our framework theoretically and experimentally addresses in product evaluation.
Authors:Yijun Wang, Yong Wang, Chendong Xu, Shuai Yao, Qisong Wu
Abstract:
Human Activity Recognition (HAR) such as fall detection has become increasingly critical due to the aging population, necessitating effective monitoring systems to prevent serious injuries and fatalities associated with falls. This study focuses on fine-tuning the Vision Transformer (ViT) model specifically for HAR using radar-based Time-Doppler signatures. Unlike traditional image datasets, these signals present unique challenges due to their non-visual nature and the high degree of similarity among various activities. Directly fine-tuning the ViT with all parameters proves suboptimal for this application. To address this challenge, we propose a novel approach that employs Low-Rank Adaptation (LoRA) fine-tuning in the weight space to facilitate knowledge transfer from pre-trained ViT models. Additionally, to extract fine-grained features, we enhance feature representation through the integration of a serial-parallel adapter in the feature space. Our innovative joint fine-tuning method, tailored for radar-based Time-Doppler signatures, significantly improves HAR accuracy, surpassing existing state-of-the-art methodologies in this domain. Our code is released at https://github.com/wangyijunlyy/SelaFD.
中文: 本研究提出了一种新颖的联合微调方法,通过低秩适应和串并联适配器优化视觉变换器在雷达时频信号上的人类活动识别性能,实现了超越现有方法的更高精度。
English: This study introduces a novel joint fine-tuning approach using Low-Rank Adaptation and serial-parallel adapters to optimize Vision Transformers for human activity recognition with radar-based Time-Doppler signatures, achieving superior accuracy over existing methods.
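For context, a minimal LoRA linear layer (the weight-space part of the joint fine-tuning described above) looks roughly like this; the rank, scaling, and initialization are common defaults, and the serial-parallel feature-space adapter is not shown.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: y = Wx + (alpha/r) B A x.
    Only A and B receive gradients, so adaptation is parameter-efficient."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)
```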
Authors:Yuwei Yin, Giuseppe Carenini
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
中文: 本文提出的ARR方法通过意图分析、信息检索和逐步推理三大步骤,显著提升了大语言模型在问答任务中的表现,并在多种测试中展现出优于基线方法的稳定性和泛化能力。
English: This paper introduces ARR, a novel question-answering method that enhances LLM performance through intent analysis, information retrieval, and step-by-step reasoning, demonstrating consistent superiority across diverse tasks and robust generalizability.
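A minimal prompt builder following the three named steps might look like the sketch below; the exact wording is illustrative and not the prompt used in the paper.

```python
def arr_prompt(question: str, options: list[str]) -> str:
    """Build a QA prompt that asks for the three ARR steps named in the abstract:
    analyze the question's intent, retrieve relevant information, reason step by step."""
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return (
        f"Question: {question}\n{opts}\n\n"
        "Answer the question above. First, analyze the intent of the question. "
        "Then, retrieve the relevant information. Finally, reason step by step "
        "and state the final answer.\n"
    )

print(arr_prompt("Which planet is largest?", ["Mars", "Jupiter", "Venus"]))
```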
Authors:Amitayush Thakur, George Tsoukalas, Greg Durrett, Swarat Chaudhuri
Abstract:
Neural networks have shown substantial promise at automatic theorem-proving in interactive proof assistants (ITPs) like Lean and Coq. However, most neural theorem-proving models are restricted to specific ITPs, leaving out opportunities for cross-lingual transfer between ITPs. We address this weakness with a multilingual proof framework, ProofWala, that allows a standardized form of interaction between neural theorem-provers and two established ITPs (Coq and Lean). It enables the collection of multilingual proof step data -- data recording the result of proof actions on ITP states -- for training neural provers. ProofWala allows the systematic evaluation of a model's performance across different ITPs and problem domains via efficient parallel proof search algorithms. We show that multilingual training enabled by ProofWala can lead to successful transfer across ITPs. Specifically, a model trained on a mix of ProofWala-generated Coq and Lean data outperforms Lean-only and Coq-only models on the standard prove-at-$k$ metric. We open source all code including code for the ProofWala Framework (https://github.com/trishullab/proof-wala), and the Multilingual ITP interaction framework (https://github.com/trishullab/itp-interface).
中文:多语言证明框架ProofWala实现了Coq和Lean等交互式定理证明器间的跨语言迁移,其多语言训练模型在标准评估指标上优于单一语言模型。
English: The multilingual proof framework ProofWala enables cross-lingual transfer between interactive theorem provers like Coq and Lean, with multilingual training demonstrating superior performance over single-language models.
Authors:Brian Formento, Chuan Sheng Foo, See-Kiong Ng
Abstract:
A fundamental issue in deep learning has been adversarial robustness. As these systems have scaled, such issues have persisted. Currently, large language models (LLMs) with billions of parameters suffer from adversarial attacks just like their earlier, smaller counterparts. However, the threat models have changed. Previously, having gray-box access, where input embeddings or output logits/probabilities were visible to the user, might have been reasonable. However, with the introduction of closed-source models, no information about the model is available apart from the generated output. This means that current black-box attacks can only utilize the final prediction to detect if an attack is successful. In this work, we investigate and demonstrate the potential of attack guidance, akin to using output probabilities, while having only black-box access in a classification setting. This is achieved through the ability to elicit confidence from the model. We empirically show that the elicited confidence is calibrated and not hallucinated for current LLMs. By minimizing the elicited confidence, we can therefore increase the likelihood of misclassification. Our new proposed paradigm demonstrates promising state-of-the-art results on three datasets across two models (LLaMA-3-8B-Instruct and Mistral-7B-Instruct-V0.3) when comparing our technique to existing hard-label black-box attack methods that introduce word-level substitutions.
中文: 本研究提出了一种新型黑盒攻击方法,通过从大语言模型中获取校准后的置信度分数来指导对抗性攻击,在无需模型内部信息的情况下仅通过词级替换就实现了最先进的误分类效果。
English: This study introduces a novel black-box attack method that guides adversarial attacks by eliciting calibrated confidence scores from large language models, achieving state-of-the-art misclassification rates through word-level substitutions without accessing internal model information.
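The attack loop can be sketched as a greedy search over word-level substitutions that minimizes the model's verbally elicited confidence. Both `query_confidence` and the substitution dictionary below are hypothetical stand-ins for the target-model call and the candidate generator; the sketch shows only the guidance signal, not the paper's full attack.

```python
def confidence_guided_attack(text, candidates, query_confidence, max_edits=5):
    """Greedily apply word-level substitutions that most reduce the confidence the
    model reports for its own prediction. query_confidence(text) is a hypothetical
    helper that prompts the target model and parses an elicited confidence in [0, 1];
    candidates[word] maps a word to its substitute options."""
    words = text.split()
    conf = query_confidence(text)
    for _ in range(max_edits):
        best = (conf, None, None)
        for i, w in enumerate(words):
            for sub in candidates.get(w, []):
                trial = " ".join(words[:i] + [sub] + words[i + 1:])
                c = query_confidence(trial)
                if c < best[0]:
                    best = (c, i, sub)
        if best[1] is None:
            break                         # no substitution lowers confidence further
        conf, i, sub = best
        words[i] = sub
    return " ".join(words), conf

# stand-in "model": reported confidence drops once the word "great" disappears
fake_conf = lambda t: 0.9 if "great" in t else 0.4
adv, c = confidence_guided_attack("a great movie", {"great": ["fine", "odd"]}, fake_conf)
print(adv, c)
```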
Authors:Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai
Abstract:
Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully.
We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators.
Evaluations show that WaferLLM achieves up to 200$\times$ higher accelerator utilization than state-of-the-art methods. Leveraging a wafer-scale accelerator (Cerebras WSE2), WaferLLM delivers GEMV operations 606$\times$ faster and 16$\times$ more energy-efficient than on an NVIDIA A100 GPU. For full LLM inference, WaferLLM achieves 10-20$\times$ speedups over A100 GPU clusters running SGLang and vLLM. These advantages are expected to grow as wafer-scale AI models, software, and hardware continue to mature. WaferLLM is open-sourced at https://github.com/MeshInfra/WaferLLM.
中文: WaferLLM是首个晶圆级大语言模型推理系统,通过创新的PLMR模型和晶圆级并行技术优化AI加速器性能,相比现有方法实现了高达200倍的利用率提升及显著的运行速度和能效优势。
English: WaferLLM is the first wafer-scale LLM inference system that introduces a novel PLMR model and wafer-scale parallelism to optimize performance on AI accelerators, achieving up to 200x higher utilization and significant speed and energy efficiency improvements over existing methods.
Authors:Shurui Gui, Xiner Li, Shuiwang Ji
Abstract:
We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of ideal pendulum's natural motion $\alpha^2 \sin(\theta_t)$ by observing pendulum dynamics in different environments, such as the damped environment $\alpha^2 \sin(\theta_t) - \rho\omega_t$ and powered environment $\alpha^2 \sin(\theta_t) + \rho\frac{\omega_t}{\left|\omega_t\right|}$. Here, we formulate this problem as an \emph{invariant function learning} task and propose a new method, known as \textbf{D}isentanglement of \textbf{I}nvariant \textbf{F}unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws. Our code has been released as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenODE/DIF}{https://github.com/divelab/AIRS/}).
Chinese: 本研究提出了一种名为解耦不变函数(DIF)的新方法,通过因果分析和编码器-解码器超网络,在多环境中学习常微分方程系统的内在动力学,基于信息论原则保证不变函数的发现,并在揭示基本规律方面显示出高效性。
English: This study introduces the Disentanglement of Invariant Functions (DIF) method, which uses causal analysis and an encoder-decoder hypernetwork to learn the intrinsic dynamics of ODE systems across varied environments, ensuring discovery through an information-based principle and demonstrating effectiveness in uncovering fundamental laws.
Authors:Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli
Abstract:
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
中文: FocalCodec是一种基于焦点调制的高效低码率语音编解码器,仅使用单一二进制码本在0.16-0.65 kbps码率下实现优越的语音处理性能,同时有效保留语义和声学信息。
English: FocalCodec is an efficient low-bitrate speech codec using focal modulation and a single binary codebook, achieving competitive performance in speech tasks at 0.16-0.65 kbps while preserving both semantic and acoustic information.
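The single-binary-codebook idea reduces, at its simplest, to snapping each latent dimension to ±1 with a straight-through estimator, as in the toy module below; focal modulation and the compressor/decompressor modules of FocalCodec are not shown.

```python
import torch

class BinaryQuantizer(torch.nn.Module):
    """Toy single binary codebook: each latent dimension is snapped to +/-1, with a
    straight-through estimator so gradients still flow to the encoder."""
    def forward(self, z):
        codes = torch.sign(z.detach())
        codes[codes == 0] = 1.0
        # straight-through: the forward pass outputs the binary codes, while the
        # backward pass treats the quantizer as the identity w.r.t. z
        return z + (codes - z.detach())

q = BinaryQuantizer()
z = torch.randn(2, 16, requires_grad=True)
out = q(z)
out.sum().backward()
print(out.unique(), z.grad.abs().sum() > 0)   # binary outputs, non-zero encoder gradients
```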
Authors:Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan
Abstract:
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
中文摘要:KV缓存量化可提升大语言模型在长上下文和大批量场景下的推理效率,而KVTuner框架通过自适应优化分层量化精度,实现了近乎无损的性能和显著的吞吐量提升。
English Summary: KV cache quantization enhances LLM inference efficiency in long contexts and large batches, and the proposed KVTuner framework adaptively optimizes layer-wise quantization precision to achieve nearly lossless performance with significant throughput improvements.
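Applying an offline-searched, layer-wise precision plan at inference time can be sketched as below; the symmetric per-tensor quantizer and the example (key, value) bit-width plan are illustrative assumptions, not KVTuner's searched configurations.

```python
import torch

def quantize_kv(t, bits):
    """Symmetric per-tensor uniform quantization of a key or value cache tensor."""
    levels = 2 ** (bits - 1) - 1
    scale = t.abs().max() / levels
    return torch.clamp(torch.round(t / scale), -levels, levels) * scale

def quantize_cache(kv_cache, precision_plan):
    """Apply a layer-wise (key_bits, value_bits) plan found offline; keys typically
    get more bits than values, reflecting their larger impact on quantization error."""
    out = {}
    for layer, (k, v) in kv_cache.items():
        kb, vb = precision_plan[layer]
        out[layer] = (quantize_kv(k, kb), quantize_kv(v, vb))
    return out

cache = {0: (torch.randn(2, 128, 64), torch.randn(2, 128, 64)),
         1: (torch.randn(2, 128, 64), torch.randn(2, 128, 64))}
plan = {0: (4, 2), 1: (8, 4)}    # hypothetical per-layer (key, value) bit-widths
quantized = quantize_cache(cache, plan)
print({l: tuple(x.shape for x in kv) for l, kv in quantized.items()})
```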
Authors:Edgar Ramirez-Sanchez, Catherine Tang, Yaosheng Xu, Nrithya Renganathan, Vindula Jayawardana, Zhengbing He, Cathy Wu
Abstract:
The transportation sector significantly contributes to greenhouse gas emissions, necessitating accurate emission models to guide mitigation strategies. Despite its field validation and certification, the industry-standard Motor Vehicle Emission Simulator (MOVES) faces challenges related to complexity in usage, high computational demands, and its unsuitability for microscopic real-time applications. To address these limitations, we present NeuralMOVES, a comprehensive suite of high-performance, lightweight surrogate models for vehicle CO2 emissions. Developed based on reverse engineering and Neural Networks, NeuralMOVES achieves a remarkable 6.013% Mean Average Percentage Error relative to MOVES across extensive tests spanning over two million scenarios with diverse trajectories and the factors regarding environments and vehicles. NeuralMOVES is only 2.4 MB, largely condensing the original MOVES and the reverse engineered MOVES into a compact representation, while maintaining high accuracy. Therefore, NeuralMOVES significantly enhances accessibility while maintaining the accuracy of MOVES, simplifying CO2 evaluation for transportation analyses and enabling real-time, microscopic applications across diverse scenarios without reliance on complex software or extensive computational resources. Moreover, this paper provides, for the first time, a framework for reverse engineering industrial-grade software tailored specifically to transportation scenarios, going beyond MOVES. The surrogate models are available at https://github.com/edgar-rs/neuralMOVES.
中文: NeuralMOVES 是一种轻量级高精度车辆二氧化碳排放替代模型,解决了行业标准MOVES模拟器的复杂性和高计算需求问题,仅需2.4 MB存储空间且误差率为6.013%,可实现实时微观应用。
English: NeuralMOVES is a lightweight, high-accuracy surrogate model for vehicle CO₂ emissions that overcomes the complexity and computational demands of the industry-standard MOVES simulator, enabling real-time microscopic applications with only 2.4 MB size and 6.013% error rate.
Authors:Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Abstract:
Scaling large language models (LLMs) improves performance but dramatically increases inference costs. The feed-forward network (FFN), consuming approximately 70\% of inference compute, represents a critical bottleneck, particularly in large batch size scenarios. While mixture-of-experts (MoE) architectures leverage activation sparsity for efficiency, converting existing dense models to MoEs traditionally requires resource-intensive continual pre-training. We present CMoE, a framework that rapidly transforms dense LLMs into MoEs without training. The key innovation lies in analyzing FFN neuron activations to partition them into shared (always active) and routed experts. Routed neurons are clustered using a balanced assignment algorithm, and a differentiable router is constructed analytically from activation statistics, enabling immediate deployment or optional lightweight fine-tuning. Experiments demonstrate that, with an activation ratio of 75\%, it achieves remarkable results, delivering lossless precision in terms of perplexity while still maintaining a 5\% acceleration. Further experiments reveal that a CMoE configuration activating just 25\% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training. Moreover, a brief LoRA fine-tuning process (requiring only 1 hour and 2,000 samples) successfully recovers over 76\% of the dense model's downstream accuracy. By effectively balancing performance and efficiency, CMoE offers a viable path forward for deploying LLMs in real-world scenarios where computational resources are limited. We make our code publicly available at https://github.com/JarvisPei/CMoE.
中文: CMoE是一种无需训练即可将稠密大语言模型转换为混合专家模型的高效框架,通过分析神经元激活实现无损困惑度并提速5%,在仅激活25%参数时延迟降低1.5倍,且通过轻量微调可恢复76%下游任务准确率。
English: CMoE is a training-free framework that converts dense LLMs into efficient mixture-of-expert models by analyzing neuron activations, achieving lossless perplexity with 5% acceleration and enabling 1.5x latency reduction while preserving performance through optional lightweight fine-tuning.
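The neuron-partitioning step can be approximated as follows: rank FFN neurons by firing rate on calibration data, keep the most active ones as a shared expert, and cluster the rest into routed experts. The plain k-means clustering and thresholds below are illustrative stand-ins for CMoE's balanced assignment and analytic router construction.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_ffn_neurons(activations, n_shared=16, n_experts=4):
    """Toy neuron partitioning: measure how often each FFN neuron fires on
    calibration tokens, keep the most frequently active neurons as a shared
    (always-on) expert, and cluster the remaining neurons into routed experts
    by the similarity of their activation profiles."""
    freq = (activations > 0).mean(axis=0)             # firing rate per neuron
    order = np.argsort(-freq)
    shared = order[:n_shared]
    routed = order[n_shared:]
    labels = KMeans(n_clusters=n_experts, n_init=10).fit_predict(activations[:, routed].T)
    experts = [routed[labels == e] for e in range(n_experts)]
    return shared, experts

acts = np.maximum(np.random.randn(512, 256), 0)       # calibration activations (tokens x neurons)
shared, experts = partition_ffn_neurons(acts)
print(len(shared), [len(e) for e in experts])
```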
Authors:Long Chen, Xiaotian Song, Andy Song, BaDong Chen, Jiancheng Lv, Yanan Sun
Abstract:
Spiking Large Language Models have been shown to be a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs a full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance yet with significantly reduced inference latency and computational costs. Notably, FAS takes only eight timesteps to achieve accuracy 3\% higher than that of the OPT-7B model, while reducing energy consumption by 96.63\%. The source code is available at https://github.com/lc783/FAS
中文: 本文提出了一种新颖的快速人工神经网络-脉冲神经网络转换策略(FAS),通过两阶段微调和校准将大语言模型转换为脉冲神经网络,在显著降低延迟和能耗的同时实现了最先进的性能。
English: This paper introduces a novel Fast ANN-SNN conversion strategy (FAS) that transforms large language models into spiking neural networks through two-stage fine-tuning and calibration, achieving state-of-the-art performance with significantly reduced latency and energy consumption.
Authors:Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang
Abstract:
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
Chinese: Time-VLM提出了一种新颖的多模态框架,利用预训练的视觉语言模型整合时序、视觉和文本数据以提升时间序列预测性能,在少样本和零样本场景下表现尤为卓越。
English: Time-VLM introduces a novel multimodal framework that integrates temporal, visual, and textual data using pre-trained vision-language models to enhance time series forecasting, achieving superior performance especially in few-shot and zero-shot settings.
Authors:Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
Abstract:
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.
中文: CodeSteer是一种创新方法,能有效引导大型语言模型在文本推理与代码生成间切换,显著提升其在各类任务中的符号计算性能。
English: CodeSteer is an innovative method that enhances LLMs' ability to switch between textual reasoning and code generation, significantly boosting their symbolic computing performance across diverse tasks.
Authors:Yiming Huang, Tolga Birdal
Abstract:
Graph generation is a critical yet challenging task as empirical analyses require a deep understanding of complex, non-Euclidean structures. Although diffusion models have recently made significant achievements in graph generation, these models typically adapt from the frameworks designed for image generation, making them ill-suited for capturing the topological properties of graphs. In this work, we propose a novel Higher-order Guided Diffusion (HOG-Diff) model that follows a coarse-to-fine generation curriculum and is guided by higher-order information, enabling the progressive generation of plausible graphs with inherent topological structures. We further prove that our model exhibits a stronger theoretical guarantee than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks demonstrate that our method consistently outperforms or remains competitive with state-of-the-art baselines. Our code is available at https://github.com/Yiminghh/HOG-Diff.
中文: 该摘要提出HOG-Diff,一种高阶引导扩散框架,通过扩散桥实现从粗到细的生成过程,在分子和通用图生成任务中展现出优于或媲美现有先进方法的性能。
English: The abstract introduces HOG-Diff, a higher-order guided diffusion framework that generates graphs with inherent topological structures, demonstrating superior theoretical guarantees and competitive performance in molecular and generic graph generation tasks.
Authors:Zhao-Heng Yin, Changhao Wang, Luis Pineda, Francois Hogan, Krishna Bodduluri, Akash Sharma, Patrick Lancaster, Ishita Prasad, Mrinal Kalakrishnan, Jitendra Malik, Mike Lambeta, Tingfan Wu, Pieter Abbeel, Mustafa Mukadam
Abstract:
Teaching robots dexterous manipulation skills, such as tool use, presents a significant challenge. Current approaches can be broadly categorized into two strategies: human teleoperation (for imitation learning) and sim-to-real reinforcement learning. The first approach is difficult as it is hard for humans to produce safe and dexterous motions on a different embodiment without touch feedback. The second RL-based approach struggles with the domain gap and involves highly task-specific reward engineering on complex tasks. Our key insight is that RL is effective at learning low-level motion primitives, while humans excel at providing coarse motion commands for complex, long-horizon tasks. Therefore, the optimal solution might be a combination of both approaches. In this paper, we introduce DexterityGen (DexGen), which uses RL to pretrain large-scale dexterous motion primitives, such as in-hand rotation or translation. We then leverage this learned dataset to train a dexterous foundational controller. In the real world, we use human teleoperation as a prompt to the controller to produce highly dexterous behavior. We evaluate the effectiveness of DexGen in both simulation and real world, demonstrating that it is a general-purpose controller that can realize input dexterous manipulation commands and significantly improves stability by 10-100x measured as duration of holding objects across diverse tasks. Notably, with DexGen we demonstrate unprecedented dexterous skills including diverse object reorientation and dexterous tool use such as pen, syringe, and screwdriver for the first time.
中文摘要:DexGen将强化学习的底层运动技能与人类遥操作的高层指导相结合,开发出通用控制器,在复杂操作任务中显著提升了机器人的灵巧性和稳定性。
English Summary: DexGen combines reinforcement learning for low-level motion primitives with human teleoperation for high-level guidance, creating a general-purpose controller that significantly enhances robotic dexterity and stability in complex manipulation tasks.
Authors:Lirui Wang, Kevin Zhao, Chaoqi Liu, Xinlei Chen
Abstract:
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video predictions. HMA achieves better visual fidelity and controllability than previous robotic video generation models with 15 times faster speed in the real world. After post-training, this model can be used as a video simulator from low-level action inputs for evaluating policies and generating synthetic data. See this link https://liruiw.github.io/hma for more information.
Authors:Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov
Abstract:
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
中文摘要:本研究发现,由于模态内不对齐问题,在图像检索等模态内任务中单独使用CLIP的文本或图像编码器效果欠佳,但通过跨模态映射方法能显著提升性能。
English Summary: This study reveals that using CLIP's individual text or image encoders for intra-modal tasks is suboptimal due to intra-modal misalignment, but performance significantly improves by adopting inter-modal approaches through modality inversion techniques.
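To make the modality-inversion idea concrete, here is a minimal sketch under stated assumptions: features are assumed to be precomputed, L2-normalized CLIP-style embeddings, and the frozen "text tower" below is a random stand-in for the real text encoder operating on learnable input tokens. The optimization target and loop are illustrative, not the authors' exact OTI/OVI procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a frozen text tower that maps pseudo-token embeddings to the joint space.
# In the paper's setting this would be CLIP's text encoder applied to learnable input tokens;
# here it is a random frozen projection just so the sketch runs end to end.
text_tower = nn.Sequential(nn.Flatten(0), nn.Linear(8 * 512, 512))
for p in text_tower.parameters():
    p.requires_grad_(False)

def invert_image_to_text_side(img_feat, n_tokens=8, dim=512, steps=200, lr=0.05):
    """Optimize pseudo-token embeddings so the (frozen) text tower's output
    matches a given image feature -- a minimal form of modality inversion."""
    tokens = torch.randn(n_tokens, dim, requires_grad=True)
    opt = torch.optim.Adam([tokens], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        txt_feat = F.normalize(text_tower(tokens), dim=0)
        loss = 1.0 - F.cosine_similarity(txt_feat, img_feat, dim=0)
        loss.backward()
        opt.step()
    return F.normalize(text_tower(tokens).detach(), dim=0)

# Image-to-image retrieval done inter-modally: invert the query image, then score the gallery.
gallery = F.normalize(torch.randn(100, 512), dim=1)   # assumed precomputed image features
query = F.normalize(torch.randn(512), dim=0)
query_text_side = invert_image_to_text_side(query)
top5 = (gallery @ query_text_side).topk(5).indices
```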
Authors:Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli
Abstract:
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. The project webpage is available at https://liuyixin-louis.github.io/xattnmark/.
中文: 本文提出XAttnMark新型音频水印方法,通过共享参数、交叉注意力机制和心理声学损失,在应对各类音频变换时实现了卓越的鲁棒性和不可感知性,突破了现有技术的局限。
English: This paper introduces XAttnMark, a novel audio watermarking method that overcomes limitations of existing approaches by integrating shared parameters, cross-attention mechanisms, and psychoacoustic loss to achieve superior robustness and imperceptibility against audio transformations.
Authors:Shaopeng Fu, Liang Ding, Jingfeng Zhang, Di Wang
Abstract:
Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long-length adversarial prompts during AT might lead to strong LLM robustness, their synthesis however is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the length during AT. Our findings show that it is practical to defend against ``long-length'' jailbreak attacks via efficient ``short-length'' AT. The code is available at https://github.com/fshp971/adv-icl.
中文: 本研究通过理论分析和实证验证表明,针对大型语言模型的长对抗后缀越狱攻击,可以通过使用显著缩短的对抗提示进行高效对抗训练来实现有效防御。
English: This study demonstrates that defending against long adversarial suffix jailbreak attacks in large language models can be effectively achieved through efficient adversarial training using significantly shorter adversarial prompts, as confirmed by both theoretical analysis and empirical results.
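The abstract's quantitative claim can be restated compactly. The display below is an informal paraphrase (not the paper's formal theorem): choosing the training suffix length on the order of the square root of the test-time suffix length keeps the stated bound at a constant level.

```latex
% Informal restatement of the abstract's quantitative claim:
% training on suffixes of length ~sqrt(M) suffices against test-time suffixes of length M,
% because the robust generalization bound scales with sqrt(M_test) / M_train.
M_{\text{train}} = \Theta\!\left(\sqrt{M_{\text{test}}}\right)
\quad\Longrightarrow\quad
\Theta\!\left(\frac{\sqrt{M_{\text{test}}}}{M_{\text{train}}}\right) = \Theta(1).
```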
Authors:Qinhan Yu, Zhiyou Xiao, Binghui Li, Zhengren Wang, Chong Chen, Wentao Zhang
Abstract:
Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that can leverage LLMs/MLLMs to generate multimodal responses. Our datasets and complete evaluation results for 11 popular generative models are available at https://github.com/MRAMG-Bench/MRAMG.
中文: 本文提出了多模态检索增强多模态生成(MRAMG)任务,旨在生成图文结合的答案,并通过提供包含丰富数据和评估指标的MRAMG-Bench来弥补该领域基准的缺失。
English: This paper introduces the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task for generating combined text and image answers, addressing the lack of benchmarks by providing MRAMG-Bench with extensive data and metrics for evaluation.
Authors:Jost Arndt, Utku Isil, Michael Detzel, Wojciech Samek, Jackie Ma
Abstract:
Many physical processes can be expressed through partial differential equations (PDEs). Real-world measurements of such processes are often collected at irregularly distributed points in space, which can be effectively represented as graphs; however, there are currently only a few existing datasets. Our work aims to make advancements in the field of PDE-modeling accessible to the temporal graph machine learning community, while addressing the data scarcity problem, by creating and utilizing datasets based on PDEs. In this work, we create and use synthetic datasets based on PDEs to support spatio-temporal graph modeling in machine learning for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by benchmarking several machine learning models on the epidemiological dataset. Additionally, we show how pre-training on this dataset can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. The source code for our methodology and the three created datasets can be found on https://github.com/github-usr-ano/Temporal_Graph_Data_PDEs.
中文: 本研究通过创建基于三个灾害建模偏微分方程的合成数据集,解决了时空图机器学习领域的数据稀缺问题,并通过基准测试和预训练应用展示了其实用价值。
English: This work addresses data scarcity in PDE-based temporal graph machine learning by creating synthetic datasets from three disaster-modeling equations and demonstrating their utility through benchmarking and pre-training applications.
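For readers unfamiliar with the general recipe, here is a minimal sketch, assuming a simple diffusion-type equation: sample irregular measurement points, connect them with a k-NN graph, and roll the PDE forward with the graph Laplacian to obtain node time series. The repository's actual generators (epidemiology, atmospheric particles, tsunami) are more elaborate than this toy.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(points, k=5):
    """Symmetric k-NN adjacency over irregularly placed measurement points."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)          # first neighbor is the point itself
    n = len(points)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, idx[i, 1:]] = 1.0
    return np.maximum(A, A.T)

def simulate_diffusion(A, u0, steps=200, dt=0.01, kappa=1.0):
    """Explicit Euler roll-out of du/dt = -kappa * L u on the graph (heat-type PDE)."""
    L = np.diag(A.sum(1)) - A                      # combinatorial graph Laplacian
    u, traj = u0.copy(), [u0.copy()]
    for _ in range(steps):
        u = u - dt * kappa * (L @ u)
        traj.append(u.copy())
    return np.stack(traj)                          # (steps+1, n_nodes) spatio-temporal signal

rng = np.random.default_rng(0)
pts = rng.random((64, 2))                          # irregular sensor locations in the unit square
A = knn_graph(pts)
data = simulate_diffusion(A, rng.random(64))       # one synthetic temporal-graph sample
```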
Authors:Nikunj Gupta, James Zachary Hare, Rajgopal Kannan, Viktor Prasanna
Abstract:
This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. However, existing approaches rely solely on pairwise relations between agents, which potentially oversimplifies complex multi-agent interactions. DMCG goes beyond these simple direct interactions by also capturing useful higher-order and indirect relationships among agents. It generates novel graph structures accommodating multiple types of interactions and arbitrary lengths of multi-hop connections in coordination graphs to model such interactions. It then employs a graph convolutional network module to learn powerful representations in an end-to-end manner. We demonstrate its effectiveness in multiple coordination problems in MARL where other state-of-the-art methods can suffer from sample inefficiency or fail entirely. All codes can be found here: https://github.com/Nikunj-Gupta/dmcg-marl.
Chinese: 本文提出深度元协调图(DMCG),通过构建新型图结构和图卷积网络来建模智能体间的高阶间接交互,在多智能体强化学习的复杂协作任务中显著优于现有方法。
English: This paper introduces Deep Meta Coordination Graphs (DMCG), which enhance multi-agent reinforcement learning by modeling higher-order and indirect agent interactions through novel graph structures and graph convolutional networks, outperforming existing methods in complex coordination tasks.
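A minimal sketch of the multi-hop ingredient is given below: agent features are propagated over 1..K-hop neighborhoods of a coordination graph with a graph convolution. This is only a toy stand-in; the actual DMCG additionally generates the graph structure and models multiple interaction types.

```python
import torch
import torch.nn as nn

class MultiHopGraphConv(nn.Module):
    """Aggregate agent features over 1..K-hop neighborhoods of a coordination graph."""
    def __init__(self, in_dim, out_dim, n_hops=3):
        super().__init__()
        self.n_hops = n_hops
        self.proj = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(n_hops))

    def forward(self, x, adj):
        # x: (n_agents, in_dim), adj: (n_agents, n_agents) binary coordination graph
        A = adj + torch.eye(adj.size(0))             # add self-loops
        A = A / A.sum(dim=1, keepdim=True)           # row-normalized propagation matrix
        out, A_k = 0.0, torch.eye(adj.size(0))
        for k in range(self.n_hops):
            A_k = A_k @ A                            # (k+1)-hop propagation
            out = out + self.proj[k](A_k @ x)        # separate projection per hop length
        return torch.relu(out)

# Toy usage: 4 agents on a line graph, 8-dim observations -> 16-dim coordination features.
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
feats = MultiHopGraphConv(8, 16)(torch.randn(4, 8), adj)
```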
Authors:Keonvin Park, Jisu Kim, Jaemin Seo
Abstract:
This paper introduces PINT (Physics-Informed Neural Time Series Models), a framework that integrates physical constraints into neural time series models to improve their ability to capture complex dynamics. We apply PINT to the ERA5 WeatherBench dataset, focusing on long-term forecasting of 2m-temperature data. PINT incorporates the Simple Harmonic Oscillator Equation as a physics-informed prior, embedding its periodic dynamics into RNN, LSTM, and GRU architectures. This equation's analytical solutions (sine and cosine functions) facilitate rigorous evaluation of the benefits of incorporating physics-informed constraints. By benchmarking against a linear regression baseline derived from its exact solutions, we quantify the impact of embedding physical principles in data-driven models. Unlike traditional time series models that rely on future observations, PINT is designed for practical forecasting. Using only the first 90 days of observed data, it iteratively predicts the next two years, addressing challenges posed by limited real-time updates. Experiments on the WeatherBench dataset demonstrate PINT's ability to generalize, capture periodic trends, and align with physical principles. This study highlights the potential of physics-informed neural models in bridging machine learning and interpretable climate applications.
Our models and datasets are publicly available on GitHub: https://github.com/KV-Park.
中文: 本文提出PINT物理约束神经网络框架,通过将简谐振荡方程嵌入时序模型,仅用90天观测数据即可实现两年期温度预测,有效提升气象应用的物理一致性与泛化能力。
English: This paper presents PINT, a physics-informed neural framework that embeds harmonic oscillator dynamics into time series models to enhance long-term weather forecasting accuracy and physical consistency using limited initial data.
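A minimal sketch of how a simple-harmonic-oscillator prior can be attached to a recurrent forecaster as a soft constraint is shown below. The finite-difference residual, loss weight, and GRU size are illustrative assumptions, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    """Plain GRU mapping a window of past values to the next H values."""
    def __init__(self, horizon=32, hidden=64):
        super().__init__()
        self.gru = nn.GRU(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                      # x: (batch, window, 1)
        _, h = self.gru(x)
        return self.head(h[-1])                # (batch, horizon)

def sho_residual(pred, omega=2 * math.pi / 365.0, dt=1.0):
    """Finite-difference residual of d^2x/dt^2 + omega^2 x = 0 on the predicted trajectory."""
    d2x = pred[:, 2:] - 2 * pred[:, 1:-1] + pred[:, :-2]
    return d2x / dt**2 + omega**2 * pred[:, 1:-1]

def pint_loss(pred, target, lam=0.1):
    """Data-fitting term plus a physics-consistency penalty (weight lam is a free choice)."""
    data = nn.functional.mse_loss(pred, target)
    physics = sho_residual(pred).pow(2).mean()
    return data + lam * physics

model = GRUForecaster()
x, y = torch.randn(8, 90, 1), torch.randn(8, 32)   # e.g. 90 observed days -> 32-step forecast
loss = pint_loss(model(x), y)
loss.backward()
```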
Authors:Yousef Koka, David Selby, Gerrit Großmann, Sebastian Vollmer
Abstract:
Data preprocessing is a critical yet frequently neglected aspect of machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area.
To address this gap, this paper presents 'CleanSurvival', a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival
Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.
中文: 本文提出"CleanSurvival"框架,基于强化学习为生存分析任务自动化数据预处理,相比传统方法能显著提升模型性能并加快优化速度。
English: This paper introduces "CleanSurvival," a reinforcement learning-based framework that automates data preprocessing for survival analysis, enhancing model performance and efficiency compared to standard methods.
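For intuition, here is a minimal sketch of the underlying idea: treat the choice of preprocessing steps as actions and reward the downstream model's validation score. The action space, the one-state (bandit-style) simplification of Q-learning, and the placeholder evaluation function are all assumptions; the actual CleanSurvival package has its own state/action design and supports Cox, random forest, and neural time-to-event models.

```python
import random
from itertools import product

# Hypothetical action space: one choice per preprocessing stage.
IMPUTERS = ["mean", "knn", "mice"]
OUTLIERS = ["none", "iqr", "isolation_forest"]
FEATURES = ["raw", "pca", "select_k_best"]
ACTIONS = list(product(IMPUTERS, OUTLIERS, FEATURES))

def evaluate_pipeline(action):
    """Placeholder: would fit the chosen pipeline plus a survival model and return,
    e.g., the validation concordance index. Here it is a noisy stand-in."""
    return random.random()

# Epsilon-greedy value learning over pipeline choices (single-state simplification).
Q = {a: 0.0 for a in ACTIONS}
alpha, eps = 0.2, 0.3
for episode in range(300):
    a = random.choice(ACTIONS) if random.random() < eps else max(Q, key=Q.get)
    reward = evaluate_pipeline(a)
    Q[a] += alpha * (reward - Q[a])              # incremental value update

print("best preprocessing pipeline:", max(Q, key=Q.get))
```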
Authors:Heyi Zhang, Yule Liu, Xinlei He, Jun Wu, Tianshuo Cong, Xinyi Huang
Abstract:
Federated learning (FL) enables collaborative model training while preserving data privacy, but its decentralized nature exposes it to client-side data poisoning attacks (DPAs) and model poisoning attacks (MPAs) that degrade global model performance. While numerous proposed defenses claim substantial effectiveness, their evaluation is typically done in isolation with limited attack strategies, raising concerns about their validity. Additionally, existing studies overlook the mutual effectiveness of defenses against both DPAs and MPAs, causing fragmentation in this field. This paper aims to provide a unified benchmark and analysis of defenses against DPAs and MPAs, clarifying the distinction between these two similar but slightly distinct domains. We present a systematic taxonomy of poisoning attacks and defense strategies, outlining their design, strengths, and limitations. Then, a unified comparative evaluation across FL algorithms and data heterogeneity is conducted to validate their individual and mutual effectiveness and derive key insights for design principles and future research. Along with the analysis, we frame our work to a unified benchmark, FLPoison, with high modularity and scalability to evaluate 15 representative poisoning attacks and 17 defense strategies, facilitating future research in this domain. Code is available at https://github.com/vio1etus/FLPoison.
Chinese Summary: 本文提出了FLPoison统一基准,系统评估了联邦学习中的15种投毒攻击和17种防御策略,以解决该领域研究碎片化问题,并验证它们对数据和模型投毒威胁的共同防御效果。
English Summary: This paper introduces FLPoison, a unified benchmark that systematically evaluates 15 poisoning attacks and 17 defense strategies in federated learning to address fragmented research and validate their effectiveness against both data and model poisoning threats.
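For concreteness, the sketch below shows one representative defense family that such benchmarks typically cover: robust server-side aggregation via the coordinate-wise median or trimmed mean of client updates. This is a generic illustration, not FLPoison's own interface or defense set.

```python
import torch

def coordinate_median(updates):
    """Robust server aggregation: coordinate-wise median of flattened client updates.

    updates: (n_clients, n_params). A single poisoned client cannot drag any
    coordinate arbitrarily far from the honest majority.
    """
    return updates.median(dim=0).values

def trimmed_mean(updates, trim_ratio=0.2):
    """Drop the largest/smallest trim_ratio fraction per coordinate, average the rest."""
    n = updates.size(0)
    k = int(n * trim_ratio)
    sorted_u, _ = updates.sort(dim=0)
    return sorted_u[k:n - k].mean(dim=0)

# Toy check: 8 honest clients near the true update, 2 poisoned clients far away.
honest = torch.randn(8, 1000) * 0.1
poisoned = torch.full((2, 1000), 50.0)
updates = torch.cat([honest, poisoned])
print(coordinate_median(updates).abs().mean())   # stays near 0
print(updates.mean(dim=0).abs().mean())          # plain averaging is pulled toward 10
```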
Authors:Chhavi Yadav, Evan Monroe Laufer, Dan Boneh, Kamalika Chaudhuri
Abstract:
In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand \cite{bordt2022post}. In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests. Our code is publicly available at https://github.com/emlaufer/ExpProof.
Chinese: 本文针对对抗性场景下可解释性方法的失效问题,提出利用零知识证明(ZKP)实现可解释性的操作化,特别开发了兼容ZKP的LIME算法版本,并在神经网络和随机森林上进行了性能评估。
English: This paper addresses the failure of explainability methods in adversarial settings by proposing Zero-Knowledge Proofs (ZKP) to operationalize explanations, specifically developing ZKP-amenable versions of LIME and evaluating them on Neural Networks and Random Forests.
Authors:Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C. -W. Phan
Abstract:
Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: https://github.com/HrishavBakulBarua/DarkBERT
中文摘要:本文提出MD-BERT多流框架,通过动态特征融合模块整合增强视频输入与BERT时序模型,在暗光视频行为识别任务中实现了最优性能。
English Summary: This paper introduces MD-BERT, a multi-stream framework that combines enhanced video inputs with a BERT-based temporal model to achieve state-of-the-art action recognition in dark videos through dynamic feature fusion.
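A minimal sketch of the two complementary enhancement streams that the multi-stream input relies on (gamma correction and histogram equalization) is given below, implemented with NumPy on a single-channel frame. MD-BERT's actual preprocessing pipeline, DFF fusion module, and BERT head are not reproduced here.

```python
import numpy as np

def gamma_correct(frame, gamma=2.2):
    """Brighten an under-exposed frame: out = in^(1/gamma) on [0, 1] intensities."""
    x = frame.astype(np.float32) / 255.0
    return (np.power(x, 1.0 / gamma) * 255.0).astype(np.uint8)

def hist_equalize(frame):
    """Global histogram equalization of a single-channel uint8 frame."""
    hist, _ = np.histogram(frame.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1)   # normalize to [0, 1]
    lut = (cdf * 255.0).astype(np.uint8)
    return lut[frame]                                         # apply the lookup table

dark = np.random.randint(0, 40, size=(224, 224), dtype=np.uint8)      # stand-in low-light frame
streams = np.stack([dark, gamma_correct(dark), hist_equalize(dark)])  # raw + two enhanced views
```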
Authors:Gian Mario Favero, Parham Saremi, Emily Kaczmarek, Brennan Nichyporuk, Tal Arbel
Abstract:
Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/.
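The generic diffusion-classifier recipe the abstract builds on can be sketched as follows: score each candidate class by its conditional denoising error, repeat over several noise draws, and take a majority vote. The `noise_pred_model` call signature and the noise schedule below are hypothetical placeholders, not the paper's implementation.

```python
import math
import torch

def diffusion_classify(x, noise_pred_model, n_classes, n_votes=8, T=1000):
    """Pick the class whose conditioning yields the lowest denoising error, majority-voted.

    x: (C, H, W) image; noise_pred_model(x_t, t, y) -> predicted noise (hypothetical API).
    """
    votes = torch.zeros(n_classes)
    for _ in range(n_votes):
        t = torch.randint(1, T, (1,))
        noise = torch.randn_like(x)
        alpha_bar = torch.cos(0.5 * math.pi * t.float() / T) ** 2      # toy noise schedule
        x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise
        errors = torch.stack([
            (noise_pred_model(x_t, t, torch.tensor([y])) - noise).pow(2).mean()
            for y in range(n_classes)
        ])
        votes[errors.argmin()] += 1                                    # one vote per noise draw
    return votes.argmax().item()

# Toy usage with a dummy predictor (a real class-conditional diffusion model goes here).
dummy = lambda x_t, t, y: torch.zeros_like(x_t)
label = diffusion_classify(torch.randn(3, 64, 64), dummy, n_classes=2)
```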
Authors:Kushagra Pandey, Farrin Marouf Sofian, Felix Draxler, Theofanis Karaletsos, Stephan Mandt
Abstract:
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear, non-linear, and blind inverse problems without requiring additional model training or specificity to pixel or latent space diffusion models. Our code will be available at https://github.com/czi-ai/oc-guidance
中文摘要:扩散轨迹匹配(DTM)通过变分推理统一了预训练扩散模型的多种引导方法,无需额外训练或特定空间适配,即在多种逆问题上实现了最先进的性能。
English Summary: Diffusion Trajectory Matching (DTM) unifies various guidance methods for pretrained diffusion models through variational inference, achieving state-of-the-art performance on multiple inverse problems without requiring additional training or space-specific adaptations.
Authors:Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, Dimitris N. Metaxas
Abstract:
Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded contents. In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss - visually grounded tokens gradually become less favored throughout generation; (2) early excitation - semantically meaningful tokens reach peak activation in layers earlier than the final layer; and (3) hidden genuine information - visually grounded tokens that are not eventually decoded still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA on average reduces hallucination by about 40% on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at https://github.com/LzVv123456/VISTA.
中文: 大型视觉语言模型因视觉信息逐渐丢失和早期语义激发而产生幻觉内容,为此提出的VISTA框架通过增强视觉激活与利用早期层输出,无需训练即可有效减少幻觉现象。
English: Large Vision-Language Models often generate visually ungrounded content due to gradual loss of visual information and early token excitation, prompting the development of VISTA, a training-free framework that reduces hallucination by reinforcing visual activations and leveraging early layer outputs.
Authors:Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
Abstract:
In this work, we present a novel approach to multi-label chest X-ray (CXR) image classification that enhances clinical interpretability while maintaining a streamlined, single-model, single-run training pipeline. Leveraging the CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical label groupings to capture clinically meaningful relationships between diagnoses. To achieve this, we designed a custom hierarchical binary cross-entropy (HBCE) loss function that enforces label dependencies using either fixed or data-driven penalty types. Our model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set. Additionally, we provide visual explanations and uncertainty estimations to further enhance model interpretability. All code, model configurations, and experiment details are made available.
中文: 本研究提出了一种新颖的胸部X光多标签分类方法,通过自定义分层损失函数增强临床可解释性,在测试集上取得了0.903的平均AUROC值,同时提供可视化解释和不确定性评估。
English: This study introduces a hierarchical multi-label classification method for chest X-rays that improves clinical interpretability through a custom loss function while maintaining high diagnostic accuracy with a 0.903 AUROC score.
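A minimal sketch of a hierarchy-aware BCE loss in the spirit described is shown below: standard BCE over all labels plus a penalty whenever a child finding is predicted more confidently than its parent group. The label hierarchy and the hinge form of the penalty are illustrative assumptions; the paper's fixed and data-driven penalty variants are not reproduced.

```python
import torch
import torch.nn.functional as F

# Hypothetical hierarchy: child label index -> parent (group) label index.
PARENT = {3: 0, 4: 0, 5: 1}      # e.g. findings 3, 4 belong to group 0; finding 5 to group 1

def hierarchical_bce(logits, targets, parent=PARENT, penalty=1.0):
    """BCE over all labels plus a hinge discouraging p(child) > p(parent)."""
    base = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    consistency = 0.0
    for child, par in parent.items():
        consistency = consistency + F.relu(probs[:, child] - probs[:, par]).mean()
    return base + penalty * consistency

logits = torch.randn(16, 6, requires_grad=True)      # 16 studies, 6 labels
targets = torch.randint(0, 2, (16, 6)).float()
hierarchical_bce(logits, targets).backward()
```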
Authors:Liran Nochumsohn, Hedi Zisling, Omri Azencot
Abstract:
Accurate forecasting of multivariate time series data is important in many engineering and scientific applications. Recent state-of-the-art works ignore the inter-relations between variates, applying their models to each variate independently. This raises several research questions related to proper modeling of multivariate data. In this work, we propose to view multivariate forecasting as a multi-task learning problem, facilitating the analysis of forecasting by considering the angle between task gradients and their balance. To do so, we analyze linear models to characterize the behavior of tasks. Our analysis suggests that tasks can be defined by grouping similar variates together, which we achieve via a simple clustering that depends on correlation-based similarities. Moreover, to balance tasks, we scale gradients with respect to their prediction error. Then, each task is solved with a linear model within our MTLinear framework. We evaluate our approach on challenging benchmarks in comparison to strong baselines, and we show it obtains on-par or better results on multivariate forecasting problems. The implementation is available at: https://github.com/azencot-group/MTLinear
中文: 本研究提出MTLinear多任务学习框架,通过将相关变量聚类为任务并基于预测误差平衡梯度,在多变量时间序列预测问题上取得了与现有方法相当或更优的性能。
English: This study introduces MTLinear, a multi-task learning framework for multivariate time series forecasting that groups correlated variates into tasks and balances their gradients based on prediction errors, achieving competitive or superior performance compared to existing methods.
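Two of the ingredients described above can be sketched compactly: group variates by a correlation-based similarity, then fit one linear forecaster per group. The clustering method, lookback window, and least-squares fit below are illustrative assumptions; the gradient-balancing step and MTLinear's exact design are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_variates(series, n_tasks=3):
    """series: (T, V). Group variates using the correlation-based distance 1 - |corr|."""
    corr = np.corrcoef(series.T)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_tasks, criterion="maxclust")      # task id per variate

def fit_linear_task(series, lookback=24):
    """One linear map from the last `lookback` steps of a task's variates to their next value."""
    X = np.stack([series[i:i + lookback].ravel() for i in range(len(series) - lookback)])
    y = series[lookback:]
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return W

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 8))                          # (T, V) multivariate series
labels = cluster_variates(data)
models = {t: fit_linear_task(data[:, labels == t]) for t in np.unique(labels)}
```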
Authors:Darina Koishigarina, Arnas Uselis, Seong Joon Oh
Abstract:
CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.
中文: CLIP常因跨模态对齐问题无法正确绑定属性与对象,而LABCLIP通过在计算相似度前对文本嵌入进行线性变换,显著提升了这种绑定能力。
English: CLIP often fails to bind attributes to objects due to cross-modal alignment issues, but LABCLIP enhances this by applying a linear transformation to text embeddings before similarity computation.
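A minimal sketch of the reported fix follows: learn a linear map applied to text embeddings before cosine similarity is computed. The frozen features, contrastive objective, and temperature below are placeholders standing in for real CLIP features and the paper's training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearTextAlign(nn.Module):
    """Linear transformation applied to text embeddings before cosine scoring."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def score(self, img_feat, txt_feat):
        t = F.normalize(self.W(txt_feat), dim=-1)
        i = F.normalize(img_feat, dim=-1)
        return i @ t.T                                  # cosine similarity after alignment

# Training sketch: pull matched (image, caption) embeddings together contrastively.
align = LinearTextAlign()
opt = torch.optim.Adam(align.parameters(), lr=1e-3)
img = F.normalize(torch.randn(32, 512), dim=-1)         # assumed frozen image features
txt = F.normalize(torch.randn(32, 512), dim=-1)         # assumed frozen text features
for _ in range(10):
    opt.zero_grad()
    logits = align.score(img, txt) / 0.07                # temperature-scaled similarities
    loss = F.cross_entropy(logits, torch.arange(32))     # matched pairs lie on the diagonal
    loss.backward()
    opt.step()
```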
Authors:Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi
Abstract:
Model-based reinforcement learning algorithms that combine model-based planning and learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often result in \emph{persistent value overestimation}. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data generation policy that is always bootstrapped by the planner and the learned policy prior. To mitigate such a mismatch in a minimalist way, we propose a policy regularization term reducing out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimum changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly in 61-DoF humanoid tasks. View qualitative results at https://darthutopian.github.io/tdmpc_square/.
Authors:SiYeoul Lee, SeonHo Kim, Minkyung Seo, SeongKyu Park, Salehin Imrus, Kambaluru Ashok, DongEon Lee, Chunsu Park, SeonYeong Lee, Jiye Kim, Jae-Heung Yoo, MinWoo Kim
Abstract:
This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an innovative adaptation of the self-attention mechanism, which effectively exploits the critical regions, such as fully-developed speckle area or high-echogenic tissue area within successive ultrasound images to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: https://github.com/guhong3648/US3D
中文: 本研究提出MoGLo-Net运动学习网络,通过全局-局部自注意力模块改进手持式光声超声成像的3D重建,无需外部传感器即可精确估计运动参数,性能超越现有方法,并可扩展至多普勒和光声血管成像应用。
English: This study presents MoGLo-Net, a motion-based learning network with a global-local self-attention module that improves 3D reconstruction in handheld PAUS imaging by accurately estimating motion parameters without external sensors, outperforming current methods and extending to Doppler and photoacoustic vascular visualization.
Authors:Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry
Abstract:
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.
Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks
中文摘要:现有大语言模型基准常存在标签错误而掩盖可靠性问题,为此提出的铂金基准揭示了先进模型在简单任务上仍存在持续性缺陷。
English Summary: Current benchmarks for large language models often contain label errors that obscure reliability issues, prompting the creation of platinum benchmarks which reveal persistent failures in even advanced models on simple tasks.
Authors:Rui Pan, Boyao Wang, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang
Abstract:
Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-training from scratch, 3) incremental pruning brings non-trivial performance gain by interleaving pruning with training and only removing a small portion of neurons ($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.
中文: 自适应剪枝结合增量训练使小型语言模型在显著降低计算成本的同时,性能可媲美预训练模型,并优于传统剪枝方法。
English: Adaptive pruning combined with incremental training enables small language models to achieve performance comparable to pre-trained models while significantly reducing computational costs and outperforming conventional pruning methods.
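The incremental schedule the abstract describes (remove a small fraction of neurons, briefly train, repeat) can be sketched on a single FFN block as below. The output-weight-norm importance score and the fixed 5% rate are common stand-ins; the layer-wise adaptive sparsity allocation that defines Adapt-Pruner is not reproduced.

```python
import torch
import torch.nn as nn

def prune_ffn_neurons(up, down, keep_idx):
    """Shrink a two-layer FFN (up: d->h, down: h->d) to the kept hidden neurons."""
    new_up = nn.Linear(up.in_features, len(keep_idx))
    new_down = nn.Linear(len(keep_idx), down.out_features)
    new_up.weight.data = up.weight.data[keep_idx].clone()
    new_up.bias.data = up.bias.data[keep_idx].clone()
    new_down.weight.data = down.weight.data[:, keep_idx].clone()
    new_down.bias.data = down.bias.data.clone()
    return new_up, new_down

up, down = nn.Linear(256, 1024), nn.Linear(1024, 256)
for step in range(4):                                   # incremental: ~5% of neurons per round
    importance = down.weight.data.norm(dim=0)           # score hidden units by outgoing norm
    n_keep = int(up.out_features * 0.95)
    keep = importance.topk(n_keep).indices.sort().values
    up, down = prune_ffn_neurons(up, down, keep)
    # ... a short recovery-training phase on real data would go here ...
print(up.out_features)                                  # 1024 -> 832 after four ~5% rounds
```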
Authors:Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
Abstract:
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
中文: 本研究通过监督微调和强化学习的系统实验,揭示了增强大语言模型中长思维链推理能力的关键因素,包括奖励塑造和训练计算扩展,并发现基础模型虽具备核心能力但需精细优化策略。
English: This study systematically investigates how to enhance long chain-of-thought reasoning in large language models through supervised fine-tuning and reinforcement learning, identifying key factors like reward shaping and training compute scaling while revealing that core abilities exist in base models but require careful optimization.
Authors:Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Sethu Vijayakumar, Chris Xiaoxuan Lu, Oisin Mac Aodha
Abstract:
The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise even in the presence of minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, enhancing robustness when evaluated on out-of-distribution scenes. Our experiments demonstrate significant performance improvements, particularly in PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.
Authors:Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
Abstract:
Recently computer-aided diagnosis has demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases biases algorithms toward the majority categories, leading to poor performance on rare categories. Existing works formulated this challenge as a long-tailed problem and attempted to tackle it by decoupling the feature representation and classification. Yet, due to the imbalanced distribution and limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme to boost the representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme to calibrate the classifier iteratively. This is achieved by generating a large number of balanced virtual features and fine-tuning the encoder using an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while maintaining the diagnostic knowledge in majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at https://github.com/peterlipan/LMD.
中文: 提出的长尾医疗诊断框架通过关系感知表征学习和迭代分类器校准解决医学图像中的类别不平衡问题,在三个公共数据集上显著超越了现有最优方法。
English: The proposed Long-tailed Medical Diagnosis (LMD) framework addresses class imbalance in medical image analysis through Relation-aware Representation Learning and Iterative Classifier Calibration, achieving state-of-the-art performance on three public datasets.
Authors:Xiangyu Dong, Xingyi Zhang, Lei Chen, Mingxuan Yuan, Sibo Wang
Abstract:
Node Anomaly Detection (NAD) has gained significant attention in the deep learning community due to its diverse applications in real-world scenarios. Existing NAD methods primarily embed graphs within a single Euclidean space, while overlooking the potential of non-Euclidean spaces. Besides, to address the prevalent issue of limited supervision in real NAD tasks, previous methods tend to leverage synthetic data to collect auxiliary information, which is not an effective solution as shown in our experiments. To overcome these challenges, we introduce a novel SpaceGNN model designed for NAD tasks with extremely limited labels. Specifically, we provide deeper insights into a task-relevant framework by empirically analyzing the benefits of different spaces for node representations, based on which, we design a Learnable Space Projection function that effectively encodes nodes into suitable spaces. Besides, we introduce the concept of weighted homogeneity, which we empirically and theoretically validate as an effective coefficient during information propagation. This concept inspires the design of the Distance Aware Propagation module. Furthermore, we propose the Multiple Space Ensemble module, which extracts comprehensive information for NAD under conditions of extremely limited supervision. Our findings indicate that this module is more beneficial than data augmentation techniques for NAD. Extensive experiments conducted on 9 real datasets confirm the superiority of SpaceGNN, which outperforms the best rival by an average of 8.55% in AUC and 4.31% in F1 scores. Our code is available at https://github.com/xydong127/SpaceGNN.
中文摘要:SpaceGNN模型通过将节点编码到合适的非欧几里得空间并引入加权同质性概念来改进信息传播,有效解决了节点异常检测中监督信息有限的问题,在多个真实数据集上表现出优越性能。
English Summary: The SpaceGNN model addresses limitations in Node Anomaly Detection by encoding nodes into suitable non-Euclidean spaces and introducing weighted homogeneity for improved information propagation, achieving superior performance with limited supervision.
Authors:Yuchao Wu, Xiaofei Yu, Hao Chen, Yang Luo, Yeyu Tong, Yuzhe Ma
Abstract:
While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs)-a promising solution to advanced chip designs-remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field. Our benchmark and evaluation code is available at https://github.com/PICDA/PICBench.
中文摘要:本文提出了PICBench,这是首个利用大语言模型自动生成光子集成电路网表,并通过与专家方案对比来评估设计语法和功能性的基准测试框架。
English Summary: This paper introduces PICBench, the first benchmarking framework that uses large language models to automate photonic integrated circuit design by generating netlists and evaluating their syntax and functionality against expert solutions.
Authors:Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, Zhizheng Wu
Abstract:
We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at https://metis-demo.github.io/.
Chinese Summary: Metis是一种统一语音生成的基础模型,通过大规模无标签语音数据的预训练和任务特定微调,在五项语音生成任务中超越现有最优系统,且仅需极少参数和训练数据。
English Summary: Metis is a foundation model for unified speech generation that uses pre-training on large-scale unlabeled speech data and fine-tuning for diverse tasks, outperforming state-of-the-art systems across five speech generation tasks with minimal parameters and data.
Authors:Dan MacKinlay
Abstract:
The Ensemble Kalman Filter (EnKF) is a widely used method for data assimilation in high-dimensional systems, with an ensemble update step equivalent to an empirical version of the Matheron update popular in Gaussian process regression -- a connection that links half a century of data-assimilation engineering to modern path-wise GP sampling. This paper provides a compact introduction to this simple but under-exploited connection, with necessary definitions accessible to all fields involved. Source code is available at https://github.com/danmackinlay/paper_matheron_equals_enkf .
中文: 集合卡尔曼滤波的集合更新步骤等同于高斯过程回归中的马瑟隆更新,这一联系将半个世纪的数据同化工程与现代路径式GP采样技术联系起来。
English: The Ensemble Kalman Filter's ensemble update is equivalent to the Matheron update in Gaussian process regression, connecting decades of data assimilation with modern GP sampling techniques.
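The shared update can be written in a few lines of NumPy: each ensemble member (or GP prior sample) is shifted by a gain times its own perturbed innovation, with covariances taken empirically from the ensemble, which is exactly the EnKF reading of the Matheron update. The sketch below assumes a linear observation operator and Gaussian observation noise; it is an illustration, not the paper's code.

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Perturbed-observation EnKF step, i.e. an empirical Matheron update.

    X: (n_ens, d) prior ensemble, y: (m,) observation, H: (m, d) linear obs operator,
    R: (m, m) observation noise covariance.
    """
    n = X.shape[0]
    Y = X @ H.T                                       # predicted observations per member
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    C_xy = Xc.T @ Yc / (n - 1)                        # empirical cross-covariance
    C_yy = Yc.T @ Yc / (n - 1)
    K = C_xy @ np.linalg.inv(C_yy + R)                # Kalman-type gain
    E = rng.multivariate_normal(np.zeros(len(y)), R, size=n)   # per-member obs perturbations
    return X + (y + E - Y) @ K.T                      # member-wise Matheron shift

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # prior ensemble / GP prior draws
H = np.eye(2, 5)                                      # observe the first two coordinates
posterior = enkf_update(X, np.array([1.0, -0.5]), H, 0.1 * np.eye(2), rng)
```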
Authors:Mohannad Takrouri, Nicolás M. Cuadrado, Martin Takáč
Abstract:
Machine learning (ML) is increasingly vital for smart-grid research, yet restricted access to realistic, diverse data - often due to privacy concerns - slows progress and fuels doubts within the energy sector about adopting ML-based strategies. We propose integrating Large Language Models (LLMs) in energy modeling to generate realistic, culturally sensitive, and behavior-specific data for household energy usage across diverse geographies. In this study, we employ and compare five different LLMs to systematically produce family structures, weather patterns, and daily consumption profiles for households in six distinct countries. A four-stage methodology synthesizes contextual daily data, including culturally nuanced activities, realistic weather ranges, HVAC operations, and distinct `energy signatures' that capture unique consumption footprints. Additionally, we explore an alternative strategy where external weather datasets can be directly integrated, bypassing intermediate weather modeling stages while ensuring physically consistent data inputs. The resulting dataset provides insights into how cultural, climatic, and behavioral factors converge to shape carbon emissions, offering a cost-effective avenue for scenario-based energy optimization. This approach underscores how prompt engineering, combined with knowledge distillation, can advance sustainable energy research and climate mitigation efforts. Source code is available at https://github.com/Singularity-AI-Lab/LLM-Energy-Knowledge-Distillation .
中文摘要:本研究提出利用大型语言模型生成真实且具有文化敏感性的家庭能源数据,解决了智能电网研究中数据稀缺的问题,并能够对文化、气候和行为因素共同影响的碳排放进行成本效益分析。
English Summary: This study introduces a method using Large Language Models to generate realistic and culturally sensitive household energy data, addressing data scarcity in smart-grid research and enabling cost-effective analysis of carbon emissions influenced by cultural, climatic, and behavioral factors.
Authors:Hao Zeng, Kangdao Liu, Bingyi Jing, Hongxin Wei
Abstract:
Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional holdout set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, making it ambiguous in practical applications. In this work, we empirically find that the tuning bias - the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe the scaling law of the tuning bias: this bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof for the scaling law of the tuning bias by deriving its upper bound. In the end, we discuss how to reduce the tuning bias, guided by the theories we developed.
中文: 本研究证明,在共形预测中因使用同一数据集进行参数调优和校准而产生的调优偏差对于简单参数调优可忽略不计,且该偏差遵循参数复杂度增加而增大、校准集规模扩大而减小的缩放规律,并通过理论分析和提出的缓解策略加以验证。
English: This study demonstrates that the tuning bias in conformal prediction, arising from using the same dataset for parameter tuning and calibration, is minimal for simple parameter tuning and follows a scaling law where bias increases with parameter complexity but decreases with calibration set size, supported by theoretical analysis and proposed mitigation strategies.
Authors:Seng Pei Liew, Takuya Kato, Sho Takase
Abstract:
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches of mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, of which the scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training dataset that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch trainings within budget constraints.
中文: 本研究探索了将大型语言模型升级为专家混合架构的扩展规律,通过实证发现性能随数据集规模和模型配置扩展而提升,但识别出密集与升级数据集间的限制性交互作用,最终为预算内高效升级提供了策略指导。
English: This study explores the scaling behavior of upcycling large language models into mixture-of-experts architectures, revealing empirical laws that show performance gains from scaling dataset size and model configuration but identify a limiting interaction between dense and upcycled datasets, ultimately providing guidance for cost-effective upcycling strategies.
Authors:Lu Yi, Jie Peng, Yanping Zheng, Fengran Mo, Zhewei Wei, Yuhang Ye, Yue Zixuan, Zengfeng Huang
Abstract:
Future link prediction is a fundamental challenge in various real-world dynamic systems. To address this, numerous temporal graph neural networks (temporal GNNs) and benchmark datasets have been developed. However, these datasets often feature excessive repeated edges and lack complex sequential dynamics, a key characteristic inherent in many real-world applications such as recommender systems and ``Who-To-Follow'' on social networks. This oversight has led existing methods to inadvertently downplay the importance of learning sequential dynamics, focusing primarily on predicting repeated edges.
In this study, we demonstrate that existing methods, such as GraphMixer and DyGFormer, are inherently incapable of learning simple sequential dynamics, such as ``a user who has followed OpenAI and Anthropic is more likely to follow AI at Meta next.'' Motivated by this issue, we introduce the Temporal Graph Benchmark with Sequential Dynamics (TGB-Seq), a new benchmark carefully curated to minimize repeated edges, challenging models to learn sequential dynamics and generalize to unseen edges. TGB-Seq comprises large real-world datasets spanning diverse domains, including e-commerce interactions, movie ratings, business reviews, social networks, citation networks and web link networks. Benchmarking experiments reveal that current methods usually suffer significant performance degradation and incur substantial training costs on TGB-Seq, posing new challenges and opportunities for future research. TGB-Seq datasets, leaderboards, and example codes are available at https://tgb-seq.github.io/.
Authors:Yang Li, Jinpei Guo, Runzhong Wang, Hongyuan Zha, Junchi Yan
Abstract:
Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process requiring denoising across multiple noise levels incurs substantial overhead. We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which, for a given instance, minimizes the difference among samples originating from varying generative trajectories and time steps relative to the optimal solution. The proposed model enables fast single-step solution generation while retaining the option of multi-step sampling to trade for sampling quality, which offers a more effective and efficient alternative backbone for neural solvers. In addition, within the training-to-testing (T2T) framework, to bridge the gap between training on historical instances and solving new instances, we introduce a novel consistency-based gradient search scheme during the test stage, enabling more effective exploration of the solution space learned during training. It is achieved by updating the latent solution probabilities under objective gradient guidance during the alternation of noise injection and denoising steps. We refer to this model as Fast T2T. Extensive experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency, even outperforming LKH given limited time budgets. Notably, Fast T2T with merely one-step generation and one-step gradient search can mostly outperform the SOTA diffusion-based counterparts that require hundreds of steps, while achieving tens of times speedup.
Chinese: 提出的Fast T2T模型通过学习从噪声水平到最优解的直接映射,实现了组合优化的高效单步求解,同时结合基于一致性的梯度搜索来提升解的质量和速度,在旅行商问题和最大独立集任务上展现出优越性能。
English: The proposed Fast T2T model enables efficient single-step solution generation for combinatorial optimization by learning direct mappings from noise levels to optimal solutions, while incorporating a consistency-based gradient search to enhance solution quality and speed.
Authors:Yuan Tian, Wenqi Zhou, Michele Viscione, Hao Dong, David Kammer, Olga Fink
Abstract:
Symbolic Regression (SR) holds great potential for uncovering underlying mathematical and physical relationships from observed data. However, the vast combinatorial space of possible expressions poses significant challenges for both online search methods and pre-trained transformer models. Additionally, current state-of-the-art approaches typically do not consider the integration of domain experts' prior knowledge and do not support iterative interactions with the model during the equation discovery process. To address these challenges, we propose the Symbolic Q-network (Sym-Q), an advanced interactive framework for large-scale symbolic regression. Unlike previous large-scale transformer-based SR approaches, Sym-Q leverages reinforcement learning without relying on a transformer-based decoder. This formulation allows the agent to learn through offline reinforcement learning using any type of tree encoder, enabling more efficient training and inference. Furthermore, we propose a co-design mechanism, where the reinforcement learning-based Sym-Q facilitates effective interaction with domain experts at any stage of the equation discovery process. Users can dynamically modify generated nodes of the expression, collaborating with the agent to tailor the mathematical expression to best fit the problem and align with the assumed physical laws, particularly when there is prior partial knowledge of the expected behavior. Our experiments demonstrate that the pre-trained Sym-Q surpasses existing SR algorithms on the challenging SSDNC benchmark. Moreover, we experimentally show on real-world cases that its performance can be further enhanced by the interactive co-design mechanism, with Sym-Q achieving greater performance gains than other state-of-the-art models. Our reproducible code is available at https://github.com/EPFL-IMOS/Sym-Q.
中文摘要:符号Q网络(Sym-Q)是一种基于强化学习的交互式框架,通过离线强化学习和树编码器实现高效训练推理,并允许领域专家在方程发现过程中动态修改表达式节点,在基准测试和实际案例中均超越现有方法。
English Summary: The Symbolic Q-network (Sym-Q) is a reinforcement learning-based interactive framework that overcomes limitations of traditional symbolic regression methods by enabling efficient training, inference, and dynamic collaboration with domain experts to refine mathematical expressions.
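A minimal sketch of the Q-value-driven expression building and expert override described above; the tree encoder is replaced by a random state vector and the action vocabulary is a toy assumption, so this only illustrates the interaction pattern, not the released Sym-Q code.

```python
# Illustrative sketch of Q-value-based expression building with an expert
# override hook, in the spirit of Sym-Q. Encoder, vocabulary, and interface
# are simplified assumptions.
import torch
import torch.nn as nn

OPERATORS = ["add", "mul", "sin", "x", "const"]    # toy action vocabulary

class QNet(nn.Module):
    def __init__(self, state_dim=32, n_actions=len(OPERATORS)):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                               nn.Linear(64, n_actions))

    def forward(self, state):
        return self.q(state)                        # Q-value per candidate node

def next_node(qnet, state, expert_choice=None):
    """Pick the next expression-tree node; a domain expert may override it."""
    if expert_choice is not None:                   # interactive co-design step
        return expert_choice
    with torch.no_grad():
        return OPERATORS[int(qnet(state).argmax())]

if __name__ == "__main__":
    qnet = QNet()
    state = torch.randn(32)                         # stand-in for a tree encoding
    print(next_node(qnet, state))                   # agent's choice
    print(next_node(qnet, state, expert_choice="sin"))  # expert override
```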
Authors:Baoyao Yang, Junxiang Chen, Wanyun Li, Wenbin Yao, Yang Zhou
Abstract:
Video-text retrieval has been stuck in the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders an effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, as the text representation space is hardly expanded with insufficient semantic enrichment. Instead, this paper turns to enhancing visual presentation, bridging video expression closer to textual representation via caption generation and thereby facilitating video-text matching. While multimodal large language models (mLLM) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the generated captions. Therefore, this paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, further exploring the utilization potential of caption augmentation. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo. Our code is publicly available at https://github.com/CaryXiang/ECA4VTR.
Chinese: 本文通过自动字幕增强方法和专业化字幕选择机制改进视觉表达以弥合视频与文本间的信息鸿沟,无需大量数据收集即在多个基准测试中取得领先性能。
English: This paper addresses the video-text retrieval mismatch by enhancing visual presentation through an automatic caption enhancement method and expertized caption selection, achieving state-of-the-art results on benchmarks without heavy data collection.
Authors:Jiaqing Zhang, Mingjia Yin, Hao Wang, Yawen Li, Yuyang Ye, Xingyu Lou, Junping Du, Enhong Chen
Abstract:
In the era of data-centric AI, the focus of recommender systems has shifted from model-centric innovations to data-centric approaches. The success of modern AI models is built on large-scale datasets, but this also results in significant training costs. Dataset distillation has emerged as a key solution, condensing large datasets to accelerate model training while preserving model performance. However, condensing discrete and sequentially correlated user-item interactions, particularly with extensive item sets, presents considerable challenges. This paper introduces TD3, a novel Tucker Decomposition based Dataset Distillation method within a meta-learning framework, designed for sequential recommendation. TD3 distills a fully expressive synthetic sequence summary from original data. To efficiently reduce computational complexity and extract refined latent patterns, Tucker decomposition decouples the summary into four factors: synthetic user latent factor, temporal dynamics latent factor, shared item latent factor, and a relation core that models their interconnections. Additionally, a surrogate objective in bi-level optimization is proposed to align feature spaces extracted from models trained on both original data and synthetic sequence summary beyond the naïve performance matching approach. In the inner-loop, an augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the outer-loop. To accelerate the optimization process and address long dependencies, RaT-BPTT is employed for bi-level optimization. Experiments and analyses on multiple public datasets have confirmed the superiority and cross-architecture generalizability of the proposed designs. Codes are released at https://github.com/USTC-StarTeam/TD3.
中文摘要:本文提出TD3,一种基于Tucker分解的数据集蒸馏方法,通过元学习和双层优化将用户-物品交互高效压缩为合成序列摘要,在保持模型性能的同时显著提升训练效率。
English Summary: This paper introduces TD3, a Tucker decomposition-based dataset distillation method for sequential recommendation that efficiently condenses user-item interactions into synthetic sequence summaries while preserving model performance through meta-learning and bi-level optimization.
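The Tucker factorization of the synthetic sequence summary can be illustrated with a short reconstruction sketch; the dimensions, ranks, and random factors below are assumptions, and only the recombination step (and the resulting parameter saving) mirrors the description above.

```python
# A minimal Tucker-reconstruction sketch: four factors recombined into a dense
# user x time x item tensor. Factor shapes and ranks are illustrative assumptions.
import numpy as np

n_users, n_steps, n_items = 16, 20, 100          # synthetic summary dimensions
r_u, r_t, r_i = 4, 3, 8                          # Tucker ranks

user_factor = np.random.randn(n_users, r_u)      # synthetic user latent factor
time_factor = np.random.randn(n_steps, r_t)      # temporal dynamics latent factor
item_factor = np.random.randn(n_items, r_i)      # shared item latent factor
core = np.random.randn(r_u, r_t, r_i)            # relation core

# S[u, t, v] = sum_{a,b,c} core[a,b,c] * user[u,a] * time[t,b] * item[v,c]
summary = np.einsum("abc,ua,tb,vc->utv", core, user_factor, time_factor, item_factor)
print(summary.shape)                             # (16, 20, 100)

# The distilled summary stores only the factors, not the dense tensor:
dense_params = n_users * n_steps * n_items
factor_params = user_factor.size + time_factor.size + item_factor.size + core.size
print(f"compression ratio: {dense_params / factor_params:.1f}x")
```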
Authors:Shuanghao Bai, Wanqi Zhou, Pengxiang Ding, Wei Zhao, Donglin Wang, Badong Chen
Abstract:
Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate significant performance improvements with IB, underscoring the importance of reducing input data redundancy and highlighting its practical value for more practical applications. Project Page: https://baishuanghao.github.io/BC-IB.github.io.
中文摘要:本研究采用信息论视角,通过互信息和信息瓶颈原理减少行为克隆中潜在表征的冗余,在机器人操作基准测试中实现了显著性能提升。
English Summary: This study introduces an information-theoretic approach using mutual information and the Information Bottleneck principle to reduce redundancy in latent representations for Behavior Cloning, achieving significant performance improvements on robotics benchmarks.
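A hedged sketch of adding an Information Bottleneck term to a behavior-cloning objective: observations are compressed into a stochastic latent whose KL divergence to a standard normal is penalized alongside the imitation loss. The layer sizes, Gaussian latent, and beta value are assumptions rather than the BC-IB implementation.

```python
# Sketch of a variational Information Bottleneck added to behavior cloning:
# compress observations into a stochastic latent z, penalize its KL to N(0, I),
# and predict actions from z.
import torch
import torch.nn as nn

class IBPolicy(nn.Module):
    def __init__(self, obs_dim=64, z_dim=16, act_dim=7):
        super().__init__()
        self.enc = nn.Linear(obs_dim, 2 * z_dim)     # outputs mean and log-variance
        self.dec = nn.Linear(z_dim, act_dim)

    def forward(self, obs):
        mu, logvar = self.enc(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.dec(z), mu, logvar

def bc_ib_loss(model, obs, expert_actions, beta=1e-3):
    pred, mu, logvar = model(obs)
    bc = ((pred - expert_actions) ** 2).mean()                  # imitation term
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()  # compression term
    return bc + beta * kl

if __name__ == "__main__":
    model = IBPolicy()
    obs, acts = torch.randn(32, 64), torch.randn(32, 7)
    print(bc_ib_loss(model, obs, acts).item())
```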
Authors:Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han
Abstract:
Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering systemwide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL. Our code is available at https://github.com/sunwoolee0504/WALL.
中文: Wolfpack对抗攻击框架通过针对关键智能体破坏多智能体协作,而WALL框架则通过强化系统协作来训练具备抵御此类攻击的鲁棒策略。
English: The Wolfpack Adversarial Attack framework disrupts multi-agent cooperation by targeting key agents, while the WALL framework trains robust policies to defend against such attacks through enhanced collaboration.
Authors:Jeongmo Kim, Yisak Park, Minung Kim, Seungyul Han
Abstract:
Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.
Chinese: 提出的任务感知虚拟训练(TAVT)算法通过基于度量的表示学习和状态正则化技术,精确捕捉任务特征,显著提升了在多种环境中对分布外任务的泛化能力。
English: The proposed Task-Aware Virtual Training (TAVT) algorithm enhances meta reinforcement learning by accurately capturing task characteristics through metric-based representation learning and state regularization, significantly improving generalization to out-of-distribution tasks in various environments.
Authors:Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, Sujay Sanghavi
Abstract:
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a sample weighting scheme for the fine-tuning data solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8\%$ drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4\%$ more accuracy on the pre-training datasets. Our code is publicly available at https://github.com/sanyalsunny111/FLOW_finetuning .
中文: 本文提出一种基于预训练模型损失的样本加权方法,通过侧重易学样本限制模型偏离,在适应新任务的同时有效缓解灾难性遗忘问题。
English: This paper introduces a sample weighting method that mitigates catastrophic forgetting during fine-tuning by prioritizing easy samples based on the pre-trained model's loss, preserving original capabilities while adapting to new tasks.
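The weighting rule can be sketched in a few lines: per-sample losses from the frozen pre-trained model are mapped to normalized weights so that easy (low-loss) samples dominate the fine-tuning objective. The exponential form and temperature below are assumptions, not necessarily the paper's exact scheme.

```python
# Minimal sketch of loss-based sample weighting for fine-tuning: samples the
# pre-trained model already finds easy (low loss) receive larger weights,
# limiting drift from the pre-trained model.
import torch

def sample_weights(pretrained_losses: torch.Tensor, temperature: float = 1.0):
    """Map per-sample pre-trained losses to normalized fine-tuning weights."""
    w = torch.exp(-pretrained_losses / temperature)   # low loss -> high weight
    return w / w.sum()

def weighted_finetune_loss(per_sample_losses, weights):
    return (weights * per_sample_losses).sum()

if __name__ == "__main__":
    pre_losses = torch.tensor([0.2, 0.5, 3.0, 0.1])   # from the frozen pre-trained model
    ft_losses = torch.tensor([1.0, 0.8, 2.5, 0.6])    # current fine-tuning losses
    w = sample_weights(pre_losses)
    print(w)                                          # the hard sample (loss 3.0) is downweighted
    print(weighted_finetune_loss(ft_losses, w).item())
```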
Authors:Calvin Yeung, Kenjiro Ide, Taiga Someya, Keisuke Fujii
Abstract:
Sports analytics has become both more professional and sophisticated, driven by the growing availability of detailed performance data. This progress enables applications such as match outcome prediction, player scouting, and tactical analysis. In soccer, the effective utilization of event and tracking data is fundamental for capturing and analyzing the dynamics of the game. However, there are two primary challenges: the limited availability of event data, primarily restricted to top-tier teams and leagues, and the scarcity and high cost of tracking data, which complicates its integration with event data for comprehensive analysis. Here we propose OpenSTARLab, an open-source framework designed to democratize spatio-temporal agent data analysis in sports by addressing these key challenges. OpenSTARLab includes the Pre-processing Package that standardizes event and tracking data through Unified and Integrated Event Data and State-Action-Reward formats, the Event Modeling Package that implements deep learning-based event prediction, alongside the RLearn Package for reinforcement learning tasks. These technical components facilitate the handling of diverse data sources and support advanced analytical tasks, thereby enhancing the overall functionality and usability of the framework. To assess OpenSTARLab's effectiveness, we conducted several experimental evaluations. These demonstrate the superior performance of the specific event prediction model in terms of action and time prediction accuracies and maintained its robust event simulation performance. Furthermore, reinforcement learning experiments reveal a trade-off between action accuracy and temporal difference loss and show comprehensive visualization. Overall, OpenSTARLab serves as a robust platform for researchers and practitioners, enhancing innovation and collaboration in the field of soccer data analytics.
中文: 体育分析在数据驱动应用中不断进步,但面临数据可获取性和整合的挑战;OpenSTARLab作为开源框架,通过标准化和解析时空数据,凭借强大的事件预测和强化学习功能,提升了足球数据分析的效能。
English: Sports analytics is advancing with data-driven applications, yet faces challenges in data accessibility and integration, which OpenSTARLab addresses as an open-source framework to standardize and analyze spatio-temporal data, enhancing soccer analytics through robust event prediction and reinforcement learning capabilities.
Authors:Patrick Yin, Tyler Westenbroek, Simran Bagaria, Kevin Huang, Ching-an Cheng, Andrey Kobolov, Abhishek Gupta
Abstract:
Robot learning requires a considerable amount of high-quality data to realize the promise of generalization. However, large data sets are costly to collect in the real world. Physics simulators can cheaply generate vast data sets with broad coverage over states, actions, and environments. However, physics engines are fundamentally misspecified approximations to reality. This makes direct zero-shot transfer from simulation to reality challenging, especially in tasks where precise and force-sensitive manipulation is necessary. Thus, fine-tuning these policies with small real-world data sets is an appealing pathway for scaling robot learning. However, current reinforcement learning fine-tuning frameworks leverage general, unstructured exploration strategies which are too inefficient to make real-world adaptation practical. This paper introduces the Simulation-Guided Fine-tuning (SGFT) framework, which demonstrates how to extract structural priors from physics simulators to substantially accelerate real-world adaptation. Specifically, our approach uses a value function learned in simulation to guide real-world exploration. We demonstrate this approach across five real-world dexterous manipulation tasks where zero-shot sim-to-real transfer fails. We further demonstrate our framework substantially outperforms baseline fine-tuning methods, requiring up to an order of magnitude fewer real-world samples and succeeding at difficult tasks where prior approaches fail entirely. Last but not least, we provide theoretical justification for this new paradigm which underpins how SGFT can rapidly learn high-performance policies in the face of large sim-to-real dynamics gaps. Project webpage: https://weirdlabuw.github.io/sgft/
Authors:Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
Abstract:
Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code have been publicly available at https://github.com/bowang-lab/MedRAX
Chinese: MedRAX是首个将先进胸片分析工具与多模态大语言模型整合为一体的多功能AI代理,无需额外训练即可在复杂医疗查询中实现最优性能。
English: MedRAX is the first versatile AI agent that integrates advanced chest X-ray analysis tools and multimodal large language models into a unified framework, achieving state-of-the-art performance in complex medical queries without requiring additional training.
Authors:Mayuka Jayawardhana, Renbo, Samuel Dooley, Valeriia Cherepanova, Andrew Gordon Wilson, Frank Hutter, Colin White, Tom Goldstein, Micah Goldblum
Abstract:
Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at http://github.com/MayukaJ/LLM-Boost .
中文: 大语言模型和TabPFN在小规模表格数据上表现出色,而梯度提升决策树在大规模数据上更优,因此作者提出LLM-Boost和PFN-Boost融合方法,结合两者优势在不同规模数据集上均实现最优性能。
English: Large language models and TabPFN excel on small tabular datasets but are outperformed by gradient-boosted decision trees on larger ones, so the authors propose LLM-Boost and PFN-Boost fusion methods that combine their strengths to achieve state-of-the-art performance across various dataset sizes.
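One fusion pattern consistent with the abstract is to treat the transformer's (LLM or TabPFN) predicted log-odds as a base score and let boosted trees fit the residual signal; the sketch below implements that pattern with plain logistic-loss boosting. The prior construction and hyperparameters are toy assumptions, and the paper's actual fusion may differ in detail.

```python
# A hedged sketch of fusing a prior model's scores with boosted trees, in the
# spirit of LLM-Boost / PFN-Boost: boosting is initialized from the prior
# model's log-odds and trees fit the remaining signal.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_residual_booster(X, y, prior_logits, n_rounds=50, lr=0.1):
    """Gradient boosting on logistic loss, initialized from prior_logits."""
    f = prior_logits.copy()                     # start from the prior model's score
    trees = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-f))
        residual = y - p                        # negative gradient of the log loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        trees.append(tree)
        f += lr * tree.predict(X)
    return trees

def predict_logits(trees, X, prior_logits, lr=0.1):
    f = prior_logits.copy()
    for tree in trees:
        f += lr * tree.predict(X)
    return f

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    # Stand-in for LLM/TabPFN log-odds (in practice, produced by the transformer):
    prior = 1.6 * (X[:, 0] > 0).astype(float) - 0.8
    trees = fit_residual_booster(X, y, prior)
    acc = ((predict_logits(trees, X, prior) > 0) == (y > 0.5)).mean()
    print(f"train accuracy: {acc:.2f}")
```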
Authors:Yu-An Huang, Yao Hu, Yue-Chao Li, Xiyue Cao, Xinyuan Li, Kay Chen Tan, Zhu-Hong You, Zhi-An Huang
Abstract:
Functional MRI (fMRI) and single-cell transcriptomics are pivotal in Alzheimer's disease (AD) research, each providing unique insights into neural function and molecular mechanisms. However, integrating these complementary modalities remains largely unexplored. Here, we introduce scBIT, a novel method for enhancing AD prediction by combining fMRI with single-nucleus RNA (snRNA). scBIT leverages snRNA as an auxiliary modality, significantly improving fMRI-based prediction models and providing comprehensive interpretability. It employs a sampling strategy to segment snRNA data into cell-type-specific gene networks and utilizes a self-explainable graph neural network to extract critical subgraphs. Additionally, we use demographic and genetic similarities to pair snRNA and fMRI data across individuals, enabling robust cross-modal learning. Extensive experiments validate scBIT's effectiveness in revealing intricate brain region-gene associations and enhancing diagnostic prediction accuracy. By advancing brain imaging transcriptomics to the single-cell level, scBIT sheds new light on biomarker discovery in AD research. Experimental results show that incorporating snRNA data into the scBIT model significantly boosts accuracy, improving binary classification by 3.39% and five-class classification by 26.59%. The codes were implemented in Python and have been released on GitHub (https://github.com/77YQ77/scBIT) and Zenodo (https://zenodo.org/records/11599030) with detailed instructions.
中文: scBIT方法通过整合功能性磁共振成像与单核RNA数据,显著提高了阿尔茨海默病的预测准确性,并为大脑区域与基因关联提供了可解释的洞见。
English: The scBIT method integrates fMRI with single-nucleus RNA data to significantly enhance Alzheimer's disease prediction accuracy and provide interpretable insights into brain region-gene associations.
Authors:Yu-An Huang, Yue-Chao Li, Hai-Ru You, Jie Pan, Xiyue Cao, Xinyuan Li, Zhi-An Huang, Zhu-Hong You
Abstract:
The exploration of cellular heterogeneity within the tumor microenvironment (TME) via single-cell RNA sequencing (scRNA-seq) is essential for understanding cancer progression and response to therapy. Current scRNA-seq approaches, however, lack spatial context and rely on incomplete datasets of ligand-receptor interactions (LRIs), limiting accurate cell type annotation and cell-cell communication (CCC) inference. This study addresses these challenges using a novel graph neural network (GNN) model that enhances cell type prediction and cell interaction analysis. Our study utilized a dataset consisting of 49,020 cells from 19 patients across three cancer types: Leukemia, Breast Invasive Carcinoma, and Colorectal Cancer. The proposed scGSL model demonstrated robust performance, achieving an average accuracy of 84.83%, precision of 86.23%, recall of 81.51%, and an F1 score of 80.92% across all datasets. These metrics represent a significant enhancement over existing methods, which typically exhibit lower performance metrics. Additionally, by reviewing existing literature on gene interactions within the TME, the scGSL model proves to robustly identify biologically meaningful gene interactions in an unsupervised manner, validated by significant expression differences in key gene pairs across various cancers. The source code and data used in this paper can be found in https://github.com/LiYuechao1998/scGSL.
中文: 本研究提出的scGSL模型采用图神经网络,通过单细胞RNA测序数据改进了肿瘤微环境中的细胞类型注释和细胞间通讯分析,在多种癌症类型中实现了高准确率和稳健性能。
English: This study introduces the scGSL model, a graph neural network that improves cell type annotation and cell-cell communication analysis in the tumor microenvironment using single-cell RNA sequencing data, achieving high accuracy and robust performance across multiple cancer types.
Authors:Philipp Hoellmer, Thomas Egg, Maya M. Martirossyan, Eric Fuemmeler, Zeren Shui, Amit Gupta, Pawan Prakash, Adrian Roitberg, Mingjie Liu, George Karypis, Mark Transtrum, Richard G. Hennig, Ellad B. Tadmor, Stefano Martiniani
Abstract:
The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMatG), a unifying framework for the generative design and discovery of inorganic crystalline materials. OMatG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMatG's performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions, and 'de novo' generation (DNG) aimed at discovering stable, novel, and unique structures. In our ground-up implementation of OMatG, we refine and extend both CSP and DNG metrics compared to previous works. OMatG establishes a new state of the art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science. The OMatG code is available at https://github.com/FERMat-ML/OMatG.
中文摘要:OMatG是一种创新的生成框架,通过结合随机插值和等变图表示来推动材料发现,在预测和生成稳定无机晶体方面设立了新标杆。
English Summary: OMatG is a novel generative framework that combines stochastic interpolants with equivariant graph representations to advance materials discovery, setting a new benchmark in predicting and generating stable inorganic crystals.
Authors:Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Abstract:
Image AutoRegressive generation has emerged as a new powerful paradigm with image autoregressive models (IARs) matching state-of-the-art diffusion models (DMs) in image quality (FID: 1.48 vs. 1.58) while allowing for a higher generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to the ones of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a True Positive Rate at False Positive Rate = 1% of 86.38% vs. 6.38% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results suggest a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are empirically significantly more vulnerable to privacy attacks compared to DMs that achieve similar performance. We release the code at https://github.com/sprintml/privacy_attacks_against_iars for reproducibility.
中文: 图像自回归模型在图像质量和生成速度上媲美扩散模型,但隐私风险更高,新型成员推理攻击成功率显著且能提取大量训练数据,揭示了其严重的隐私泄露问题。
English: Image autoregressive models (IARs) match diffusion models in image quality with faster generation but are significantly more vulnerable to privacy attacks, as demonstrated by a novel membership inference attack achieving high success rates and data extraction.
Authors:Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade
Abstract:
Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M-parameter language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available on the repository: https://github.com/DepenM/Simplified-AdEMAMix/.
中文摘要:本文将新型深度学习优化器与随机梯度下降理论加速相联系,验证了AdEMAMix的优越性能,并提出简化版AdEMAMix,在保持性能的同时精简了动量项设计。
English Summary: This paper connects recent deep learning optimizers with theoretical SGD acceleration, demonstrating AdEMAMix's superior performance and introducing Simplified-AdEMAMix which maintains performance while simplifying momentum terms.
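The decoupling idea referenced above, weighting the accumulated momentum and the current gradient independently rather than tying them through a single EMA coefficient, can be sketched as a small optimizer step. This illustrates the concept only; it is not the released Simplified-AdEMAMix update rule.

```python
# A hedged sketch of "decoupled momentum": the momentum buffer and the current
# gradient enter the update with independently chosen weights (beta, alpha).
import torch

def decoupled_momentum_step(params, grads, bufs, lr=1e-2, beta=0.9, alpha=2.0):
    """params, grads, bufs are matching lists of tensors; bufs holds momentum."""
    for p, g, m in zip(params, grads, bufs):
        m.mul_(beta).add_(g)               # momentum accumulation
        p.add_(alpha * g + m, alpha=-lr)   # current gradient weighted separately via alpha

if __name__ == "__main__":
    w = torch.randn(10, requires_grad=True)
    buf = [torch.zeros_like(w)]
    target = torch.ones(10)
    for _ in range(100):
        loss = ((w - target) ** 2).sum()
        loss.backward()
        with torch.no_grad():
            decoupled_momentum_step([w], [w.grad], buf)
        w.grad.zero_()
    print(loss.item())                     # decreases toward zero on this toy quadratic
```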
Authors:Alexander Kolesov, Manukhov Stepan, Vladimir V. Palyulin, Alexander Korotin
Abstract:
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. Then we learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments. Our code is available at https://github.com/justkolesov/FieldMatching
中文摘要:静电匹配场(EFM)是一种受电容器物理启发的新型分布转移方法,通过神经网络学习静电场,使样本沿电场线移动,从而可靠地实现分布间的相互映射。
English Summary: Electrostatic Field Matching (EFM) is a novel distribution transfer method inspired by capacitor physics, using neural networks to learn electrostatic fields that provably map source and target distributions by moving samples along field lines.
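The transfer step can be sketched directly: starting from source samples, take small Euler steps along a learned field until the samples reach the other "plate". The two-layer field network, normalization, and step schedule below are toy assumptions.

```python
# Minimal sketch of moving samples along a learned field, as in the transfer
# step described above. The field network here is untrained and only
# illustrates the integration loop.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))  # learned field E(x)

def transfer(x0, n_steps=100, step_size=0.05):
    """Move samples along field lines from the source plate toward the target plate."""
    x = x0.clone()
    for _ in range(n_steps):
        with torch.no_grad():
            direction = field(x)
            direction = direction / (direction.norm(dim=-1, keepdim=True) + 1e-8)
        x = x + step_size * direction      # follow the field line
    return x

if __name__ == "__main__":
    source_samples = torch.randn(256, 2) * 0.1 - torch.tensor([2.0, 0.0])  # "source plate"
    print(transfer(source_samples).mean(dim=0))  # with a trained field, this lands on the target
```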
Authors:Josua Faller, Jörg Martin
Abstract:
Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we mathematically derive the optimal subspace model to a Bayesian inference scenario based on the Laplace approximation. We demonstrate empirically that, in the optimal case, often a fraction of parameters less than 1% is sufficient to obtain a reliable estimate of the full Laplace approximation. Since the optimal solution is derived, we can evaluate all other subspace models against a baseline. In addition, we give an approximation of our method that is applicable to larger problem settings, in which the optimal solution is not computable, and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare different subspace models even if the exact Laplace approximation is unknown.
中文: 本研究通过拉普拉斯近似数学推导出贝叶斯神经网络推理的最优子空间模型,证明仅需不到1%参数即可可靠估计全模型不确定性,且性能优于现有方法。
English: This study mathematically derives the optimal subspace model for Bayesian neural network inference using Laplace approximation, demonstrating that under 1% of parameters can reliably estimate full-model uncertainty while outperforming existing methods.
Authors:Giovanni Birolo, Ivan Rossi, Flavio Sartori, Cesare Rollo, Tiziana Sanavia, Piero Fariselli
Abstract:
Survival analysis, a foundational tool for modeling time-to-event data, has seen growing integration with machine learning (ML) approaches to handle the complexities of censored data and time-varying risks. Despite these advances, leveraging state-of-the-art survival models remains a challenge due to the fragmented nature of existing implementations, which lack standardized interfaces and require extensive preprocessing. We introduce SurvHive, a Python-based framework designed to unify survival analysis methods within a coherent and extensible interface modeled on scikit-learn. SurvHive integrates classical statistical models with cutting-edge deep learning approaches, including transformer-based architectures and parametric survival models. Using a consistent API, SurvHive simplifies model training, evaluation, and optimization, significantly reducing the barrier to entry for ML practitioners exploring survival analysis. The package includes enhanced support for hyper-parameter tuning, time-dependent risk evaluation metrics, and cross-validation strategies tailored to censored data. With its extensibility and focus on usability, SurvHive provides a bridge between survival analysis and the broader ML community, facilitating advancements in time-to-event modeling across domains. The SurvHive code and documentation are available freely at https://github.com/compbiomed-unito/survhive.
中文: SurvHive是一个基于Python的框架,通过类scikit-learn的统一接口整合了传统统计方法与深度学习技术,显著降低了生存分析的应用门槛,并提供专门处理删失数据的优化工具。
English: SurvHive is a Python framework that unifies classical and modern survival analysis methods with a scikit-learn-inspired interface, simplifying model development and evaluation while providing specialized tools for handling censored data.
Authors:Dexiong Chen, Markus Krimmel, Karsten Borgwardt
Abstract:
We introduce AutoGraph, a scalable autoregressive model for attributed graph generation using decoder-only transformers. By flattening graphs into random sequences of tokens through a reversible process, AutoGraph enables modeling graphs as sequences without relying on additional node features that are expensive to compute, in contrast to diffusion-based approaches. This results in sampling complexity and sequence lengths that scale optimally linearly with the number of edges, making it scalable and efficient for large, sparse graphs. A key success factor of AutoGraph is that its sequence prefixes represent induced subgraphs, creating a direct link to sub-sentences in language modeling. Empirically, AutoGraph achieves state-of-the-art performance on synthetic and molecular benchmarks, with up to 100x faster generation and 3x faster training than leading diffusion models. It also supports substructure-conditioned generation without fine-tuning and shows promising transferability, bridging language modeling and graph generation to lay the groundwork for graph foundation models. Our code is available at https://github.com/BorgwardtLab/AutoGraph.
中文: AutoGraph是一种可扩展的自回归模型,通过将图表示为令牌序列,利用仅解码器变换器高效生成属性图,在合成和分子基准测试中达到最先进性能,训练和生成速度远超扩散模型。
English: AutoGraph is a scalable autoregressive model that uses decoder-only transformers to generate attributed graphs efficiently by representing them as token sequences, achieving state-of-the-art performance with significantly faster training and generation than diffusion models.
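The reversible flattening idea can be illustrated with a toy tokenizer: nodes are visited in a random order and, for each node, the indices of its already-seen neighbors are emitted followed by a separator, giving a sequence whose length is linear in the number of edges. The real AutoGraph tokenization is richer; this sketch only demonstrates reversibility.

```python
# Toy reversible graph flattening: random node order, back-edge indices, and a
# separator token per node. Not the AutoGraph tokenizer, just the core idea.
import random

SEP = -1  # separator token

def flatten(adj, seed=0):
    """adj: dict node -> set of neighbors. Returns (node_order, token_sequence)."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    seen, tokens = [], []
    for v in order:
        tokens += sorted(i for i, u in enumerate(seen) if u in adj[v])  # back-edges
        tokens.append(SEP)
        seen.append(v)
    return order, tokens

def unflatten(order, tokens):
    """Exactly reconstructs the edge set from the token sequence."""
    adj = {v: set() for v in order}
    idx = 0
    for v in order:
        while tokens[idx] != SEP:
            u = order[tokens[idx]]
            adj[v].add(u); adj[u].add(v)
            idx += 1
        idx += 1
    return adj

if __name__ == "__main__":
    g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    order, toks = flatten(g)
    assert unflatten(order, toks) == g     # the flattening is lossless
    print(order, toks)                     # sequence length is |E| + |V| tokens
```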
Authors:Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
Abstract:
The application of deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. However, existing methods rarely consider safety requirements crucial in real-world applications. To address this limitation, we propose Safe Diffusion Models for PDE Control (SafeDiffCon), which introduce the uncertainty quantile as model uncertainty quantification to achieve optimal control under safety constraints through both post-training and inference phases. Firstly, our approach post-trains a pre-trained diffusion model to generate control sequences that better satisfy safety constraints while achieving improved control objectives via a reweighted diffusion loss, which incorporates the uncertainty quantile estimated using conformal prediction. Secondly, during inference, the diffusion model dynamically adjusts both its generation process and parameters through iterative guidance and fine-tuning, conditioned on control targets while simultaneously integrating the estimated uncertainty quantile. We evaluate SafeDiffCon on three control tasks: 1D Burgers' equation, 2D incompressible fluid, and controlled nuclear fusion problem. Results demonstrate that SafeDiffCon is the only method that satisfies all safety constraints, whereas other classical and deep learning baselines fail. Furthermore, while adhering to safety constraints, SafeDiffCon achieves the best control performance. The code can be found at https://github.com/AI4Science-WestlakeU/safediffcon.
中文摘要:提出的SafeDiffCon框架在深度学习求解偏微分方程控制问题中引入基于不确定度分位数的安全约束,通过训练后优化和推理调整,在多种控制任务中实现安全约束下的最优控制性能。
English Summary: The proposed SafeDiffCon framework introduces uncertainty quantile-based safety constraints in deep learning for PDE control, achieving optimal performance while ensuring safety through post-training and inference adjustments across various control tasks.
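The conformal step behind the uncertainty quantile can be sketched in isolation: compute the level-(1 - alpha) quantile of calibration nonconformity scores with the usual finite-sample correction. How SafeDiffCon feeds this quantile into its reweighted diffusion loss and inference-time guidance is not reproduced here.

```python
# Minimal conformal-prediction sketch: estimate an uncertainty quantile from
# calibration residuals with the standard finite-sample correction.
import numpy as np

def conformal_quantile(calibration_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Return the level-(1 - alpha) conformal quantile of nonconformity scores."""
    n = len(calibration_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    return float(np.quantile(calibration_scores, level, method="higher"))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    residuals = np.abs(rng.normal(size=200))     # |predicted - true| on a calibration split
    q = conformal_quantile(residuals, alpha=0.1)
    print(f"90% uncertainty quantile: {q:.3f}")  # used to tighten the safety constraint
```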
Authors:Ruiqi Feng, Chenglei Yu, Wenhao Deng, Peiyan Hu, Tailin Wu
Abstract:
Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where generation under energy guidance (abbreviated as guidance in the following) is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://github.com/AI4Science-WestlakeU/flow_guidance.
English Summary: This paper introduces the first comprehensive framework for general guidance in flow matching, deriving various guidance techniques including training-free exact guidance, novel training-based methods, and approximate approaches, with theoretical analysis and experimental validation across multiple domains.
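The simplest member of the guidance family discussed above, classical gradient guidance, can be sketched by adding an energy gradient to the learned velocity inside the sampling loop. The toy velocity network, energy function, and Euler integrator below are assumptions; the paper's more general guidance rules differ.

```python
# Sketch of gradient guidance for a flow-matching sampler: at each Euler step,
# shift the learned velocity by the (negative) energy gradient.
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))  # v(x, t), untrained here

def energy(x):
    """Toy energy: low near the point (1, 1)."""
    return ((x - torch.tensor([1.0, 1.0])) ** 2).sum(dim=-1)

def guided_sample(n=128, n_steps=50, guidance_scale=0.5):
    x = torch.randn(n, 2)                               # samples from the base distribution
    for i in range(n_steps):
        t = torch.full((n, 1), i / n_steps)
        with torch.no_grad():
            v = velocity(torch.cat([x, t], dim=-1))     # learned velocity field
        x = x.detach().requires_grad_(True)
        g = torch.autograd.grad(energy(x).sum(), x)[0]  # energy gradient at the current x
        x = (x + (v - guidance_scale * g) / n_steps).detach()
    return x

if __name__ == "__main__":
    print(guided_sample().mean(dim=0))  # guidance pulls the samples toward low-energy regions
```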
Authors:Alan Oursland
Abstract:
Neural networks may naturally favor distance-based representations, where smaller activations indicate closer proximity to learned prototypes. This contrasts with intensity-based approaches, which rely on activation magnitudes. To test this hypothesis, we conducted experiments with six MNIST architectural variants constrained to learn either distance or intensity representations. Our results reveal that the underlying representation affects model performance. We develop a novel geometric framework that explains these findings and introduce OffsetL2, a new architecture based on Mahalanobis distance equations, to further validate this framework. This work highlights the importance of considering distance-based learning in neural network design.
中文摘要:神经网络天然倾向于基于距离的表征而非强度表征,通过MNIST架构变体的实验验证了这一点,并提出了几何框架和OffsetL2新架构来支持这一发现,强调了距离学习在神经网络设计中的重要性。
English Summary: Neural networks inherently favor distance-based representations over intensity-based ones, as demonstrated by experiments with MNIST variants, leading to the development of a geometric framework and the OffsetL2 architecture to validate this approach.
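A distance-based output layer in the spirit described above can be sketched as learned class prototypes with per-dimension scales, where the negated diagonal-Mahalanobis distance acts as the logit. The exact OffsetL2 parameterization may differ; this only illustrates a distance-based (rather than intensity-based) readout.

```python
# Sketch of a distance-based classification head: each class is a learned
# prototype with learned per-dimension scales; smaller distance -> larger logit.
import torch
import torch.nn as nn

class DistanceHead(nn.Module):
    def __init__(self, feat_dim=128, n_classes=10):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.log_scales = nn.Parameter(torch.zeros(n_classes, feat_dim))

    def forward(self, feats):                        # feats: (batch, feat_dim)
        diff = feats.unsqueeze(1) - self.prototypes  # (batch, n_classes, feat_dim)
        d2 = ((diff * torch.exp(-self.log_scales)) ** 2).sum(-1)  # squared diagonal Mahalanobis
        return -d2                                   # smaller distance -> larger logit

if __name__ == "__main__":
    head = DistanceHead()
    logits = head(torch.randn(4, 128))
    print(logits.argmax(dim=-1))                     # class of the nearest prototype
```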
Authors:Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
Abstract:
An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence due to task-, agent-, or domain-specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by LLM and the prior domain knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation in the context of cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM. Project website§: https://sssshivvvv.github.io/adaptbot/
Authors:Mokshagna Sai Teja Karanam, Krithika Iyer, Sarang Joshi, Shireen Elhabian
Abstract:
Spatial transformations that capture population-level morphological statistics are critical for medical image analysis. Commonly used smoothness regularizers for image registration fail to integrate population statistics, leading to anatomically inconsistent transformations. Inverse consistency regularizers promote geometric consistency but lack population morphometrics integration. Methods that constrain deformations to a low-dimensional manifold address this. However, they prioritize reconstruction over interpretability and neglect diffeomorphic properties, such as group composition and inverse consistency. We introduce MORPH-LER, a Log-Euclidean regularization framework for population-aware unsupervised image registration. MORPH-LER learns population morphometrics from spatial transformations to guide and regularize registration networks, ensuring anatomically plausible deformations. It features a bottleneck autoencoder that computes the principal logarithm of deformation fields via iterative square-root predictions. It creates a linearized latent space that respects diffeomorphic properties and enforces inverse consistency. By integrating a registration network with a diffeomorphic autoencoder, MORPH-LER produces smooth, meaningful deformation fields. The framework offers two main contributions: (1) a data-driven regularization strategy that incorporates population-level anatomical statistics to enhance transformation validity and (2) a linearized latent space that enables compact and interpretable deformation fields for efficient population morphometrics analysis. We validate MORPH-LER across two families of deep learning-based registration networks, demonstrating its ability to produce anatomically accurate, computationally efficient, and statistically meaningful transformations on the OASIS-1 brain imaging dataset. https://github.com/iyerkrithika21/MORPH_LER
中文摘要:MORPH-LER是一种新颖的对数欧几里得正则化框架,通过引入保持微分同胚特性的线性化潜空间,将群体形态统计融入无监督图像配准,确保产生解剖学合理的形变场。
English Summary: MORPH-LER is a novel Log-Euclidean regularization framework that integrates population morphometrics into unsupervised image registration, ensuring anatomically plausible deformations through a diffeomorphic autoencoder with a linearized latent space.
Authors:Hanlin Wu, Yuxuan Song, Jingjing Gong, Ziyao Cao, Yawen Ouyang, Jianbing Zhang, Hao Zhou, Wei-Ying Ma, Jingjing Liu
Abstract:
Generative modeling of crystal data distribution is an important yet challenging task due to the unique periodic physical symmetry of crystals. Diffusion-based methods have shown early promise in modeling crystal distribution. More recently, Bayesian Flow Networks were introduced to aggregate noisy latent variables, resulting in a variance-reduced parameter space that has been shown to be advantageous for modeling Euclidean data distributions with structural constraints (Song et al., 2023). Inspired by this, we seek to unlock its potential for modeling variables located in non-Euclidean manifolds, e.g., those within crystal structures, by overcoming challenging theoretical issues. We introduce CrysBFN, a novel crystal generation method by proposing a periodic Bayesian flow, which essentially differs from the original Gaussian-based BFN by exhibiting non-monotonic entropy dynamics. To successfully realize the concept of periodic Bayesian flow, CrysBFN integrates a new entropy conditioning mechanism and empirically demonstrates its significance compared to time-conditioning. Extensive experiments over both crystal ab initio generation and crystal structure prediction tasks demonstrate the superiority of CrysBFN, which consistently achieves new state-of-the-art on all benchmarks. Surprisingly, we found that CrysBFN enjoys a significant improvement in sampling efficiency, e.g., ~100x speedup (10 vs. 2000 network forward passes) compared with previous diffusion-based methods on the MP-20 dataset. Code is available at https://github.com/wu-han-lin/CrysBFN.
中文: CrysBFN提出了一种新颖的周期性贝叶斯流方法,通过解决非欧几里得流形建模的理论难题,在晶体生成任务中实现了最优性能,并显著提升了采样效率。
English: CrysBFN introduces a novel periodic Bayesian flow for crystal generation, overcoming theoretical challenges to model non-Euclidean manifolds and achieving state-of-the-art performance with significantly improved sampling efficiency.
Authors:Haohan Zou, Jie Feng, Hao Zhao, Yuanyuan Shi
Abstract:
Despite advances in learning-based methods, finding valid Lyapunov functions for nonlinear dynamical systems remains challenging. Current neural network approaches face two main issues: challenges in scalable verification and limited interpretability. To address these, we propose an end-to-end framework using transformers to construct analytical Lyapunov functions (local), which simplifies formal verification, enhances interpretability, and provides valuable insights for control engineers. Our framework consists of a transformer-based trainer that generates candidate Lyapunov functions and a falsifier that verifies candidate expressions and refines the model via risk-seeking policy gradient. Unlike Alfarano et al. (2024), which utilizes pre-training and seeks global Lyapunov functions for low-dimensional systems, our model is trained from scratch via reinforcement learning (RL) and succeeds in finding local Lyapunov functions for high-dimensional and non-polynomial systems. Given the analytical nature of the candidates, we employ efficient optimization methods for falsification during training and formal verification tools for the final verification. We demonstrate the efficiency of our approach on a range of nonlinear dynamical systems with up to ten dimensions and show that it can discover Lyapunov functions not previously identified in the control literature. Full implementation is available on GitHub: https://github.com/JieFeng-cse/Analytical-Lyapunov-Function-Discovery
中文: 本文提出了一种基于Transformer的端到端框架,通过强化学习和反证机制为高维非线性系统生成解析李雅普诺夫函数,实现了可扩展的验证并提升了可解释性。
English: This paper introduces an end-to-end transformer-based framework that generates analytical Lyapunov functions for high-dimensional nonlinear systems, enabling scalable verification and improved interpretability through reinforcement learning and falsification mechanisms.
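The falsifier's role can be sketched with sampling-based checks: given a candidate analytical V(x) and dynamics f(x), search for points where V fails to be positive or its Lie derivative fails to be non-positive. The damped-pendulum dynamics and energy-like candidate below are assumptions; the actual pipeline uses optimization-based falsification during training and formal verification tools afterwards.

```python
# Minimal sampling-based falsifier for a candidate analytical Lyapunov function.
import numpy as np

def f(x):                                   # toy dynamics: damped pendulum
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x2, -np.sin(x1) - 0.5 * x2], axis=-1)

def V(x):                                   # candidate Lyapunov function (analytical)
    x1, x2 = x[..., 0], x[..., 1]
    return (1 - np.cos(x1)) + 0.5 * x2**2

def grad_V(x):
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([np.sin(x1), x2], axis=-1)

def falsify(n_samples=10_000, radius=1.0, seed=0):
    """Return sampled counterexamples to the local Lyapunov conditions, if any."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-radius, radius, size=(n_samples, 2))
    x = x[np.linalg.norm(x, axis=1) > 1e-3]          # exclude the equilibrium
    v_dot = np.sum(grad_V(x) * f(x), axis=-1)        # Lie derivative of V along f
    bad = (V(x) <= 0) | (v_dot > 0)
    return x[bad]

if __name__ == "__main__":
    cex = falsify()
    print(f"{len(cex)} counterexamples found among sampled points")
```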
Authors:Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Abstract:
Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut which significantly reduces the costs of the reward estimation and improves the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications. Our data and code are available at https://github.com/aiming-lab/CITER.
中文摘要:CITER框架通过令牌级路由策略,将非关键令牌分配给小型语言模型以提高效率,关键令牌分配给大型模型保证质量,在降低推理成本的同时保持生成内容的高水准。
English Summary: The CITER framework optimizes inference efficiency by routing non-critical tokens to a small language model for speed and critical tokens to a large model for accuracy, balancing cost and quality through token-level decisions.
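The token-level routing control flow can be sketched with stub models: a router score decides, token by token, whether the cheap or the expensive model produces the next token. The stub generators and random router below are assumptions; CITER's router is learned via policy optimization over both prediction quality and inference cost.

```python
# Toy sketch of token-level routing between a small and a large model.
import random

def small_model(prefix):                 # cheap model: stub next-token generator
    return prefix[-1].swapcase() if prefix else "a"

def large_model(prefix):                 # expensive model: stub next-token generator
    return chr(ord(prefix[-1]) + 1) if prefix else "A"

def router_score(prefix):
    """Stand-in for a learned router; returns the estimated criticality of this step."""
    return random.random()

def generate(n_tokens=10, threshold=0.7):
    prefix, calls = [], {"slm": 0, "llm": 0}
    for _ in range(n_tokens):
        if router_score(prefix) < threshold:     # non-critical token -> small model
            prefix.append(small_model(prefix)); calls["slm"] += 1
        else:                                    # critical token -> large model
            prefix.append(large_model(prefix)); calls["llm"] += 1
    return "".join(prefix), calls

if __name__ == "__main__":
    random.seed(0)
    text, calls = generate()
    print(text, calls)                           # most steps are handled by the cheap model
```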
Authors:Avery Ma, Yangchen Pan, Amir-massoud Farahmand
Abstract:
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Chinese: 本文提出PANDAS混合技术,通过整合积极肯定、负面演示和自适应采样来增强多轮越狱攻击,并在长上下文场景中验证其优于基线方法的性能。
English: The paper introduces PANDAS, a hybrid technique that enhances many-shot jailbreaking by incorporating positive affirmations, negative demonstrations, and adaptive sampling, and demonstrates its superior performance over baseline methods in long-context scenarios.
Authors:Yilin Wu, Ran Tian, Gokul Swamy, Andrea Bajcsy
Abstract:
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment-time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions as they are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation--natural language--and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering. Videos can be found on the project website: https://yilin-wu98.github.io/forewarn/.
Authors:Dazhou Yu, Genpei Zhang, Liang Zhao
Abstract:
Ubiquitous geometric objects can be precisely and efficiently represented as polyhedra. The transformation of a polyhedron into a vector, known as polyhedra representation learning, is crucial for manipulating these shapes with mathematical and statistical tools for tasks like classification, clustering, and generation. Recent years have witnessed significant strides in this domain, yet most efforts focus on the vertex sequence of a polyhedron, neglecting the complex surface modeling crucial in real-world polyhedral objects. This study proposes PolyhedronNet, a general framework tailored for learning representations of 3D polyhedral objects. We propose the concept of the surface-attributed graph to seamlessly model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. To effectively learn the representation of the entire surface-attributed graph, we first propose to break it down into local rigid representations to effectively learn each local region's relative positions against the remaining regions without geometric information loss. Subsequently, we propose PolyhedronGNN to hierarchically aggregate the local rigid representation via intra-face and inter-face geometric message passing modules, to obtain a global representation that minimizes information loss while maintaining rotation and translation invariance. Our experimental evaluations on four distinct datasets, encompassing both classification and retrieval tasks, substantiate PolyhedronNet's efficacy in capturing comprehensive and informative representations of 3D polyhedral objects. Code and data are available at https://github.com/dyu62/3D_polyhedron.
中文: 本研究提出PolyhedronNet框架,通过表面属性图建模多面体结构,并采用几何消息传递机制学习鲁棒的三维表示,在分类和检索任务中展现出卓越性能。
English: This study introduces PolyhedronNet, a framework that models polyhedra as surface-attributed graphs and employs geometric message passing to learn robust 3D representations, demonstrating superior performance in classification and retrieval tasks.
Authors:Haruka Kiyohara, Fan Yao, Sarah Dean
Abstract:
In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics: viewers benefit from increases in provider populations, while providers benefit from increases in viewer population. Despite the importance of such "population effects" on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and recommender policy design on two-sided platforms under the population effects for the first time. Our control- and game-theoretic findings warn against the use of the standard "myopic-greedy" policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term social welfare by taking the population effects into account, and demonstrate its effectiveness in synthetic and real-data experiments. Our experiment code is available at https://github.com/sdean-group/dynamics-two-sided-market.
中文: 本文首次研究双边平台中人口效应下的动态与推荐策略,提出通过优化供应商曝光分配来提升长期社会福利的算法,并在实验中验证其有效性。
English: This paper introduces a novel recommendation policy that accounts for population effects in two-sided platforms, demonstrating through experiments that optimizing provider exposure distribution enhances long-term social welfare.
Authors:Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
Abstract:
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code is open-sourced and is available at https://github.com/svg-project/Sparse-VideoGen
中文:提出的Sparse VideoGen(SVG)框架通过动态识别扩散变换器中的空间与时间注意力模式,在保持生成质量的同时将视频生成速度提升超过2倍。
English: The proposed Sparse VideoGen (SVG) framework dynamically identifies spatial and temporal attention patterns in Diffusion Transformers to reduce computational costs, achieving over 2x speedup in video generation while maintaining quality.
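To make the head-classification idea above concrete, here is a minimal, hypothetical sketch (not the SVG codebase): for one attention head, a few probe queries are evaluated under a frame-local (spatial) mask and a position-aligned (temporal) mask, and the head is labeled by whichever sparse pattern best reproduces its full-attention output.

```python
# Hypothetical sketch of online head profiling: classify a video-attention head as
# "spatial" or "temporal" by checking which sparse mask best matches full attention.
import torch

def head_type(q, k, v, n_frames, tokens_per_frame, probe=64):
    # q, k, v: (seq, dim) for a single attention head
    seq, dim = q.shape
    assert seq == n_frames * tokens_per_frame
    idx = torch.randperm(seq)[:probe]                       # probe a subset of queries
    scores = q[idx] @ k.T / dim ** 0.5                      # (probe, seq)
    full = torch.softmax(scores, dim=-1) @ v                # reference full-attention output

    frame_id = torch.arange(seq) // tokens_per_frame        # frame index of each token
    pos_id = torch.arange(seq) % tokens_per_frame           # spatial position within frame

    spatial_mask = frame_id[idx][:, None] == frame_id[None, :]   # same frame only
    temporal_mask = pos_id[idx][:, None] == pos_id[None, :]      # same position across frames

    def masked_out(mask):
        s = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(s, dim=-1) @ v

    err_spatial = (masked_out(spatial_mask) - full).norm()
    err_temporal = (masked_out(temporal_mask) - full).norm()
    return "spatial" if err_spatial < err_temporal else "temporal"

q, k, v = (torch.randn(8 * 16, 64) for _ in range(3))
print(head_type(q, k, v, n_frames=8, tokens_per_frame=16))
```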
Authors:Dmitry Manning-Coe, Jacopo Gliozzi, Alexander G. Stapleton, Edward Hirst, Giuseppe De Tomasi, Barry Bradlyn, David S. Berman
Abstract:
Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths - grokking versus ordinary training - lead to fundamental differences in the learned models. To do so we compare the features, compressibility, and learning dynamics of models trained via each path in two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training in which there emerges a linear trade-off between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors of 25x the base model, and 5x the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, novel information-geometric measures are introduced which demonstrate that models undergoing grokking follow a straight path in information space.
Chinese: 顿悟式学习与普通训练在损失上相近,但特征编码效率差异显著,其中普通训练特有的压缩机制可实现比顿悟学习高得多的模型压缩率。
English: Grokking and ordinary training yield similar loss but differ in feature encoding efficiency, with steady training exhibiting a unique compressive regime that allows for significantly greater model compressibility than grokking.
Authors:Xianglong Yan, Tianao Zhang, Zhiteng Li, Yulun Zhang
Abstract:
Large language models (LLMs) have achieved remarkable progress in natural language processing, but their high computational and memory costs hinder deployment on resource-constrained devices. Binarization, which reduces model weights to 1 bit, is a promising solution for efficient inference. However, binarized LLMs still exhibit redundancy that can be further compressed. Semi-structured pruning offers a favorable trade-off between model performance and hardware efficiency, but naively combining it with binarization often leads to severe performance degradation. To address this, we propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training compression framework. We propose Stepwise semi-structured Pruning with Binarization Optimization (SPBO) to jointly reduce pruning and binarization error. Additionally, we develop a Coarse-to-Fine Search (CFS) strategy to more effectively select pruning elements. Extensive experiments across multiple LLM families show that PBS$^2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy. The code and models will be available at: https://github.com/XIANGLONGYAN/PBS2P.
Chinese: 提出的渐进式二值化与半结构化剪枝(PBS²P)框架将1比特模型压缩与半结构化剪枝有效结合,在提升计算效率的同时保持模型性能,在困惑度和下游任务准确率上均优于现有先进方法。
English: The proposed Progressive Binarization with Semi-Structured Pruning (PBS²P) framework effectively combines 1-bit model compression with semi-structured pruning to enhance computational efficiency while maintaining performance, outperforming existing methods in both perplexity and task accuracy.
Authors:Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong
Abstract:
Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion model to address specific downstream tasks. Existing guided diffusion models either rely on training the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, offline datasets are often unavailable, and their objective functions are often not differentiable, such as image generation with human preferences, molecular generation for drug discovery, and material design. Thus, we need an $\textbf{online}$ algorithm capable of collecting data during runtime and supporting a $\textbf{black-box}$ objective function. Moreover, the $\textbf{query efficiency}$ of the algorithm is also critical because the objective evaluation of the query is often expensive in real-world scenarios. In this work, we propose a novel and simple algorithm, $\textbf{Fast Direct}$, for query-efficient online black-box target generation. Our Fast Direct builds a pseudo-target on the data manifold to update the noise sequence of the diffusion model with a universal direction, which is promising to perform query-efficient guided generation. Extensive experiments on twelve high-resolution ($\small {1024 \times 1024}$) image target generation tasks and six 3D-molecule target generation tasks show $\textbf{6}\times$ up to $\textbf{10}\times$ query efficiency improvement and $\textbf{11}\times$ up to $\textbf{44}\times$ query efficiency improvement, respectively. Our implementation is publicly available at: https://github.com/kimyong95/guide-stable-diffusion/tree/fast-direct
Chinese: 引导扩散模型在实际应用中常因缺乏离线数据集和不可微目标函数而受限,为此提出的Fast Direct算法通过在线黑盒优化实现了查询高效性,在图像和分子生成任务中分别提升了6-10倍和11-44倍的效率。
English: Guided diffusion models face challenges in real-world applications due to the unavailability of offline datasets and non-differentiable objective functions, prompting the development of Fast Direct, a query-efficient online algorithm that significantly improves efficiency in high-resolution image and 3D-molecule generation tasks.
Authors:Jiaxing Xu, Yongqiang Chen, Xia Dong, Mengcheng Lan, Tiancheng Huang, Qingtian Bian, James Cheng, Yiping Ke
Abstract:
In neuroscience, identifying distinct patterns linked to neurological disorders, such as Alzheimer's and Autism, is critical for early diagnosis and effective intervention. Graph Neural Networks (GNNs) have shown promise in analyzing brain networks, but there are two major challenges in using GNNs: (1) distribution shifts in multi-site brain network data, leading to poor Out-of-Distribution (OOD) generalization, and (2) limited interpretability in identifying key brain regions critical to neurological disorders. Existing graph OOD methods, while effective in other domains, struggle with the unique characteristics of brain networks. To bridge these gaps, we introduce BrainOOD, a novel framework tailored for brain networks that enhances GNNs' OOD generalization and interpretability. The BrainOOD framework consists of a feature selector and a structure extractor, which incorporates various auxiliary losses including an improved Graph Information Bottleneck (GIB) objective to recover causal subgraphs. By aligning structure selection across brain networks and filtering noisy features, BrainOOD offers reliable interpretations of critical brain regions. Our approach outperforms 16 existing methods and improves generalization to OOD subjects by up to 8.5%. Case studies highlight the scientific validity of the patterns extracted, which align with findings in the neuroscience literature. We also propose the first OOD brain network benchmark, which provides a foundation for future research in this field. Our code is available at https://github.com/AngusMonroe/BrainOOD.
Chinese: BrainOOD框架通过解决多站点脑网络数据分布偏移和提升可解释性,增强了图神经网络的泛化能力,在分布外测试中性能提升高达8.5%,并能识别与神经疾病相关的关键脑区。
English: The BrainOOD framework enhances Graph Neural Networks' generalization and interpretability for brain network analysis by addressing distribution shifts and identifying key regions, achieving up to 8.5% improvement in out-of-distribution performance and providing neuroscience-valid insights.
Authors:Srinitish Srinivasan, Omkumar CU
Abstract:
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques which address such limitations fail to account for the contribution of node embeddings to a certain prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) Leveraging single context and multiple targets relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at https://github.com/Deceptrax123/JPEB-GSSL
中文摘要:本文提出了一种新颖的图自监督学习联合嵌入预测框架,该框架摒弃了对比目标和负采样,通过高斯混合模型生成的伪标签增强节点区分度,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces a novel joint embedding predictive framework for graph self-supervised learning that eliminates contrastive objectives and negative sampling while incorporating Gaussian Mixture Model-based pseudo-labels to enhance node discriminability, achieving state-of-the-art performance across benchmarks.
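The GMM-based pseudo-label scoring can be illustrated with a small, assumption-laden snippet (random embeddings stand in for learned node representations; the paper's actual objective combines these scores with the joint-embedding loss):

```python
# Hypothetical illustration of GMM-based pseudo-labels: fit a Gaussian mixture on node
# embeddings and use cluster responsibilities as soft semantic scores.
import numpy as np
from sklearn.mixture import GaussianMixture

emb = np.random.randn(500, 64)               # stand-in for learned node embeddings
gmm = GaussianMixture(n_components=7, covariance_type="diag", random_state=0).fit(emb)
pseudo_labels = gmm.predict(emb)              # hard cluster assignments per node
soft_scores = gmm.predict_proba(emb)          # per-component responsibilities (soft labels)
print(pseudo_labels[:10], soft_scores.shape)
```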
Authors:Ziyang Zheng, Shan Huang, Jianyuan Zhong, Zhengyuan Shi, Guohao Dai, Ningyi Xu, Qiang Xu
Abstract:
Circuit representation learning has become pivotal in electronic design automation, enabling critical tasks such as testability analysis, logic reasoning, power estimation, and SAT solving. However, existing models face significant challenges in scaling to large circuits due to limitations like over-squashing in graph neural networks and the quadratic complexity of transformer-based models. To address these issues, we introduce DeepGate4, a scalable and efficient graph transformer specifically designed for large-scale circuits. DeepGate4 incorporates several key innovations: (1) an update strategy tailored for circuit graphs, which reduces memory complexity to sub-linear and is adaptable to any graph transformer; (2) a GAT-based sparse transformer with global and local structural encodings for AIGs; and (3) an inference acceleration CUDA kernel that fully exploits the unique sparsity patterns of AIGs. Our extensive experiments on the ITC99 and EPFL benchmarks show that DeepGate4 significantly surpasses state-of-the-art methods, achieving 15.5% and 31.1% performance improvements over the next-best models. Furthermore, the Fused-DeepGate4 variant reduces runtime by 35.1% and memory usage by 46.8%, making it highly efficient for large-scale circuit analysis. These results demonstrate the potential of DeepGate4 to handle complex EDA tasks while offering superior scalability and efficiency. Code is available at https://github.com/zyzheng17/DeepGate4-ICLR-25.
中文摘要:DeepGate4是一种可扩展的图变换器,通过创新策略降低内存复杂度并提升效率,解决了现有模型在大规模电路分析中的局限性,在基准测试中实现了显著的性能提升。
English Summary: DeepGate4 is a scalable graph transformer that addresses the limitations of existing models in large-scale circuit analysis by introducing innovative strategies to reduce memory complexity and enhance efficiency, achieving significant performance improvements on benchmark tests.
Authors:Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing applications, yet their widespread use raises concerns regarding inherent biases that may reduce utility or harm for particular social groups. Despite the advancement in addressing LLM bias, existing research has two major limitations. First, existing LLM bias evaluation focuses on the U.S. cultural context, making it challenging to reveal stereotypical biases of LLMs toward other cultures, leading to unfair development and use of LLMs. Second, current bias evaluation often assumes models are familiar with the target social groups. When LLMs encounter words beyond their knowledge boundaries that are unfamiliar in their training data, they produce irrelevant results in the local context due to hallucinations and overconfidence, which are not necessarily indicative of inherent bias. This research addresses these limitations with a Local Integrated Bias Recognition and Assessment Framework (LIBRA) for measuring bias using datasets sourced from local corpora without crowdsourcing. Implementing this framework, we develop a dataset comprising over 360,000 test cases in the New Zealand context. Furthermore, we propose the Enhanced Idealized CAT Score (EiCAT), integrating the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to tackle the challenge of LLMs encountering words beyond knowledge boundaries. Our results show that the BERT family, GPT-2, and Llama-3 models seldom understand local words in different contexts. While Llama-3 exhibits larger bias, it responds better to different cultural contexts. The code and dataset are available at: https://github.com/ipangbo/LIBRA.
中文摘要:本研究提出LIBRA框架,利用本地数据集评估大语言模型的文化偏见,通过增强理想化CAT分数解决现有评估方法的局限,涵盖文化多样性并处理模型对陌生词汇的响应问题。
English Summary: This research introduces the LIBRA framework to evaluate cultural biases in Large Language Models using local datasets, addressing limitations in current bias assessments by incorporating cultural diversity and handling unfamiliar terms through the Enhanced Idealized CAT Score.
Authors:Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, Xiang Zhang
Abstract:
Electroencephalogram (EEG) provides a non-invasive, highly accessible, and cost-effective solution for Alzheimer's Disease (AD) detection. However, existing methods, whether based on manual feature extraction or deep learning, face two major challenges: the lack of large-scale datasets for robust feature learning and evaluation, and poor detection performance due to inter-subject variations. To address these challenges, we curate an EEG-AD corpus containing 813 subjects, which forms the world's largest EEG-AD dataset to the best of our knowledge. Using this unique dataset, we propose LEAD, the first large foundation model for EEG-based AD detection. Our method encompasses an entire pipeline, from data selection and preprocessing to self-supervised contrastive pretraining, fine-tuning, and key setups such as subject-independent evaluation and majority voting for subject-level detection. We pre-train the model on 11 EEG datasets and perform unified fine-tuning on 5 AD datasets. Our self-supervised pre-training design includes sample-level and subject-level contrasting to extract useful general EEG features. Fine-tuning is performed on 5 channel-aligned datasets together. The backbone encoder incorporates temporal and channel embeddings to capture features across both temporal and spatial dimensions. Our method demonstrates outstanding AD detection performance, achieving up to a 9.86% increase in F1 score at the sample-level and up to a 9.31% increase at the subject-level compared to state-of-the-art methods. The results of our model strongly confirm the effectiveness of contrastive pre-training and channel-aligned unified fine-tuning for addressing inter-subject variation. The source code is at https://github.com/DL4mHealth/LEAD.
中文摘要:本研究提出了首个用于脑电图阿尔茨海默病检测的大规模基础模型LEAD,通过构建全球最大的脑电AD数据集并结合对比预训练与通道对齐的统一微调,将样本级和受试者级F1分数分别最高提升9.86%和9.31%。
English Summary: The study introduces LEAD, the first large foundation model for EEG-based Alzheimer's disease detection, which leverages the world's largest EEG-AD dataset together with contrastive pre-training and channel-aligned unified fine-tuning to improve F1 scores by up to 9.86% at the sample level and 9.31% at the subject level over state-of-the-art methods.
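The subject-level majority-voting setup mentioned in the abstract can be sketched in a few lines; this is an assumed reading of that setup, not the LEAD pipeline itself:

```python
# Toy sketch of subject-level majority voting: aggregate per-sample predictions by subject.
from collections import Counter, defaultdict

def subject_level_predictions(sample_preds, subject_ids):
    by_subject = defaultdict(list)
    for pred, sid in zip(sample_preds, subject_ids):
        by_subject[sid].append(pred)
    # each subject's label is the most common sample-level prediction
    return {sid: Counter(preds).most_common(1)[0][0] for sid, preds in by_subject.items()}

print(subject_level_predictions([1, 1, 0, 0, 0, 1], ["s1", "s1", "s1", "s2", "s2", "s2"]))
# {'s1': 1, 's2': 0}
```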
Authors:Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
Abstract:
Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding--where a small proposal model generates tokens sequentially, and a larger target model verifies them in parallel--our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding, typically achieving faster speed. Extensive experiments demonstrate CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.
Chinese: CoS是一种新颖的协作解码框架,通过推测验证和交替模型角色,在不牺牲性能的情况下将解码速度提升最高达2.23倍。
English: CoS is a novel framework that accelerates collaborative decoding by up to 2.23 times without sacrificing performance, using speculative verification and alternating model roles to enhance efficiency.
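As a hedged illustration of the core mechanism, the snippet below runs one speculative accept/reject step against a combined verification distribution (here a simple average of the two models' token distributions, which is an assumption for illustration; CoS may combine them differently):

```python
# Toy speculative step whose verification target is a combined distribution
# q = alpha * p_proposal + (1 - alpha) * p_target; the accepted sample is distributed as q.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_proposal, p_target, alpha=0.5):
    q = alpha * p_proposal + (1 - alpha) * p_target       # combined verification distribution
    x = rng.choice(len(p_proposal), p=p_proposal)         # proposer drafts a token
    if rng.random() < min(1.0, q[x] / p_proposal[x]):     # verifier accepts w.p. min(1, q/p)
        return x
    residual = np.maximum(q - p_proposal, 0)              # otherwise resample from the residual
    return rng.choice(len(q), p=residual / residual.sum())

p_small = np.array([0.7, 0.2, 0.1])
p_large = np.array([0.3, 0.5, 0.2])
print([int(speculative_step(p_small, p_large)) for _ in range(5)])
```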
Authors:Varun Dhanraj, Chris Eliasmith
Abstract:
Large language models (LLMs) continue to face challenges in reliably solving reasoning tasks, particularly those that require precise rule following, as often found in mathematical reasoning. This paper introduces a novel neurosymbolic method that improves LLM reasoning by encoding hidden states into neurosymbolic vectors, enabling problem-solving within a neurosymbolic vector space. The results are decoded and merged with the original hidden state, significantly boosting the model's performance on numerical reasoning tasks. By offloading computation through neurosymbolic representations, this method enhances efficiency, reliability, and interpretability. Experimental results demonstrate an average of 88.6% lower cross-entropy loss and 15.4 times more problems correctly solved on a suite of mathematical reasoning tasks compared to chain-of-thought prompting and supervised fine-tuning (LoRA), without degrading performance on other tasks. We make our code available at: https://github.com/vdhanraj/Neurosymbolic-LLM.
中文: 本文提出一种神经符号方法,将大语言模型的隐藏状态编码为神经符号向量,从而在数学推理任务中显著降低交叉熵损失并提高解题正确率,相比现有方法性能大幅提升。
English: This paper presents a neurosymbolic method that encodes LLM hidden states into neurosymbolic vectors to enhance reasoning efficiency and accuracy, achieving significantly lower cross-entropy loss and higher problem-solving rates in mathematical tasks compared to existing methods.
Authors:Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
Abstract:
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug that (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, and helps LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Moreover, we observe that feedback from Qwen2.5 32B-based UTGen model can enhance debugging with frontier LLMs like GPT-4o by 13.8%. Lastly, we demonstrate that UTGen is a better judge for code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
中文:UTGen教导大语言模型生成能揭示代码错误的单元测试输入及正确预期输出,而UTDebug通过测试时计算和回溯验证来增强该过程,从而显著提升调试效果与代码纠错准确率。
English: UTGen enables LLMs to generate unit test inputs that reveal code errors and predict correct outputs, while UTDebug enhances this process through test-time computation and validation to improve debugging effectiveness and accuracy.
Authors:Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Abstract:
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.
中文: 本研究揭示了在LLM作为评判者的场景中,偏好泄露这一污染问题,即评估者对来自相关模型的合成数据表现出偏向性,表明这是模型开发中普遍存在且难以检测的难题。
English: This study identifies preference leakage as a contamination issue in LLM-as-a-judge scenarios, where evaluators show bias toward synthetically generated data from related models, revealing it as a pervasive and hard-to-detect problem in model development.
Authors:Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Lei Li
Abstract:
Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding on how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at: https://github.com/JingzheShi/NLPCtlScalingAndBounds.
中文: 本研究从内在空间视角提出了一个理论框架,解释上下文长度对语言建模的影响,并通过自然语言和合成数据实验验证,发现训练数据集大小决定了最优上下文长度并设定了扩展界限。
English: This study introduces a theoretical framework from an Intrinsic Space perspective to explain how context length affects Language Modeling, validated through experiments on natural and synthetic data, revealing that training dataset size determines optimal context length and sets scaling bounds.
Authors:Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Abstract:
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
Chinese: PRIME通过隐式过程奖励,仅利用策略推演和结果标签实现在线过程奖励模型更新,在数学和编程任务中无需专门奖励模型训练即取得显著性能提升。
English: PRIME introduces implicit process rewards to enable online updates of process reward models using only policy rollouts and outcome labels, achieving significant improvements in math and coding tasks without dedicated reward model training.
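One common reading of the implicit-reward idea is that token-level process rewards are read off as scaled log-likelihood ratios between the outcome-trained model and a frozen reference; the scale beta and tensor shapes below are illustrative assumptions, not the PRIME code:

```python
# Hedged sketch of implicit process rewards as per-token log-likelihood ratios.
import torch

def implicit_process_rewards(logprobs_rm, logprobs_ref, beta=0.05):
    # logprobs_*: (seq_len,) log-probabilities of the generated tokens under each model
    return beta * (logprobs_rm - logprobs_ref)            # per-token implicit reward

logprobs_rm = torch.tensor([-1.2, -0.4, -2.0, -0.1])      # outcome-trained reward model
logprobs_ref = torch.tensor([-1.5, -0.9, -1.8, -0.6])     # frozen reference model
print(implicit_process_rewards(logprobs_rm, logprobs_ref))
```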
Authors:Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas
Abstract:
Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-$c$ scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: https://github.com/quandao10/sLCT/
中文摘要:本研究通过采用柯西损失、引入扩散损失和最优传输耦合,并配合自适应调度器与无缩放层归一化,有效解决了潜在空间中异常值导致的性能下降问题,成功训练出能进行高质量少步采样的潜在一致性模型,显著缩小了与扩散模型的性能差距。
English Summary: This study improves latent consistency models by addressing performance degradation from impulsive outliers in latent spaces through Cauchy losses, diffusion loss integration, optimal transport coupling, and architectural enhancements, enabling high-quality few-step sampling that narrows the gap with diffusion models.
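The switch from Pseudo-Huber to Cauchy losses is easy to see numerically; the formulas below follow their common definitions (assumed here, not copied from the released code), and the constant c is arbitrary:

```python
# Compare Pseudo-Huber and Cauchy losses on data with one impulsive outlier:
# the Cauchy loss grows only logarithmically, so the outlier dominates far less.
import torch

def pseudo_huber(x, y, c=0.03):
    d2 = ((x - y) ** 2).flatten(1).sum(-1)
    return torch.sqrt(d2 + c ** 2) - c          # roughly linear growth for large errors

def cauchy(x, y, c=0.03):
    d2 = ((x - y) ** 2).flatten(1).sum(-1)
    return torch.log1p(d2 / c ** 2)             # logarithmic growth -> robust to outliers

x, y = torch.randn(4, 8), torch.randn(4, 8)
y[0] += 100.0                                   # one impulsive outlier in the batch
print(pseudo_huber(x, y))
print(cauchy(x, y))
```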
Authors:Grigoriy Ksenofontov, Alexander Korotin
Abstract:
The Schrödinger Bridge (SB) is a powerful framework for solving generative modeling tasks such as unpaired domain translation. Most SB-related research focuses on continuous data space $\mathbb{R}^{D}$ and leaves open theoretical and algorithmic questions about applying SB methods to discrete data, e.g., on finite spaces $\mathbb{S}^{D}$. Notable examples of such sets $\mathbb{S}$ are codebooks of vector-quantized (VQ) representations of modern autoencoders, tokens in texts, categories of atoms in molecules, etc. In this paper, we provide a theoretical and algorithmic foundation for solving SB in discrete spaces using the recently introduced Iterative Markovian Fitting (IMF) procedure. Specifically, we theoretically justify the convergence of discrete-time IMF (D-IMF) to SB in discrete spaces. This enables us to develop a practical computational algorithm for SB, which we call Categorical Schrödinger Bridge Matching (CSBM). We show the performance of CSBM via a series of experiments with synthetic data and VQ representations of images. The code of CSBM is available at https://github.com/gregkseno/csbm.
中文:通过引入迭代马尔可夫拟合程序,薛定谔桥框架被扩展到离散空间,由此开发出的分类薛定谔桥匹配算法在合成数据和图像数据上得到了验证。
English: The Schrödinger Bridge framework is extended to discrete spaces through the Iterative Markovian Fitting procedure, leading to the development of the Categorical Schrödinger Bridge Matching algorithm, which is validated on synthetic and image data.
Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Abstract:
Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01\% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than that of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs. The code is publicly available in our \href{https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples}{GitHub repository}.
中文:CLIP模型极易因微量数据投毒遭受后门攻击,但基于其稀疏局部特征的传统离群点检测器能高效识别恶意样本,实现大规模数据集的快速净化。
English: CLIP models are highly susceptible to backdoor attacks through minimal data poisoning, but these malicious samples can be efficiently detected using local outlier detectors based on their sparse local neighborhoods, enabling rapid cleanup of large datasets.
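A density-based local outlier detector of the kind the abstract refers to can be approximated with scikit-learn's LocalOutlierFactor; the snippet below uses random vectors as stand-ins for CLIP image features and is only a sketch of the detection recipe, not the paper's released detector:

```python
# Sketch of local outlier scoring on (stand-in) CLIP features: samples whose local
# neighborhood is sparse get high outlier scores and are flagged as backdoor suspects.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

feats = np.random.randn(1000, 512)              # stand-in for CLIP image embeddings
feats[:10] += 8.0                               # pretend a few poisoned samples lie in a sparse region

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(feats)
scores = -lof.negative_outlier_factor_          # larger score = sparser local neighborhood
suspects = np.argsort(scores)[-10:]             # flag the most isolated samples for removal
print(sorted(suspects))
```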
Authors:Oussama Zekri, Nicolas Boullé
Abstract:
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
Chinese: 本文提出了评分熵策略优化(SEPO),一种高效且理论可靠的策略梯度算法,用于针对不可微分奖励微调离散扩散模型,并在多个生成任务中验证了其有效性。
English: This paper introduces Score Entropy Policy Optimization (SEPO), an efficient and theoretically grounded policy gradient algorithm for fine-tuning discrete diffusion models with non-differentiable rewards, demonstrating its effectiveness across various generative tasks.
Authors:Shaofeng Yin, Jialong Wu, Siqiao Huang, Xingjian Su, Xu He, Jianye Hao, Mingsheng Long
Abstract:
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models on top of this low-dimensional sensor information. In this work, we explore pre-training world models for heterogeneous environments by addressing key transfer barriers in both data diversity and model flexibility. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. Additionally, we propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context. Pre-training TrajWorld on UniTraj yields substantial gains in transition prediction, achieves a new state-of-the-art for off-policy evaluation, and also delivers superior online performance of model predictive control. To the best of our knowledge, this work, for the first time, demonstrates the transfer benefits of world models across heterogeneous and complex control environments. Code and data are available at https://github.com/thuml/TrajWorld.
中文: 本研究提出包含80个环境中超百万轨迹的统一数据集UniTraj和灵活架构TrajWorld,通过克服数据多样性与模型灵活性障碍,首次实现了异构环境下世界模型的预训练迁移,在预测与控制任务中达到最优性能。
English: This study introduces UniTraj, a unified dataset with over one million trajectories, and TrajWorld, a flexible architecture that overcomes data and model barriers to enable world model pre-training across heterogeneous environments, achieving state-of-the-art performance in prediction and control tasks.
Authors:Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin
Abstract:
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than the teacher model used, depending on the particular setup. We provide the code at https://github.com/ngushchin/IBMD
Chinese Summary: 作者提出了一种基于逆向桥匹配的新型蒸馏技术,可将扩散桥模型的推理速度提升4至100倍,并在多种图像处理任务中保持或超越原始模型的生成质量。
English Summary: The authors introduce a novel distillation technique using inverse bridge matching to significantly accelerate diffusion bridge models' inference by 4x to 100x while maintaining or improving generation quality across various image tasks.
Authors:Nimisha Ghosh, Pratik Dutta, Daniele Santoni
Abstract:
Transcription factors are proteins that regulate the expression of genes by binding to specific genomic regions known as Transcription Factor Binding Sites (TFBSs), typically located in the promoter regions of those genes. Accurate prediction of these binding sites is essential for understanding the complex gene regulatory networks underlying various cellular functions. In this regard, many deep learning models have been developed for such prediction, but there is still scope of improvement. In this work, we have developed a deep learning model which uses pre-trained DNABERT, a Convolutional Neural Network (CNN) module, a Modified Convolutional Block Attention Module (MCBAM), a Multi-Scale Convolutions with Attention (MSCA) module and an output module. The pre-trained DNABERT is used for sequence embedding, thereby capturing the long-term dependencies in the DNA sequences while the CNN, MCBAM and MSCA modules are useful in extracting higher-order local features. TFBS-Finder is trained and tested on 165 ENCODE ChIP-seq datasets. We have also performed ablation studies as well as cross-cell line validations and comparisons with other models. The experimental results show the superiority of the proposed method in predicting TFBSs compared to the existing methodologies. The codes and the relevant datasets are publicly available at https://github.com/NimishaGhosh/TFBS-Finder/.
中文: 本研究提出了TFBS-Finder深度学习模型,它结合DNABERT、CNN和注意力机制来改进转录因子结合位点的预测,并通过全面测试验证了其优于现有方法的性能。
English: This study introduces TFBS-Finder, a deep learning model that integrates DNABERT, CNN, and attention mechanisms to enhance transcription factor binding site prediction, demonstrating superior performance over existing methods through comprehensive testing and validation.
Authors:Ismail Khalfaoui-Hassani, Stefan Kesselheim
Abstract:
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.
中文摘要:本文研究表明,基于正交基(如埃尔米特多项式和傅里叶基)的激活函数能有效用于深度神经网络训练,不仅解决了梯度爆炸/消失问题,还能通过插值逼近经典激活函数,特别适用于微调任务。
English Summary: This article demonstrates that orthonormal basis functions, such as Hermite polynomials and Fourier bases, can effectively serve as activations in deep neural networks, enabling stable training without special mechanisms while providing insights into network structure and approximation capabilities.
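A toy version of a Hermite-basis activation looks like the following; the near-identity initialization and the fixed degree are assumptions for illustration, and the torchortho library's actual interface may differ:

```python
# Hypothetical activation built from the probabilists' Hermite basis with learnable
# coefficients, using the recurrence He_{n+1}(x) = x*He_n(x) - n*He_{n-1}(x).
import torch
import torch.nn as nn

class HermiteActivation(nn.Module):
    def __init__(self, degree=4):
        super().__init__()
        init = torch.zeros(degree + 1)
        init[1] = 1.0                                    # start close to the identity: He_1(x) = x
        self.coeffs = nn.Parameter(init)

    def forward(self, x):
        h = [torch.ones_like(x), x]                      # He_0, He_1
        for n in range(1, self.coeffs.numel() - 1):
            h.append(x * h[-1] - n * h[-2])              # Hermite recurrence
        return sum(c * hn for c, hn in zip(self.coeffs, h))

act = HermiteActivation(degree=4)
print(act(torch.randn(2, 3)))
```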
Authors:Yuanhe Zhang, Fanghui Liu, Yudong Chen
Abstract:
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA, Hu et al. 2022) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately and applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, where the linear convergence (as well as generalization) is built and incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. Besides, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation. Code is available at: https://github.com/YuanheZ/LoRA-One.
中文: 本文通过理论证明LoRA适配器在微调过程中会与梯度特定子空间对齐,并提出理论驱动的LoRA-One算法,该算法在多个基准测试中实现了更快的收敛速度和更优的性能表现。
English: This paper demonstrates that LoRA adapters align with specific gradient subspaces during fine-tuning and introduces LoRA-One, a theory-driven algorithm that achieves faster convergence and better performance across various benchmarks.
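The abstract's prescription, initializing the adapters from the one-step full fine-tuning gradient so that they start in the aligned singular subspaces, can be sketched as below; the rank, scaling, and sign convention are illustrative assumptions rather than the official LoRA-One recipe:

```python
# Hedged sketch of spectral adapter initialization from one full gradient of a weight matrix.
import torch

def lora_one_style_init(grad, rank=8, scale=1e-2):
    # grad: one-step full fine-tuning gradient for a weight W of shape (d_out, d_in)
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    B = -scale * U[:, :rank] * S[:rank].sqrt()      # (d_out, r), descent direction
    A = S[:rank].sqrt()[:, None] * Vh[:rank]        # (r, d_in)
    return A, B                                     # adapter update applied as W + B @ A

G = torch.randn(256, 128)                           # stand-in for a one-step full gradient
A, B = lora_one_style_init(G, rank=8)
print(A.shape, B.shape, (B @ A).shape)
```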
Authors:Anam Zahid, Abdur Rehman Ali, Shaina Raza, Rai Shahnawaz, Faisal Kamiran, Asim Karim
Abstract:
Training data used for developing machine learning classifiers can exhibit biases against specific protected attributes. Such biases typically originate from historical discrimination or certain underlying patterns that disproportionately under-represent minority groups, such as those identified by their gender, religion, or race. In this paper, we propose a novel approach, FairUDT, a fairness-aware Uplift-based Decision Tree for discrimination identification. FairUDT demonstrates how the integration of uplift modeling with decision trees can be adapted to include fair splitting criteria. Additionally, we introduce a modified leaf relabeling approach for removing discrimination. We divide our dataset into favored and deprived groups based on a binary sensitive attribute, with the favored dataset serving as the treatment group and the deprived dataset as the control group. By applying FairUDT and our leaf relabeling approach to preprocess three benchmark datasets, we achieve an acceptable accuracy-discrimination tradeoff. We also show that FairUDT is inherently interpretable and can be utilized in discrimination detection tasks. The code for this project is available at https://github.com/ara-25/FairUDT.
中文摘要:本文提出FairUDT方法,通过将提升建模与公平分割标准相结合,并采用改进的叶节点重标记技术,在保持可接受准确率的同时有效识别和消除机器学习分类器中的歧视问题。
English Summary: This paper introduces FairUDT, a fairness-aware uplift-based decision tree that integrates uplift modeling with fair splitting criteria and a modified leaf relabeling approach to mitigate discrimination in machine learning classifiers while maintaining acceptable accuracy.
Authors:Wen Lai, Alexander Fraser, Ivan Titov
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing (or steering) techniques, which modify the activations of specific model components. Due to their extremely small parameter counts, these methods show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit (2) whether the intervention should be additive, multiplicative, or both and (3) the intervention parameters themselves - the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://github.com/wenlai-lavine/jola.
Chinese: 提出的JoLA方法联合学习需要编辑的Transformer头部、干预类型(加法/乘法)及干预参数,在有限训练数据下,于多项基准测试中持续超越现有方法。
English: The proposed JoLA method jointly learns which Transformer heads to edit, the type of intervention (additive/multiplicative), and the intervention parameters, consistently outperforming existing methods across multiple benchmarks despite limited training data.
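A minimal sketch of a joint additive and multiplicative head-output intervention with a learnable per-head gate is shown below; it is one plausible reading of the abstract, not the JoLA implementation:

```python
# Toy intervention module: per-head gates select which heads to edit, and each edited
# head output is rescaled (multiplicative) and offset (additive).
import torch
import torch.nn as nn

class HeadIntervention(nn.Module):
    def __init__(self, n_heads, head_dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(n_heads))              # which heads to edit
        self.scale = nn.Parameter(torch.ones(n_heads, head_dim))    # multiplicative part
        self.offset = nn.Parameter(torch.zeros(n_heads, head_dim))  # additive part

    def forward(self, head_out):                                    # (batch, seq, n_heads, head_dim)
        g = torch.sigmoid(self.gate)[None, None, :, None]
        edited = head_out * self.scale + self.offset
        return (1 - g) * head_out + g * edited                      # soft selection of edited heads

x = torch.randn(2, 5, 12, 64)
print(HeadIntervention(12, 64)(x).shape)
```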
Authors:Erpai Luo, Xinran Wei, Lin Huang, Yunyang Li, Han Yang, Zaishuo Xia, Zun Wang, Chang Liu, Bin Shao, Jia Zhang
Abstract:
Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost--driven by high-order tensor product (TP) operations--restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network, that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact. Our code can be found at https://github.com/microsoft/SPHNet.
中文摘要:SPHNet通过引入自适应稀疏性和三阶段调度器,在保持哈密顿矩阵预测精度的同时大幅降低计算成本,在基准数据集上实现了高达7倍的加速和最优性能。
English Summary: SPHNet introduces adaptive sparsity and a three-phase scheduler to significantly reduce computational costs while maintaining accuracy in Hamiltonian matrix prediction, achieving up to 7x speedup and state-of-the-art performance on benchmark datasets.
Authors:Daniel Sliwowski, Dongheui Lee
Abstract:
The introduction of robots into everyday scenarios necessitates algorithms capable of monitoring the execution of tasks. In this paper, we propose ConditionNET, an approach for learning the preconditions and effects of actions in a fully data-driven manner. We develop an efficient vision-language model and introduce additional optimization objectives during training to optimize for consistent feature representations. ConditionNET explicitly models the dependencies between actions, preconditions, and effects, leading to improved performance. We evaluate our model on two robotic datasets, one of which we collected for this paper, containing 406 successful and 138 failed teleoperated demonstrations of a Franka Emika Panda robot performing tasks like pouring and cleaning the counter. We show in our experiments that ConditionNET outperforms all baselines on both anomaly detection and phase prediction tasks. Furthermore, we implement an action monitoring system on a real robot to demonstrate the practical applicability of the learned preconditions and effects. Our results highlight the potential of ConditionNET for enhancing the reliability and adaptability of robots in real-world environments. The data is available on the project website: https://dsliwowski1.github.io/ConditionNET_page.
Authors:Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan
Abstract:
The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate service by offering a standard and rigorous evaluation framework. Our source codes are currently available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.
中文: AtmosSci-Bench是一个新颖的基准测试,通过双格式问题系统评估大语言模型在五大核心大气科学领域的表现,为推进LLM在气候服务中的应用提供了关键评估框架。
English: AtmosSci-Bench is a novel benchmark designed to systematically evaluate large language models' performance across five core atmospheric science categories through dual-format questions, providing crucial insights for advancing LLM applications in climate services.
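Template-based MCQ generation with symbolic perturbation can be illustrated with a toy hydrostatic-balance template (the question wording, distractor factors, and value ranges are invented for this sketch and are not taken from AtmosSci-Bench):

```python
# Toy MCQ template: perturb the physical quantities, recompute the answer symbolically,
# and build distractors from simple scalings of the correct value.
import random

def hydrostatic_mcq(rng):
    rho = round(rng.uniform(1.0, 1.3), 2)      # air density (kg m^-3)
    dz = rng.choice([100, 200, 500, 1000])     # layer thickness (m)
    g = 9.81
    answer = rho * g * dz                      # hydrostatic pressure drop dp = rho * g * dz (Pa)
    options = [answer * f for f in (0.5, 2.0, 10.0)] + [answer]
    rng.shuffle(options)
    question = (f"Assuming hydrostatic balance, what is the pressure drop across a {dz} m "
                f"layer with constant density {rho} kg/m^3? (in Pa)")
    return question, options, options.index(answer)

rng = random.Random(0)
print(hydrostatic_mcq(rng))
```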
Authors:Charilaos I. Kanatsoulis, Evelyn Choi, Stephanie Jegelka, Jure Leskovec, Alejandro Ribeiro
Abstract:
Positional encodings (PEs) are essential for effective graph representation learning because they provide position awareness in inherently position-agnostic transformer architectures and increase the expressive capacity of Graph Neural Networks (GNNs). However, designing powerful and efficient PEs for graphs poses significant challenges due to the absence of canonical node ordering and the scale of the graph. In this work, we identify four key properties that graph PEs should satisfy: stability, expressive power, scalability, and genericness. We find that existing eigenvector-based PE methods often fall short of jointly satisfying these criteria. To address this gap, we introduce PEARL, a novel framework of learnable PEs for graphs. Our primary insight is that message-passing GNNs function as nonlinear mappings of eigenvectors, enabling the design of GNN architectures for generating powerful and efficient PEs. A crucial challenge lies in initializing node attributes in a manner that is both expressive and permutation equivariant. We tackle this by initializing GNNs with random node inputs or standard basis vectors, thereby unlocking the expressive power of message-passing operations, while employing statistical pooling functions to maintain permutation equivariance. Our analysis demonstrates that PEARL approximates equivariant functions of eigenvectors with linear complexity, while rigorously establishing its stability and high expressive power. Experimental evaluations show that PEARL outperforms lightweight versions of eigenvector-based PEs and achieves comparable performance to full eigenvector-based PEs, but with one or two orders of magnitude lower complexity. Our code is available at https://github.com/ehejin/Pearl-PE.
中文: PEARL是一种创新的可学习位置编码框架,利用消息传递图神经网络生成高效强大的图位置编码,在保持与特征向量方法相当性能的同时显著降低了计算复杂度。
English: PEARL is a novel learnable positional encoding framework that leverages message-passing GNNs to generate powerful and efficient graph encodings, achieving comparable performance to eigenvector-based methods with significantly lower complexity.
Authors:Guanlin Li, Kangjie Chen, Shangwei Guo, Jie Zhang, Han Qiu, Chao Zhang, Guoyin Wang, Tianwei Zhang, Jiwei Li
Abstract:
Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments can be found in https://github.com/GuanlinLee/llm_instruction_tuning.
Chinese: 在领域特定数据集上微调对齐的大型语言模型会削弱其安全对齐性,研究发现三个关键因素及当前奖励模型在保障安全方面的局限性。
English: Fine-tuning aligned large language models on domain-specific datasets can compromise their safety alignment, revealing three key factors and limitations in current reward models for maintaining safety.
Authors:Jiali Cheng, Hadi Amiri
Abstract:
Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs, which embed the ability to use tools or APIs directly into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to forget learned tools due to security vulnerabilities, privacy regulations, or tool deprecations. However, ``tool unlearning'' has not been investigated in unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs. It implements three key properties to address the above challenges for effective tool unlearning and introduces a new membership inference attack (MIA) model for effective evaluation. Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns randomly selected tools, while preserving the LLM's knowledge on non-deleted tools and maintaining performance on general tasks.
Authors:Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim
Abstract:
While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to reduce latency for long-context inference. FastKV improves processing speed while preserving accuracy by adopting Token-Selective Propagation (TSP). This approach preserves full-context information in early layers of LLMs and selectively propagates only a portion of this information in later layers. This design enables FastKV to minimize redundant computation without sacrificing contextual fidelity. Our experimental results show that FastKV achieves up to 1.97$\times$ and 4.82$\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to baseline without KV cache compression. Moreover, FastKV successfully maintains accuracy within 1\% of the baseline on long-context benchmarks. Our code is available at https://github.com/dongwonjo/FastKV.
中文: FastKV是一种KV缓存压缩方法,通过令牌选择性传播技术,在保持长上下文基准测试准确率接近基线1%以内的同时,显著提升了首个令牌生成时间和系统吞吐量。
English: FastKV is a KV cache compression method that uses Token-Selective Propagation to significantly reduce latency and improve throughput in long-context LLM inference while maintaining accuracy within 1% of baseline performance.
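As a rough illustration of the Token-Selective Propagation idea described above, the sketch below keeps only the most important tokens when passing hidden states to later layers. The importance scores, the keep ratio, and the function name are illustrative assumptions, not FastKV's actual interface.

```python
import torch

def token_selective_propagation(hidden, importance, keep_ratio=0.25):
    """Keep only the highest-scoring tokens for the later layers.

    hidden:     (batch, seq_len, dim) hidden states from an early layer
    importance: (batch, seq_len) per-token scores, e.g. attention mass
                received from recent query positions (an illustrative choice)
    """
    batch, seq_len, dim = hidden.shape
    k = max(1, int(seq_len * keep_ratio))
    # indices of the k most important tokens, restored to original order
    kept = importance.topk(k, dim=-1).indices.sort(dim=-1).values
    gathered = hidden.gather(1, kept.unsqueeze(-1).expand(-1, -1, dim))
    return gathered, kept

# toy usage: 2 sequences of 16 tokens, keep a quarter of them
h = torch.randn(2, 16, 8)
scores = torch.rand(2, 16)
reduced, kept = token_selective_propagation(h, scores)
print(reduced.shape, kept.shape)  # torch.Size([2, 4, 8]) torch.Size([2, 4])
```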
Authors:Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Heng Ji, Denghui Zhang
Abstract:
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs' internal cognitive processes. Inspired by humans' reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several methods. Our method is also advantageous over traditional methods in offering more informative, context-aware refusals, and achieves these benefits while only tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models' capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls. Codes for this work are available at https://github.com/Hanpx20/SafeSwitch.
中文摘要:SafeSwitch框架通过动态监控大语言模型的内部状态来检测有害意图,仅在必要时启动安全机制,仅调整不到6%的参数即可将有害输出减少约80%,同时保持模型实用性。
English Summary: The SafeSwitch framework dynamically monitors large language models' internal states to detect harmful intentions and activates safety mechanisms only when necessary, reducing harmful outputs by 80% while maintaining utility through minimal parameter adjustments.
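A minimal sketch of the prober-plus-safety-head pattern described in the abstract, assuming a linear prober over a pooled hidden state and a hard threshold; the module names, shapes, and threshold are placeholders rather than SafeSwitch's actual design.

```python
import torch
import torch.nn as nn

class SafeSwitchSketch(nn.Module):
    """Illustrative only: a linear prober flags unsafe intent from a hidden
    state, and only then is a (separately tuned) safety head used."""

    def __init__(self, hidden_dim, vocab_size, threshold=0.5):
        super().__init__()
        self.prober = nn.Linear(hidden_dim, 1)        # internal-state monitor
        self.default_head = nn.Linear(hidden_dim, vocab_size)
        self.safety_head = nn.Linear(hidden_dim, vocab_size)
        self.threshold = threshold

    def forward(self, h):                             # h: (batch, hidden_dim)
        p_unsafe = torch.sigmoid(self.prober(h))      # probability of harmful intent
        use_safety = (p_unsafe > self.threshold).float()
        logits = use_safety * self.safety_head(h) + (1 - use_safety) * self.default_head(h)
        return logits, p_unsafe

h = torch.randn(4, 64)
model = SafeSwitchSketch(hidden_dim=64, vocab_size=100)
logits, p = model(h)
print(logits.shape, p.squeeze(-1))
```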
Authors:Siqi Zeng, Yifei He, Weiqiu You, Yifan Hao, Yao-Hung Hubert Tsai, Makoto Yamada, Han Zhao
Abstract:
Task vectors, which are derived from the difference between pre-trained and fine-tuned model weights, enable flexible task adaptation and model merging through arithmetic operations such as addition and negation. However, existing approaches often rely on heuristics with limited theoretical support, leading to performance gaps compared to direct task fine-tuning. Meanwhile, although it is easy to manipulate saved task vectors with arithmetic for different purposes, such compositional flexibility demands high memory usage, especially when dealing with a large number of tasks, limiting scalability. This work addresses these issues with a theoretically grounded framework that explains task vector arithmetic and introduces the task vector bases framework. Building upon the existing task arithmetic literature, our method significantly reduces the memory cost for downstream arithmetic with little effort, while achieving competitive performance and maintaining the compositional advantage, providing a practical solution for large-scale task arithmetic. The code is available at https://github.com/uiuctml/TaskVectorBasis.
中文摘要:任务向量基是一种压缩框架,通过将多个任务向量表示为少量基向量的线性组合,在保持任务算术功能和性能的同时显著降低了存储与计算成本。
English Summary: Task Vector Bases is a compression framework that reduces storage and computation costs by representing multiple task vectors as linear combinations of fewer basis vectors while maintaining task arithmetic functionality and performance.
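The abstract's core objects are easy to prototype: task vectors are weight differences, and a small basis can stand in for many of them. The sketch below compresses T task vectors into K basis vectors with a truncated SVD and performs downstream arithmetic on coefficients; the SVD choice and the sizes are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

# Task vectors: differences between fine-tuned and pre-trained weights
# (flattened here for simplicity). A basis compresses T vectors into K < T
# components; each task is then stored as a small coefficient vector.
rng = np.random.default_rng(0)
d, T, K = 1000, 12, 4
task_vectors = rng.normal(size=(T, d))            # tau_t = theta_t - theta_pre

# build K basis vectors via truncated SVD of the stacked task vectors
U, S, Vt = np.linalg.svd(task_vectors, full_matrices=False)
basis = Vt[:K]                                     # (K, d)
coeffs = task_vectors @ basis.T                    # (T, K) per-task coefficients

# downstream arithmetic (e.g. adding two tasks) now operates on coefficients
merged = (coeffs[0] + coeffs[1]) @ basis           # approximates tau_0 + tau_1
approx_err = np.linalg.norm(merged - (task_vectors[0] + task_vectors[1]))
print(f"memory: {T*d} -> {K*d + T*K} floats, merge error {approx_err:.2f}")
```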
Authors:Minh Ngoc Nguyen, Khai Le-Duc, Tan-Hanh Pham, Trang Nguyen, Quang Minh Luu, Ba Kien Tran, Truong-Son Hy, Viktor Dremin, Sergei Sokolovsky, Edik Rafailov
Abstract:
In this study, we introduce a novel method to predict mental health by building machine learning models for a non-invasive wearable device equipped with Laser Doppler Flowmetry (LDF) and Fluorescence Spectroscopy (FS) sensors. In addition, we present the corresponding dataset for predicting mental health, e.g., depression, anxiety, and stress levels via the DAS-21 questionnaire. To the best of our knowledge, this is the world's largest and most generalized dataset ever collected for both LDF and FS studies. The device captures cutaneous blood microcirculation parameters, and wavelet analysis of the LDF signal extracts key rhythmic oscillations. The dataset, collected from 132 volunteers aged 18-94 from 19 countries, explores relationships between physiological features, demographics, lifestyle habits, and health conditions. We employed a variety of machine learning methods for stress detection, among which LightGBM was identified as the most effective model, achieving a ROC AUC of 0.7168 and a PR AUC of 0.8852. In addition, we incorporated Explainable Artificial Intelligence (XAI) techniques into our analysis to gain deeper insights into the model's predictions. Our results suggest that females, younger individuals, and those with a higher Body Mass Index (BMI) or heart rate have a greater likelihood of experiencing mental health conditions like stress and anxiety. All related code and data are published online: https://github.com/leduckhai/Wearable_LDF-FS.
中文: 本研究通过结合激光多普勒血流仪和荧光光谱传感器的可穿戴设备,开发了预测心理健康的新方法,建立了全球最大相关数据集并确定LightGBM为最佳预测模型,同时利用可解释人工智能揭示了女性、年轻群体及高BMI人群更易出现心理问题。
English: This study presents a novel wearable device using LDF and FS sensors to predict mental health conditions, creating the world's largest dataset and identifying LightGBM as the most effective model with XAI insights revealing demographic risk factors.
Authors:Anuj Singh, Sayak Mukherjee, Ahmad Beirami, Hadi Jamali-Rad
Abstract:
Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving a competitive performance against the state-of-the-art baselines. Our code is available at: https://github.com/anujinho/code.
Chinese: 提出的受控去噪(CoDe)方法通过在推理过程中采用无需梯度的分块采样策略,使扩散模型能够与下游奖励对齐,无需模型微调或可微分指导,同时保持有竞争力的性能。
English: The proposed controlled denoising (CoDe) method enables alignment of diffusion models with downstream rewards through a gradient-free, blockwise sampling approach during inference, eliminating the need for model finetuning or differentiable guidance while maintaining competitive performance.
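Below is a minimal sketch of gradient-free blockwise guidance in the spirit of CoDe: at each block of denoising steps, several stochastic continuations are sampled and the one with the highest reward is kept. The `denoise_block` and `reward_fn` callables, the block count, and the candidate count are illustrative assumptions, not the released implementation.

```python
import torch

def controlled_denoise(x_T, denoise_block, reward_fn, num_blocks=10, n_candidates=4):
    """Gradient-free blockwise guidance sketch (not the authors' exact code).

    denoise_block(x, b) -> a stochastic sample after one block of denoising steps
    reward_fn(x)        -> scalar reward per sample, no gradients required
    """
    x = x_T
    for b in range(num_blocks):
        candidates = torch.stack([denoise_block(x, b) for _ in range(n_candidates)])
        rewards = torch.tensor([reward_fn(c) for c in candidates])
        x = candidates[rewards.argmax()]   # keep the best-rewarded continuation
    return x

# toy usage with placeholder dynamics: denoising shrinks noise, reward prefers small norm
x0 = controlled_denoise(
    torch.randn(8),
    denoise_block=lambda x, b: 0.9 * x + 0.1 * torch.randn_like(x),
    reward_fn=lambda x: -x.norm().item(),
)
print(x0.norm())
```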
Authors:Mauricio Soroco, Jialin Song, Mengzhou Xia, Kye Emond, Weiran Sun, Wuyang Chen
Abstract:
While recent AI-for-math has made strides in pure mathematics, areas of applied mathematics, particularly PDEs, remain underexplored despite their significant real-world applications. We present PDE-Controller, a framework that enables large language models (LLMs) to control systems governed by partial differential equations (PDEs). Our approach enables LLMs to transform informal natural language instructions into formal specifications, and then execute reasoning and planning steps to improve the utility of PDE control. We build a holistic solution comprising datasets (both human-written cases and 2 million synthetic samples), math-reasoning models, and novel evaluation metrics, all of which require significant effort. Our PDE-Controller significantly outperforms prompting the latest open source and GPT models in reasoning, autoformalization, and program synthesis, achieving up to a 62% improvement in utility gain for PDE control. By bridging the gap between language generation and PDE systems, we demonstrate the potential of LLMs in addressing complex scientific and engineering challenges. We release all data, model checkpoints, and code at https://pde-controller.github.io/.
Authors:Moritz Wolter, Lokesh Veeramacheneni, Charles Tapley Hoyt
Abstract:
While experimental reproduction remains a pillar of the scientific method, we observe that the software best practices supporting the reproduction of machine learning (ML) research are often undervalued or overlooked, leading both to poor reproducibility and damage to trust in the ML community. We quantify these concerns by surveying the usage of software best practices in software repositories associated with publications at major ML conferences and journals such as NeurIPS, ICML, ICLR, TMLR, and MLOSS within the last decade. We report the results of this survey, which identify areas where software best practices are lacking and areas with potential for growth in the ML community. Finally, we discuss the implications and present concrete recommendations on how we, as a community, can improve reproducibility in ML research.
中文: 摘要指出,机器学习研究中软件最佳实践的缺失损害了研究的可复现性和社区信任,通过对主要会议和期刊代码库的调查结果,提出了促进社区改进的具体建议。
English: The abstract highlights that inadequate adoption of software best practices in machine learning research undermines reproducibility and trust, as evidenced by a survey of repositories from major ML conferences and journals, and it proposes recommendations for community-wide improvements.
Authors:Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar
Abstract:
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches, even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.
Chinese: 提出的SimPER算法采用无需超参数调优的偏好优化方法,通过逆困惑度对齐语言模型,在多个基准测试中无需参考模型即实现卓越性能。
English: The proposed SimPER algorithm introduces a hyperparameter-free preference optimization method that uses inverse perplexity to align language models, achieving superior performance across multiple benchmarks without requiring costly tuning or reference models.
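The objective described in the abstract can be written in a few lines: the inverse perplexity of a response is the exponentiated average token log-likelihood, which is maximized for chosen responses and minimized for rejected ones. The sketch below is a reading of that description with illustrative input shapes, not the official SimPER code.

```python
import torch

def simper_loss(logp_chosen, logp_rejected, chosen_len, rejected_len):
    """Perplexity-based preference objective (illustrative sketch):
    maximize exp(mean log-likelihood) of the chosen response and minimize it
    for the rejected one. Inputs are summed token log-likelihoods and lengths."""
    inv_ppl_chosen = torch.exp(logp_chosen / chosen_len)
    inv_ppl_rejected = torch.exp(logp_rejected / rejected_len)
    return -(inv_ppl_chosen - inv_ppl_rejected).mean()

# toy usage: a batch of 3 preference pairs
loss = simper_loss(
    logp_chosen=torch.tensor([-40.0, -55.0, -30.0]),
    logp_rejected=torch.tensor([-70.0, -80.0, -65.0]),
    chosen_len=torch.tensor([20.0, 25.0, 15.0]),
    rejected_len=torch.tensor([22.0, 28.0, 18.0]),
)
print(loss)
```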
Authors:Yongqiang Huang, Zerui Shao, Ziyuan Yang, Zexin Lu, Yi Zhang
Abstract:
Mobile and Web-of-Things (WoT) devices at the network edge generate vast amounts of data for machine learning applications, yet privacy concerns hinder centralized model training. Federated Learning (FL) allows clients (devices) to collaboratively train a shared model coordinated by a central server without transferring private data, but inherent statistical heterogeneity among clients presents challenges, often leading to a dilemma between clients' needs for personalized local models and the server's goal of building a generalized global model. Existing FL methods typically prioritize either global generalization or local personalization, resulting in a trade-off between these two objectives and limiting the full potential of diverse client data. To address this challenge, we propose a novel framework that simultaneously enhances global generalization and local personalization by Rethinking Information Representation in the Federated learning process (FedRIR). Specifically, we introduce Masked Client-Specific Learning (MCSL), which isolates and extracts fine-grained client-specific features tailored to each client's unique data characteristics, thereby enhancing personalization. Concurrently, the Information Distillation Module (IDM) refines the global shared features by filtering out redundant client-specific information, resulting in a purer and more robust global representation that enhances generalization. By integrating the refined global features with the isolated client-specific features, we construct enriched representations that effectively capture both global patterns and local nuances, thereby improving the performance of downstream tasks on the client. The code is available at https://github.com/Deep-Imaging-Group/FedRIR.
中文:联邦学习面临全局泛化与本地个性化的权衡,而提出的FedRIR框架通过掩码客户端特定学习隔离个性化特征,并结合信息蒸馏模块优化全局表示,从而同时提升了两方面的性能。
English: Federated Learning faces a trade-off between global generalization and local personalization, but the proposed FedRIR framework simultaneously enhances both by isolating client-specific features through Masked Client-Specific Learning and refining global representations via an Information Distillation Module.
Authors:Xixi Wu, Yifei Shen, Fangzhou Ge, Caihua Shan, Yizhu Jiao, Xiangguo Sun, Hong Cheng
Abstract:
Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes 10 homophilic datasets, 4 heterophilic datasets, 8 LLM-based algorithms, 8 classic baselines, and 3 learning paradigms. Subsequently, we conducted extensive experiments, training and evaluating over 2,700 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size and prompt) that affect performance. Our findings uncover 8 insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field. Codes and datasets are released at \href{https://llmnodebed.github.io/}{\texttt{https://llmnodebed.github.io/}}.
Authors:Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, Mingsheng Long
Abstract:
We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next-patch's distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on continuous-valued time series without discrete tokenization. Conditioned on arbitrary-length time series, our models are pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving more flexibility in representation learning than using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with one trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse via TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which achieve unprecedented model capacity and generalization performance. In addition to excellent scalability, Sundial achieves state-of-the-art results on both point and probabilistic forecasting benchmarks with a just-in-time inference speed, i.e., making zero-shot predictions within a few milliseconds. We believe that Sundial's pioneering generative forecasting capability can improve model reliability in real-world decision-making. Code is available at: https://github.com/thuml/Sundial.
中文: Sundial是一系列原生时间序列基础模型,采用创新的TimeFlow损失函数在连续数据上进行灵活、可扩展的预训练,以快速推理实现了顶尖的预测性能。
English: Sundial is a family of native time series foundation models that utilize a novel TimeFlow Loss for flexible, scalable pre-training on continuous data, achieving state-of-the-art forecasting performance with rapid inference.
Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Abstract:
Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose a new decision criterion, BEEM, in which exit classifiers are treated as experts and their confidence scores are aggregated. The confidence scores are aggregated only if neighbouring experts are consistent in their predictions as the samples pass through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits, aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for image captioning and the GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving speed-ups by a factor of 1.5x to 2.1x. Compared to the final layer, its accuracy is comparable on the harder image captioning task and improves on the easier language tasks. The source code for this work is publicly available at https://github.com/Div290/BEEM1/tree/main.
中文: BEEM提出通过聚合相邻分类器在预测一致时的置信度作为早退新标准,在图像描述和语言任务中实现1.5-2.1倍加速,同时保持或提升模型精度。
English: BEEM introduces a novel early exit criterion that aggregates confidence scores from neighboring classifiers when predictions are consistent, achieving 1.5x-2.1x speed-up while maintaining comparable or improved accuracy across vision and language tasks.
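A compact sketch of the exit rule described above, treating exit classifiers as experts whose confidences accumulate only while their predictions agree; the reset-on-disagreement behavior and the per-exit thresholds are a reading of the abstract, not the released code.

```python
def beem_exit(exit_outputs, thresholds):
    """Illustrative early-exit rule: confidences of successive exit classifiers
    are accumulated only while their predictions agree, and the sample exits
    once the running total passes that exit's threshold.

    exit_outputs: list of (predicted_label, confidence) per intermediate exit
    thresholds:   per-exit thresholds, e.g. derived from exit error rates
    """
    aggregate, prev_label = 0.0, None
    for i, (label, conf) in enumerate(exit_outputs):
        aggregate = aggregate + conf if label == prev_label else conf  # reset on disagreement
        prev_label = label
        if aggregate >= thresholds[i]:
            return i, label                                  # exit early at layer i
    return len(exit_outputs) - 1, exit_outputs[-1][0]        # fall through to last exit

print(beem_exit([(3, 0.5), (3, 0.6), (7, 0.4)], thresholds=[1.2, 1.0, 1.5]))
# -> (1, 3): the second exit fires because 0.5 + 0.6 >= 1.0
```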
Authors:Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li
Abstract:
Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Measuring dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that evaluates dataset contamination by computing the divergence between the kernel similarity matrix of sample embeddings, before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings. Code is released in https://github.com/deeplearning-wisc/kernel-divergence-score.
中文: 提出的核散度评分(KDS)通过比较微调前后样本嵌入的核相似性矩阵,有效评估数据集污染问题,在受控实验中展现出与污染程度近乎完美的相关性,并优于现有基线方法。
English: The proposed Kernel Divergence Score (KDS) effectively measures dataset contamination by comparing kernel similarity matrices of sample embeddings before and after fine-tuning, demonstrating superior correlation with contamination levels and outperforming existing methods in controlled experiments.
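The sketch below captures the mechanics of a kernel divergence score under stated assumptions (an RBF kernel and a Frobenius-norm divergence; the paper's exact kernel and divergence may differ): embed the benchmark samples before and after fine-tuning and compare the two kernel similarity matrices.

```python
import numpy as np

def kernel_divergence_score(emb_before, emb_after, gamma=1.0):
    """Compare kernel similarity matrices of the same samples before and after
    fine-tuning. A small change suggests the samples were already seen
    (contaminated); a large change suggests they were unseen. Illustrative only."""
    def rbf(X):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    K_before, K_after = rbf(emb_before), rbf(emb_after)
    return np.linalg.norm(K_before - K_after) / K_before.shape[0]

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))
print(kernel_divergence_score(E, E + 0.05 * rng.normal(size=E.shape)))  # low: "seen-like"
print(kernel_divergence_score(E, rng.normal(size=E.shape)))             # high: "unseen-like"
```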
Authors:Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focuses on improving sample efficiency. All existing algorithms in online RLHF, whether performing passive or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g., questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024). Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.
Chinese: 本文提出SE-POPO算法,首次实现了样本复杂度与奖励规模的多项式关系,突破了现有在线RLHF方法的指数级增长瓶颈,在理论和实验验证中均展现出更优的样本效率。
English: This paper introduces SE-POPO, a novel online RLHF algorithm that achieves polynomial sample complexity scaling with reward size, solving a key limitation of existing methods and demonstrating superior efficiency in both theoretical analysis and empirical evaluations.
Authors:Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
Abstract:
Emerging research has highlighted that artificial intelligence-based multimodal fusion of digital pathology and transcriptomic features can improve cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction. However, such direct fusion for joint decision is impractical in real clinical settings, where histopathology is still the gold standard for diagnosis and transcriptomic tests are rarely requested, at least in the public healthcare system. With our novel diffusion-based crossmodal generative AI model PathGen, we show that genomic expressions synthesized from digital histopathology jointly predict cancer grading and patient survival risk with high accuracy (state-of-the-art performance), certainty (through conformal coverage guarantee) and interpretability (through distributed attention maps). PathGen code is available for open use by the research community through GitHub at https://github.com/Samiran-Dey/PathGen.
中文摘要:PathGen模型通过基于扩散的生成式人工智能,从数字病理学中合成基因组表达,无需实际转录组测试即可实现对癌症分级和患者生存风险的高精度、可解释预测。
English Summary: The PathGen model uses diffusion-based generative AI to synthesize genomic expressions from digital pathology, enabling accurate and interpretable predictions of cancer grading and patient survival risk without requiring actual transcriptomic tests.
Authors:Mohammad Nazeri, Anuj Pokhrel, Alexandyr Card, Aniket Datar, Garrett Warnell, Xuesu Xiao
Abstract:
Sophisticated learning architectures, e.g., Transformers, present a unique opportunity for robots to understand complex vehicle-terrain kinodynamic interactions for off-road mobility. While internet-scale data are available for Natural Language Processing (NLP) and Computer Vision (CV) tasks to train Transformers, real-world mobility data are difficult to acquire with physical robots navigating off-road terrain. Furthermore, training techniques specifically designed to process text and image data in NLP and CV may not apply to robot mobility. In this paper, we propose VertiFormer, a novel data-efficient multi-task Transformer model trained with only one hour of data to address such challenges of applying Transformer architectures for robot mobility on extremely rugged, vertically challenging, off-road terrain. Specifically, VertiFormer employs a new learnable masked modeling and next token prediction paradigm to predict the next pose, action, and terrain patch to enable a variety of off-road mobility tasks simultaneously, e.g., forward and inverse kinodynamics modeling. The non-autoregressive design mitigates computational bottlenecks and error propagation associated with autoregressive models. VertiFormer's unified modality representation also enhances learning of diverse temporal mappings and state representations, which, combined with multiple objective functions, further improves model generalization. Our experiments offer insights into effectively utilizing Transformers for off-road robot mobility with limited data and demonstrate our efficiently trained Transformer can facilitate multiple off-road mobility tasks onboard a physical mobile robot.
Chinese: VertiFormer是一种数据高效的多任务Transformer模型,仅用一小时数据即可预测机器人姿态、动作和地形区块,以解决越野移动中的数据稀缺和计算瓶颈问题。
English: VertiFormer is a data-efficient multi-task Transformer model that uses only one hour of data to predict robot poses, actions, and terrain patches for off-road mobility tasks, overcoming the limitations of data scarcity and computational bottlenecks.
Authors:Karish Grover, Haiyang Yu, Xiang Song, Qi Zhu, Han Xie, Vassilis N. Ioannidis, Christos Faloutsos
Abstract:
Can integrating spectral and curvature signals unlock new potential in graph representation learning? Non-Euclidean geometries, particularly Riemannian manifolds such as hyperbolic (negative curvature) and spherical (positive curvature), offer powerful inductive biases for embedding complex graph structures like scale-free, hierarchical, and cyclic patterns. Meanwhile, spectral filtering excels at processing signal variations across graphs, making it effective in homophilic and heterophilic settings. Leveraging both can significantly enhance the learned representations. To this end, we propose Spectro-Riemannian Graph Neural Networks (CUSP) - the first graph representation learning paradigm that unifies both CUrvature (geometric) and SPectral insights. CUSP is a mixed-curvature spectral GNN that learns spectral filters to optimize node embeddings in products of constant-curvature manifolds (hyperbolic, spherical, and Euclidean). Specifically, CUSP introduces three novel components: (a) Cusp Laplacian, an extension of the traditional graph Laplacian based on Ollivier-Ricci curvature, designed to capture the curvature signals better; (b) Cusp Filtering, which employs multiple Riemannian graph filters to obtain cues from various bands in the eigenspectrum; and (c) Cusp Pooling, a hierarchical attention mechanism combined with a curvature-based positional encoding to assess the relative importance of differently curved substructures in our graph. Empirical evaluation across eight homophilic and heterophilic datasets demonstrates the superiority of CUSP in node classification and link prediction tasks, with a gain of up to 5.3% over state-of-the-art models. The code is available at: https://github.com/amazon-science/cusp.
中文: 提出的谱黎曼图神经网络(CUSP)融合了几何曲率和谱滤波方法,显著提升了图表示学习能力,在节点分类和链接预测任务中表现出卓越性能。
English: The proposed Spectro-Riemannian Graph Neural Networks (CUSP) unify geometric curvature and spectral filtering to enhance graph representation learning, demonstrating superior performance in node classification and link prediction tasks.
Authors:Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov
Abstract:
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Models (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
中文摘要:潜在动作策略(LAPO)在现实干扰物下表现不佳,但改进的LAOM方法将潜在动作质量提升8倍,且仅需2.5%的真实动作监督即可使下游任务性能提高4.2倍,这挑战了传统训练流程。
English Summary: Latent Action Policies (LAPO) struggle with real-world distractors, but the proposed LAOM modification improves latent action quality by 8x and incorporating minimal ground-truth supervision boosts downstream performance by 4.2x, challenging conventional training pipelines.
Authors:Yu Feng, Yangli-ao Geng, Yifan Zhu, Zongfu Han, Xie Yu, Kaiwen Xue, Haoran Luo, Mengyang Sun, Guangwei Zhang, Meina Song
Abstract:
Federated learning (FL) has gained widespread attention for its privacy-preserving and collaborative learning capabilities. Due to significant statistical heterogeneity, traditional FL struggles to generalize a shared model across diverse data domains. Personalized federated learning addresses this issue by dividing the model into a globally shared part and a locally private part, with the local model correcting representation biases introduced by the global model. Nevertheless, locally converged parameters more accurately capture domain-specific knowledge, and current methods overlook the potential benefits of these parameters. To address these limitations, we propose the PM-MoE architecture. This architecture integrates a mixture of personalized modules with energy-based denoising of personalized modules, enabling each client to select beneficial personalized parameters from other clients. We applied the PM-MoE architecture to nine recent model-split-based personalized federated learning algorithms, achieving performance improvements with minimal additional training. Extensive experiments on six widely adopted datasets and two heterogeneity settings validate the effectiveness of our approach. The source code is available at \url{https://github.com/dannis97500/PM-MOE}.
中文摘要:提出的PM-MoE架构通过让客户端选择性整合其他客户端的有利参数,在多个数据集上以最少额外训练实现了个性化联邦学习性能的提升。
English Summary: The proposed PM-MoE architecture enhances personalized federated learning by enabling clients to selectively integrate beneficial parameters from others, achieving improved performance across multiple datasets with minimal extra training.
Authors:Yurui Li, Yuxuan Chen, Li Zhang, Shijian Li, Gang Pan
Abstract:
The significant role of division of labor (DOL) in promoting cooperation is widely recognized in real-world applications. Many cooperative multi-agent reinforcement learning (MARL) methods have incorporated the concept of DOL to improve cooperation among agents. However, existing testbeds typically use tasks in which DOL is not a necessary feature for achieving optimal policies. Additionally, the full utilization of the DOL concept in MARL methods remains unrealized due to the absence of appropriate tasks. To enhance the generality and applicability of MARL methods in real-world scenarios, it is necessary to develop tasks that demand multi-agent DOL and cooperation. In this paper, we propose a series of tasks designed to meet these requirements, drawing on real-world rules as guidance for their design. We guarantee that DOL and cooperation are necessary conditions for completing the tasks and introduce three factors to expand the diversity of the proposed tasks to cover more realistic situations. We evaluate 10 cooperative MARL methods on the proposed tasks. The results indicate that all baselines perform poorly on these tasks. To further validate the solvability of these tasks, we also propose simplified variants of the proposed tasks. Experimental results show that baselines are able to handle these simplified variants, providing evidence of the solvability of the proposed tasks. The source files are available at https://github.com/Yurui-Li/CTC.
Chinese: 本文提出了需要分工与合作的新型多智能体任务,揭示了现有方法的不足,并通过简化版本验证了任务的可解性。
English: This paper introduces new multi-agent tasks requiring division of labor and cooperation, revealing current methods' limitations while demonstrating solvability through simplified variants.
Authors:Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Ray Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, Jiahao Wu, Qing Li, Hui Xiong, Xiaomeng Huang
Abstract:
Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Code is available at https://github.com/YuanGao-YG/OneForecast.
Chinese: 本文提出OneForecast框架,通过图神经网络构建全球与区域嵌套的天气预测模型,结合动态系统和多尺度结构,利用自适应消息机制显著提升了极端天气事件的预测精度。
English: This paper introduces OneForecast, a global-regional nested weather forecasting framework using graph neural networks to enhance prediction accuracy, particularly for extreme events, by integrating dynamic systems with multi-scale structures and adaptive messaging.
Authors:Chenhui Xu, Dancheng Liu, Yuting Hu, Jiajie Li, Ruiyang Qin, Qingxiao Zheng, Jinjun Xiong
Abstract:
Physics-Informed Neural Networks (PINNs) are a kind of deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between PDE's continuity and PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architecture. Our code is available at https://github.com/miniHuiHui/PINNMamba.
中文: PINNMamba 这一创新框架通过状态空间模型进行子序列建模,解决了物理信息神经网络无法传播初始条件模式的问题,相比现有最优架构可将误差降低高达 86.3%。
English: PINNMamba, a novel framework using State Space Models for sub-sequence modeling, overcomes the limitations of Physics-Informed Neural Networks by enabling initial condition propagation and reducing errors by up to 86.3% compared to existing methods.
Authors:Shengyu Feng, Yiming Yang
Abstract:
This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA), and the other one based on neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.
中文: 本研究提出了正则化朗之万动力学(RLD)采样框架,通过避免局部最优来改进组合优化,并开发了两种求解器,在性能和效率上均达到或超越了现有最优方法。
English: This study introduces Regularized Langevin Dynamics (RLD), a sampling framework that enhances combinatorial optimization by preventing local minima, and develops two solvers that match or surpass state-of-the-art methods in performance and efficiency.
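As a rough illustration only, the sketch below applies a Langevin-style flip probability to binary variables and adds a term that rewards moving away from the current solution, which is the distance-regularization intuition in the abstract; the exact proposal distribution, regularizer, and temperature schedule in RLD differ.

```python
import numpy as np

def rld_step(x, energy_grad, current, step=0.5, lam=0.1, rng=None):
    """Simplified regularized-Langevin-style update on binary variables.

    Flip probabilities follow the estimated energy improvement plus a bonus
    for bits that still match the reference solution, which pushes the expected
    Hamming distance to it upward and discourages getting stuck near local minima."""
    if rng is None:
        rng = np.random.default_rng()
    flip_score = -energy_grad * (1 - 2 * x) + lam * (x == current)
    p_flip = 1.0 / (1.0 + np.exp(-step * flip_score))
    flips = rng.random(x.shape) < p_flip
    return np.where(flips, 1 - x, x)

# toy usage on a random quadratic energy 0.5 * x^T Q x
rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20)); Q = (Q + Q.T) / 2
x = rng.integers(0, 2, size=20).astype(float)
grad = Q @ x                          # gradient of the quadratic energy
x_new = rld_step(x, grad, current=x, rng=rng)
print(int((x_new != x).sum()), "bits flipped")
```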
Authors:Binchi Zhang, Zaiyi Zheng, Zhengzhang Chen, Jundong Li
Abstract:
Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on https://github.com/zhengzaiyi/RotationSymmetry.
中文摘要:本文提出旋转对称性,一种适用于Transformer的连续参数空间对称性,推广了离散的置换对称性,并通过最优匹配算法显著提升了跨语言与视觉任务的模型融合效果。
English Summary: The paper introduces rotation symmetry, a continuous parameter space symmetry for transformers that generalizes discrete permutation symmetry, and proposes an optimal matching algorithm to significantly enhance model fusion across language and vision tasks.
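The core claim, that rotating the query and key projections by the same orthogonal matrix leaves self-attention scores unchanged, is easy to verify numerically. The snippet below checks this identity on random matrices; it is a sanity check of the symmetry, not the paper's parameter matching algorithm.

```python
import torch

# Rotation symmetry in self-attention: rotating W_Q and W_K by the same
# orthogonal matrix R leaves the attention scores Q K^T unchanged, since
# R R^T = I cancels between the two projections.
torch.manual_seed(0)
d_model, d_head, n = 16, 8, 5
X = torch.randn(n, d_model)
W_Q, W_K = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
R, _ = torch.linalg.qr(torch.randn(d_head, d_head))   # random orthogonal matrix

scores = (X @ W_Q) @ (X @ W_K).T
scores_rot = (X @ (W_Q @ R)) @ (X @ (W_K @ R)).T
print(torch.allclose(scores, scores_rot, atol=1e-5))  # True
```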
Authors:Yasi Zhang, Oscar Leong
Abstract:
Learning effective regularization is crucial for solving ill-posed inverse problems, which arise in a wide range of scientific and engineering applications. While data-driven methods that parameterize regularizers using deep neural networks have demonstrated strong empirical performance, they often result in highly nonconvex formulations that lack theoretical guarantees. Recent work has shown that incorporating structured nonconvexity into neural network-based regularizers, such as weak convexity, can strike a balance between empirical performance and theoretical tractability. In this paper, we demonstrate that a broader class of nonconvex functions, difference-of-convex (DC) functions, can yield improved empirical performance while retaining strong convergence guarantees. The DC structure enables the use of well-established optimization algorithms, such as the Difference-of-Convex Algorithm (DCA) and a Proximal Subgradient Method (PSM), which extend beyond standard gradient descent. Furthermore, we provide theoretical insights into the conditions under which optimal regularizers can be expressed as DC functions. Extensive experiments on computed tomography (CT) reconstruction tasks show that our approach achieves strong performance across sparse and limited-view settings, consistently outperforming other weakly supervised learned regularizers. Our code is available at \url{https://github.com/YasminZhang/ADCR}.
Chinese: 本文提出了一种用于逆问题中学习正则化器的凸差函数框架,该框架在CT重建任务中显著提升了实证性能并具备坚实的理论保证,优于现有方法。
English: This paper introduces a difference-of-convex (DC) function framework for learning regularizers in inverse problems, which enhances empirical performance with strong theoretical guarantees and outperforms existing methods in CT reconstruction tasks.
Authors:Akiyoshi Tomihari, Issei Sato
Abstract:
Transformers are challenging to optimize with SGD and typically require adaptive optimizers such as Adam. However, the reasons behind the superior performance of Adam over SGD remain unclear. In this study, we investigate the optimization of transformers by focusing on gradient heterogeneity, defined as the disparity in gradient norms among parameters. Our analysis shows that gradient heterogeneity hinders gradient-based optimization, including SGD, while sign-based optimization, a simplified variant of Adam, is less affected. We further examine gradient heterogeneity in transformers and show that it is influenced by the placement of layer normalization. Experimental results from fine-tuning transformers in both NLP and vision domains validate our theoretical analyses. This study provides insights into the optimization challenges of transformers and offers guidance for designing future optimization algorithms. Code is available at https://github.com/tom4649/gradient-heterogeneity.
中文: 本研究揭示了梯度异质性(受层归一化位置影响)阻碍了Transformer中SGD的优化,而像Adam这样的基于符号的优化方法更具鲁棒性,并在NLP和视觉任务中得到了实验验证。
English: This study reveals that gradient heterogeneity, influenced by layer normalization placement, hinders SGD optimization in transformers, while sign-based methods like Adam are more robust, with experimental validation across NLP and vision tasks.
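The quantity at the center of this analysis, the disparity in gradient norms across parameter tensors, is straightforward to measure, and the contrast between a magnitude-based and a sign-based update is a one-line difference. The sketch below is illustrative (the paper may use a different disparity statistic), not the authors' code.

```python
import torch

def gradient_norm_disparity(model):
    """Spread of per-parameter-tensor gradient norms, here the max/min ratio."""
    norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]
    return max(norms) / (min(norms) + 1e-12)

def sgd_step(params, lr):
    for p in params:
        p.data -= lr * p.grad              # magnitude-sensitive update

def sign_step(params, lr):
    for p in params:
        p.data -= lr * p.grad.sign()       # magnitude-free, hence robust to heterogeneity

model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()
print(f"grad-norm max/min ratio: {gradient_norm_disparity(model):.1f}")
sign_step(model.parameters(), lr=1e-3)
```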
Authors:Kefan Dong, Tengyu Ma
Abstract:
A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between the LLM generating proofs and finetuning it on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both the Lean and Isabelle formal verifiers. With 51.3 billion tokens generated during training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), Proofnet-test (23.9%, pass@3200) and PutnamBench (8/644, pass@3200). We release our code, model, and dataset at this URL: https://github.com/kfdong/STP.
Chinese: 自博弈定理证明器(STP)通过让猜想器生成日益复杂的定理、证明器尝试解决它们,有效应对了形式定理证明中高质量训练数据不足的问题,并在多个基准测试中取得了领先性能。
English: The Self-play Theorem Prover (STP) addresses the scarcity of high-quality training data in formal theorem proving by having a conjecturer generate increasingly challenging theorems and a prover solve them, achieving state-of-the-art results on multiple benchmarks.
Authors:Rachel Mikulinsky, Morris Alper, Shai Gordin, Enrique Jiménez, Yoram Cohen, Hadar Averbuch-Elor
Abstract:
The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. Our code, data, and trained models are available at the project page: https://tau-vailab.github.io/ProtoSnap/
Authors:Bidossessi Emmanuel Agossou, Marius Pedersen, Kiran Raja, Anuja Vats, Pål Anders Floor
Abstract:
Pathology detection in Wireless Capsule Endoscopy (WCE) using deep learning has been explored in the recent past. However, deep learning models can be influenced by the color quality of the dataset used to train them, impacting detection, segmentation and classification tasks. In this work, we evaluate the impact of color correction on pathology detection using two prominent object detection models: Retinanet and YOLOv5. We first generate two color corrected versions of a popular WCE dataset (i.e., SEE-AI dataset) using two different color correction functions. We then evaluate the performance of the Retinanet and YOLOv5 on the original and color corrected versions of the dataset. The results reveal that color correction makes the models generate larger bounding boxes and larger intersection areas with the ground truth annotations. Furthermore, color correction leads to an increased number of false positives for certain pathologies. However, these effects do not translate into a consistent improvement in performance metrics such as F1-scores, IoU, and AP50. The code is available at https://github.com/agossouema2011/WCE2024. Keywords: Wireless Capsule Endoscopy, Color correction, Retinanet, YOLOv5, Detection
中文摘要:本研究评估了颜色校正对无线胶囊内窥镜病理检测的影响,发现尽管颜色校正会扩大检测框并增加假阳性,但并未持续提升关键性能指标。
English Summary: This study evaluates how color correction affects pathology detection in Wireless Capsule Endoscopy using Retinanet and YOLOv5, finding that while it enlarges bounding boxes and increases false positives, it does not consistently improve key performance metrics.
Authors:Soon Jynn Chu, Nalaka Amarasiri, Sandesh Giri, Priyata Kafle
Abstract:
Type 1 Diabetes is a chronic autoimmune condition in which the immune system attacks and destroys insulin-producing beta cells in the pancreas, resulting in little to no insulin production. Insulin helps glucose in the blood enter muscle, fat, and liver cells so they can use it for energy or store it for later use. If insulin is insufficient, it causes sugar to build up in the blood and leads to serious health problems. People with Type 1 Diabetes need synthetic insulin every day. In diabetes management, continuous glucose monitoring is an important feature that provides near real-time blood glucose data. It is useful in deciding the synthetic insulin dose. In this research work, we used machine learning tools, deep neural networks, deep reinforcement learning, and voting and stacking regressors to predict blood glucose levels at 30-min time intervals using the latest DiaTrend dataset. Predicting blood glucose levels is useful for better diabetes management systems. The trained models were compared using several evaluation metrics. Our evaluation results demonstrate the performance of various models across different glycemic conditions for blood glucose prediction. The source code for this work can be found at: https://github.com/soon-jynn-chu/t1d_bg_prediction
中文: 本研究应用机器学习技术,基于DiaTrend数据集以30分钟为间隔预测血糖水平,旨在通过改进血糖监测和胰岛素剂量决策来优化糖尿病管理。
English: This research applies machine learning techniques to predict blood glucose levels at 30-minute intervals using the DiaTrend dataset, aiming to improve diabetes management through enhanced glucose monitoring and insulin dosing decisions.
Authors:Matthew Chen, Joshua Engels, Max Tegmark
Abstract:
Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the \textit{language model itself} around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30\% to 55\% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma and 2$\times$ to 10$\times$ faster on Llama. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once without harming general language model capabilities. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
Chinese: 本研究采用低秩自适应(LoRA)技术对预训练稀疏自编码器(SAE)周围的语言模型进行微调,将交叉熵损失差距降低30%至55%,相比传统方法实现3到20倍的加速收敛,同时保持模型性能不受影响。
English: This work introduces a novel approach using low-rank adaptation (LoRA) to fine-tune language models around pre-trained sparse autoencoders (SAEs), significantly reducing cross-entropy loss gaps by 30% to 55% and achieving faster convergence compared to traditional methods while maintaining model performance.
Authors:Andrey Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, Vladislav Kurenkov
Abstract:
In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems. Code to be released at https://github.com/dunnolab/vintix
中文摘要:情境强化学习(ICRL)通过推理阶段的试错学习展现出开发通用智能体的潜力,其中算法蒸馏框架成为构建多功能行动模型的优越替代方案,优于专家蒸馏方法。
English Summary: In-Context Reinforcement Learning (ICRL) shows potential for developing generalist agents through trial-and-error learning at inference time, with Algorithm Distillation emerging as a competitive alternative to expert distillation for creating versatile action models.
Authors:Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
Abstract:
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
中文摘要:本研究通过构建精选数据集和预算强制技术,提出了一种简单的测试时扩展方法,使Qwen2.5-32B模型在数学推理任务上超越OpenAI的o1-preview,同时保持完全开源。
English Summary: This study introduces a simple test-time scaling method using a curated dataset and budget forcing technique, enabling the Qwen2.5-32B model to outperform OpenAI's o1-preview on math reasoning tasks while being fully open-source.
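Budget forcing as described above is essentially a wrapper around the decoding loop. The sketch below assumes a hypothetical `generate` helper that returns the generated text, its token count, and whether the model tried to stop thinking; the real s1 implementation operates on the tokenizer and end-of-thinking delimiters directly.

```python
def generate_with_budget_forcing(generate, prompt, min_thinking_tokens=512, max_waits=2):
    """Illustrative wrapper: if the model ends its reasoning before the token
    budget is spent, append "Wait" and let it continue, which often triggers
    self-correction; thinking could likewise be truncated once a maximum
    budget is exhausted (not shown here)."""
    text, n_tokens, done = generate(prompt)
    waits = 0
    while done and n_tokens < min_thinking_tokens and waits < max_waits:
        prompt = prompt + text + "\nWait"      # force the model to keep reasoning
        text, extra, done = generate(prompt)
        n_tokens += extra
        waits += 1
    return text

# toy stand-in for an LLM call
fake_llm = lambda p: ("... some reasoning ...", 200, True)
print(generate_with_budget_forcing(fake_llm, "Solve: 2+2", min_thinking_tokens=512))
```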
Authors:Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton
Abstract:
Fine-tuning large language models (LLMs) on devices remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with device model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying device capabilities constrain LoRA's feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable devices to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the devices, FSLoRA flexibly adapts to device-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through comprehensive experiments on multiple datasets and LLM models, we demonstrate FSLoRA's performance improvements compared to various baselines. The code is available at https://github.com/wenzhifang/Federated-Sketching-LoRA-Implementation.
中文: 在资源受限的客户端微调大语言模型具有挑战性,但FSLoRA通过草图机制使客户端能选择性更新全局LoRA模块的子矩阵,灵活适应不同客户端的通信与计算限制,并提供了严格的收敛性分析和实验性能提升验证。
English: Fine-tuning large language models on resource-limited clients is challenging, but FSLoRA introduces a sketching mechanism that allows clients to selectively update submatrices of global LoRA modules, adapting to their specific constraints while providing rigorous convergence analysis and demonstrating performance gains in experiments.
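A toy sketch of the submatrix-selection idea in the abstract: a device with sketching ratio gamma touches only a sampled subset of the global LoRA rank components. Shapes, the learning rate, and the random sampling are illustrative assumptions, not the FSLoRA reference implementation.

```python
# Illustrative sketch (not the FSLoRA code): a device updates only the columns of A
# and rows of B that correspond to a sampled subset of rank components.
import torch

d_in, d_out, r = 768, 768, 16          # global LoRA rank r maintained by the server
A = torch.zeros(d_in, r)               # global LoRA factor A
B = torch.randn(r, d_out) * 0.01       # global LoRA factor B

def device_update(A, B, gamma: float, local_loss_fn):
    """One local step that touches only round(gamma * r) rank components."""
    k = max(1, int(round(gamma * r)))
    idx = torch.randperm(r)[:k]                    # sketch: sampled rank components
    A_sub, B_sub = A[:, idx].clone(), B[idx, :].clone()
    A_sub.requires_grad_(True); B_sub.requires_grad_(True)
    loss = local_loss_fn(A_sub, B_sub)             # loss on the device's local data
    loss.backward()
    with torch.no_grad():                          # only the sketched submatrices move
        A[:, idx] -= 0.01 * A_sub.grad
        B[idx, :] -= 0.01 * B_sub.grad
    return idx                                     # server aggregates per-index updates

# toy local objective, standing in for the device's fine-tuning loss
x = torch.randn(4, d_in)
device_update(A, B, gamma=0.5, local_loss_fn=lambda As, Bs: ((x @ As) @ Bs).pow(2).mean())
```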
Authors:Xingyou Song, Dara Bahri
Abstract:
Language models have recently been shown capable of performing regression wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal sequence decoding models as numeric regression heads given any feature representation. We find that, despite being trained in the usual way - for next-token prediction via cross-entropy loss - decoder-based heads are as performant as standard pointwise heads when benchmarked over standard regression tasks, while being flexible enough to capture smooth numeric distributions, such as in the task of density estimation.
中文摘要:语言模型能够通过将数值预测解码为字符串来有效执行回归任务,其中基于解码器的回归头在性能上与标准方法相当,同时能更好地捕捉平滑数值分布,如密度估计。
English Summary: Language models can effectively perform numeric regression by decoding predictions as strings, with decoder-based heads matching the performance of standard regression methods while offering greater flexibility for tasks like density estimation.
Authors:Natalie Maus, Kyurae Kim, Yimeng Zeng, Haydn Thomas Jones, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Jacob R. Gardner
Abstract:
In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of $T$ black-box objective functions, $f_1$, ..., $f_T$, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In this work, we consider a problem setting that departs from this paradigm: finding a small set of K < T solutions, that collectively "covers" the T objectives. A set of solutions is defined as "covering" if, for each objective $f_1$, ..., $f_T$, there is at least one good solution. A motivating example for this problem setting occurs in drug design. For example, we may have T pathogens and aim to identify a set of K < T antibiotics such that at least one antibiotic can be used to treat each pathogen. To address this problem, we propose Multi-Objective Coverage Bayesian Optimization (MOCOBO), a principled algorithm designed to efficiently find a covering set. We validate our approach through experiments on challenging high-dimensional tasks, including applications in peptide and molecular design, where MOCOBO is shown to find high-performing covering sets of solutions. The results show that the coverage of the K < T solutions found by MOCOBO matches or nearly matches the coverage of T solutions obtained by optimizing each objective individually. Furthermore, in in vitro experiments, the peptides found by MOCOBO exhibited high potency against drug-resistant pathogens, further demonstrating the potential of MOCOBO for drug discovery. We make code available here: https://github.com/nataliemaus/mocobo.
中文: 本文提出了多目标覆盖贝叶斯优化(MOCOBO)算法,通过寻找K < T个解的集合来有效覆盖所有T个目标,在药物设计等应用中展现出优越性能。
English: This paper introduces Multi-Objective Coverage Bayesian Optimization (MOCOBO), a novel algorithm that efficiently identifies a small set of K < T solutions to collectively cover all T objectives in multi-objective black-box optimization, with demonstrated success in drug design applications.
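The "covering" criterion itself is easy to write down. The brute-force toy below only illustrates how a set of K candidates is scored against T objectives (best member per objective, then aggregate); it is not the MOCOBO acquisition procedure.

```python
# Toy illustration of the coverage criterion (not the MOCOBO algorithm): a set covers
# the T objectives well if, for every objective, at least one of its K members scores highly.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
T, n_candidates, K = 5, 40, 2
scores = rng.normal(size=(n_candidates, T))   # scores[i, t]: candidate i on objective f_t

def coverage(subset, scores):
    best_per_objective = scores[list(subset)].max(axis=0)   # best member per objective, shape (T,)
    return best_per_objective.sum()                         # or .min() for worst-case coverage

best_set = max(combinations(range(n_candidates), K), key=lambda s: coverage(s, scores))
print("best covering set:", best_set, "coverage:", coverage(best_set, scores))
```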
Authors:Arsenii Gavrikov, Julián García Pardiñas, Alberto Garfagnini
Abstract:
Ensuring reliable data collection in large-scale particle physics experiments demands Data Quality Monitoring (DQM) procedures to detect possible detector malfunctions and preserve data integrity. Traditionally, this resource-intensive task has been handled by human shifters who struggle with frequent changes in operational conditions. We present DINAMO: a novel, interpretable, robust, and scalable DQM framework designed to automate anomaly detection in time-dependent settings. Our approach constructs evolving histogram templates with built-in uncertainties, featuring both a statistical variant - extending the classical Exponentially Weighted Moving Average (EWMA) - and a machine learning (ML)-enhanced version that leverages a transformer encoder for improved adaptability. Experimental validations on synthetic datasets demonstrate the high accuracy, adaptability, and interpretability of these methods. The statistical variant is being commissioned in the LHCb experiment at the Large Hadron Collider, underscoring its real-world impact. The code used in this study is available at https://github.com/ArseniiGav/DINAMO.
中文摘要:DINAMO提出了一种可解释且可扩展的框架,用于自动化粒子物理实验中的异常检测,结合了统计和机器学习方法,已在LHCb实验中验证其高精度并投入实际应用。
English Summary: DINAMO introduces an interpretable and scalable framework for automating anomaly detection in particle physics experiments, combining statistical and machine learning approaches that have demonstrated high accuracy and are being implemented in the LHCb experiment.
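A minimal sketch of the statistical variant's core ingredients as described above: an exponentially weighted histogram template with a per-bin uncertainty and a chi-square-like pull used as an anomaly score. The smoothing constant, threshold, and Poisson toy data are assumptions for illustration, not the DINAMO implementation.

```python
# Minimal EWMA-style evolving histogram template with uncertainties (illustrative only).
import numpy as np

class EwmaTemplate:
    def __init__(self, n_bins: int, alpha: float = 0.1):
        self.alpha = alpha
        self.mean = np.zeros(n_bins)      # template histogram
        self.var = np.ones(n_bins)        # per-bin uncertainty estimate
        self.initialized = False

    def score(self, hist: np.ndarray) -> float:
        """Reduced chi-square-like distance between a new histogram and the template."""
        pulls = (hist - self.mean) ** 2 / (self.var + 1e-8)
        return float(pulls.mean())

    def update(self, hist: np.ndarray) -> None:
        hist = hist.astype(float)
        if not self.initialized:
            self.mean = hist.copy()
            self.var = np.maximum(hist, 1.0)   # Poisson-like starting uncertainty
            self.initialized = True
            return
        delta = hist - self.mean
        self.mean += self.alpha * delta                                   # EWMA mean
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta**2)  # EWMA variance

rng = np.random.default_rng(1)
tmpl = EwmaTemplate(n_bins=50)
for run in range(200):
    h = rng.poisson(100 if run != 150 else 140, size=50).astype(float)  # run 150 drifts
    if tmpl.initialized and tmpl.score(h) > 5.0:                        # flag anomalous runs
        print(f"run {run}: anomaly score {tmpl.score(h):.2f}")
    tmpl.update(h)
```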
Authors:Hong Huang, Hai Yang, Yuan Chen, Jiaxun Ye, Dapeng Wu
Abstract:
Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as greedy adjustments, unstable topologies, and communication inefficiency, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose Federated Robust pruning via combinatorial Thompson Sampling (FedRTS), a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable, farsighted information instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: https://github.com/Little0o0/FedRTS
中文摘要:FedRTS提出了一种基于组合汤普森采样的新型联邦学习框架,通过概率决策机制开发鲁棒稀疏模型,在数据异构场景下实现了更优性能并显著降低了通信成本。
English Summary: FedRTS introduces a novel federated learning framework using combinatorial Thompson Sampling to develop robust sparse models, achieving superior performance and reduced communication costs in heterogeneous data environments.
Authors:Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
Abstract:
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that unsupervisedly learns a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods, and achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility. Our code is available at https://github.com/biomedical-cybernetics/pivoting-factorization.
中文摘要:本文提出了Pivoting Factorization (PIFA),一种无损元低秩表示方法,通过消除冗余信息提升模型压缩效果,并结合误差最小化重构的MPIFA框架,在保持与半结构化剪枝相当性能的同时显著提高了GPU效率。
English Summary: The paper introduces Pivoting Factorization (PIFA), a lossless meta low-rank representation that enhances model compression by eliminating redundancy, and MPIFA, an end-to-end framework combining PIFA with error-minimizing reconstruction to achieve performance comparable to semi-structured pruning while improving GPU efficiency.
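The pivoting idea (keep linearly independent "pivot" rows, store the remaining rows only as combination coefficients) can be illustrated on a small rank-deficient matrix. The sketch below uses QR with column pivoting and least squares as stand-ins; it is not the paper's implementation.

```python
# Toy illustration of row pivoting for a lossless low-rank representation (not the PIFA code).
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 128))   # rank-8 matrix

# column pivoting on W.T orders the *rows* of W by linear independence
_, R, piv = qr(W.T, pivoting=True, mode='economic')
rank = int((np.abs(np.diag(R)) > 1e-10).sum())
pivot_rows, other_rows = piv[:rank], piv[rank:]

P = W[pivot_rows]                                          # (rank, 128) pivot rows
# coefficients expressing every non-pivot row as a combination of the pivot rows
C, *_ = np.linalg.lstsq(P.T, W[other_rows].T, rcond=None)  # (rank, n_other)

# lossless reconstruction check
W_rec = np.empty_like(W)
W_rec[pivot_rows] = P
W_rec[other_rows] = C.T @ P
print("max reconstruction error:", np.abs(W - W_rec).max())
```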
Authors:Seungheun Baek, Soyon Park, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Abstract:
Motivation: Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is incorporating the concept of gene regulatory networks (GRNs) in designing deep learning models such as VAEs. GRNs elicit the underlying causal relationships between genes and are capable of explaining the transcriptional responses caused by genetic perturbation treatments. Results: We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways. GPO-VAE is available at https://github.com/dmis-lab/GPO-VAE
Chinese: GPO-VAE模型通过将基因调控网络整合到变分自编码器中,显著提升了模型的可解释性,在遗传扰动预测任务中表现优异,并能生成与实验验证通路一致的生物学调控网络。
English: The GPO-VAE model integrates gene regulatory networks into a variational autoencoder framework to enhance explainability and achieves state-of-the-art performance in predicting cellular responses to genetic perturbations while generating biologically meaningful regulatory networks.
Authors:Anh Bui, Trang Vu, Long Vuong, Trung Le, Paul Montague, Tamas Abraham, Junae Kim, Dinh Phung
Abstract:
Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. The common principle of previous works to remove a specific concept is to map it to a fixed generic concept, such as a neutral concept or just an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which \emph{dynamically} selects optimal target concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance. Our code is published at {https://github.com/tuananhbui89/Adaptive-Guided-Erasure}.
Chinese: 本文提出自适应引导擦除方法(AGE),通过动态选择最优目标概念来消除扩散模型中的不良概念,有效减少对其他概念的副作用,并在实验中显著优于现有擦除技术。
English: This paper introduces Adaptive Guided Erasure (AGE), a method that dynamically selects optimal target concepts for erasing undesirable ones in diffusion models, minimizing side effects on unrelated concepts and outperforming existing techniques.
Authors:Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Abstract:
Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates the impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and, for the first time, assesses the performance of large language models (LLMs) and tabular LLMs on a tabular benchmark. Our study demonstrates three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the shifted feature set importance has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation.
Benchmark: https://github.com/LAMDASZ-ML/TabFSBench.
中文摘要:本文首次对表格数据中的特征偏移问题进行全面研究,提出TabFSBench基准评估不同特征偏移场景下的模型表现,揭示了模型适用性局限及性能相关性等关键发现。
English Summary: This paper presents the first comprehensive study on feature shifts in tabular data, introducing TabFSBench to evaluate model performance across various feature-shift scenarios and revealing key findings about model limitations and performance correlations.
Authors:Jaesin Ahn, Heechul Jung
Abstract:
Text-to-image diffusion models show remarkable generation performance following text prompts, but risk generating Not Safe For Work (NSFW) contents from unsafe prompts. Existing approaches, such as prompt filtering or concept unlearning, fail to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe contents generation, while reproducing the original safe embeddings. DES also neutralizes the nudity embedding, extracted using prompt ``nudity", by aligning it with neutral embedding to enhance robustness against adversarial attacks. These methods ensure both robust defense and high-quality image generation. Additionally, DES can be adopted in a plug-and-play manner and requires zero inference overhead, facilitating its deployment. Extensive experiments on diverse attack types, including black-box and white-box scenarios, demonstrate DES's state-of-the-art performance in both defense capability and benign image generation quality. Our model is available at https://github.com/aei13/DES.
中文:提出的扭曲嵌入空间(DES)方法通过将不安全文本嵌入转换为安全区域,有效防止生成不良内容,同时保持高质量良性图像输出并增强对抗攻击的防御能力。
English: The proposed Distorting Embedding Space (DES) method effectively prevents NSFW content generation by transforming unsafe text embeddings into safe regions while maintaining high-quality benign image output and robust defense against adversarial attacks.
Authors:Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
Abstract:
Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at https://github.com/facebookresearch/tokentune.
Chinese: TokenTune是一种针对Transformer模型的高效微调方法,通过在反向传播中仅处理部分输入标记来减少激活内存占用,在保持与全参数微调相当性能的同时显著降低了内存消耗。
English: TokenTune is a memory-efficient fine-tuning method for transformer models that reduces activation memory by processing only a subset of tokens during backward passes, achieving comparable performance to full fine-tuning while significantly lowering memory usage.
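A sketch of the core trick described above: gradients flow only through a randomly selected subset of token positions, achieved here by detaching the remaining positions of a generic hidden-state tensor. The helper name and shapes are illustrative assumptions, not the TokenTune code.

```python
# Sketch: backpropagate through only k token positions per sequence by detaching the rest.
import torch

def keep_k_tokens(hidden: torch.Tensor, k: int) -> torch.Tensor:
    """hidden: (batch, seq_len, dim). Detach all but k token positions per sequence."""
    batch, seq_len, _ = hidden.shape
    keep = torch.zeros(batch, seq_len, 1, dtype=torch.bool, device=hidden.device)
    for b in range(batch):
        idx = torch.randperm(seq_len, device=hidden.device)[:k]
        keep[b, idx] = True
    # selected positions stay on the autograd graph; the rest are treated as constants
    return torch.where(keep, hidden, hidden.detach())

x = torch.randn(2, 16, 32, requires_grad=True)
layer = torch.nn.Linear(32, 32)
out = layer(keep_k_tokens(x, k=4))
out.sum().backward()
print("positions receiving input gradient per sequence:",
      (x.grad.abs().sum(-1) > 0).sum(dim=1))   # equals k for each sequence
```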
Authors:Zehong Wang, Zheyuan Zhang, Tianyi Ma, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye
Abstract:
Graph learning tasks often hinge on identifying key substructure patterns -- such as triadic closures in social networks or benzene rings in molecular graphs -- that underpin downstream performance. However, most existing graph neural networks (GNNs) rely on message passing, which aggregates local neighborhood information iteratively and struggles to explicitly capture such fundamental motifs, like triangles, k-cliques, and rings. This limitation hinders both expressiveness and long-range dependency modeling. In this paper, we introduce the Neural Graph Pattern Machine (GPM), a novel framework that bypasses message passing by learning directly from graph substructures. GPM efficiently extracts, encodes, and prioritizes task-relevant graph patterns, offering greater expressivity and improved ability to capture long-range dependencies. Empirical evaluations across four standard tasks -- node classification, link prediction, graph classification, and graph regression -- demonstrate that GPM outperforms state-of-the-art baselines. Further analysis reveals that GPM exhibits strong out-of-distribution generalization, desirable scalability, and enhanced interpretability. Code and datasets are available at: https://github.com/Zehong-Wang/GPM.
中文摘要:神经图模式机(GPM)框架通过直接从图子结构学习,克服了传统图神经网络的局限性,在多项任务中实现更优性能,同时展现出更强的表达能力和可解释性。
English Summary: The Neural Graph Pattern Machine (GPM) framework overcomes limitations of traditional graph neural networks by directly learning from key graph substructures, achieving superior performance across multiple tasks while demonstrating enhanced expressivity and interpretability.
Authors:Harshwardhan Praveen, Jacob Brown, Christopher Earls
Abstract:
In this work, we present a mesh-independent, data-driven library, chebgreen, to mathematically model one-dimensional systems that possess an associated control parameter and whose governing partial differential equation is unknown. The proposed method learns an Empirical Green's Function for the associated, but hidden, boundary value problem, in the form of a Rational Neural Network from which we subsequently construct a bivariate representation in a Chebyshev basis. We uncover the Green's function, at an unseen control parameter value, by interpolating the left and right singular functions within a suitable library, expressed as points on a manifold of Quasimatrices, while the associated singular values are interpolated with Lagrange polynomials.
中文摘要:本研究提出了chebgreen这一与网格无关的数据驱动库,通过有理神经网络和切比雪夫基表示,从数据中学习经验格林函数来建模控制方程未知的一维系统。
English Summary: The study introduces chebgreen, a mesh-independent data-driven library that models one-dimensional systems with unknown governing equations by learning Empirical Green's Functions through Rational Neural Networks and Chebyshev basis representations.
Authors:Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra
Abstract:
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
Authors:Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink
Abstract:
In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at https://github.com/donghao51/Awesome-Multimodal-Adaptation.
中文: 本综述全面回顾了多模态领域自适应与泛化的研究进展,涵盖从传统方法到基础模型的各类方法,并分析了关键挑战、应用场景及未来研究方向。
English: This survey comprehensively reviews multimodal domain adaptation and generalization, covering traditional methods to foundation models, analyzing key challenges, applications, and future research directions.
Authors:Matthieu Barreau, Haoming Shen
Abstract:
Physics-Informed Neural Networks (PINNs) have emerged as powerful tools for integrating physics-based models with data by minimizing both data and physics losses. However, this multi-objective optimization problem is notoriously challenging, with some benchmark problems leading to unfeasible solutions. To address these issues, various strategies have been proposed, including adaptive weight adjustments in the loss function. In this work, we introduce clear definitions of accuracy and robustness in the context of PINNs and propose a novel training algorithm based on the Primal-Dual (PD) optimization framework. Our approach enhances the robustness of PINNs while maintaining comparable performance to existing weight-balancing methods. Numerical experiments demonstrate that the PD method consistently achieves reliable solutions across all investigated cases, even in the low-data regime, and can be easily implemented, facilitating its practical adoption. The code is available at https://github.com/haoming-SHEN/Accuracy-and-Robustness-of-Weight-Balancing-Methods-for-Training-PINNs.git.
中文: 本文提出了一种基于原始-对偶优化框架的新训练算法,增强了物理信息神经网络的鲁棒性,在保持与现有权重平衡方法相当性能的同时,实现了所有测试场景下的可靠求解。
English: This paper introduces a Primal-Dual optimization framework to enhance the robustness of Physics-Informed Neural Networks (PINNs), achieving reliable performance across various scenarios while maintaining accuracy comparable to existing methods.
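Below is a generic primal-dual training loop for a toy PINN, in the spirit of the abstract: the network takes a primal step on data loss plus a multiplier-weighted physics residual, and the multiplier is then raised by dual ascent while the residual remains large. The ODE, step sizes, and update form are assumptions of this sketch, not necessarily the authors' exact algorithm.

```python
# Generic primal-dual sketch for a PINN solving u' + u = 0 with a few labeled points.
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
lam, eta_dual = torch.tensor(1.0), 1e-2                    # multiplier and dual step size

x_data = torch.linspace(0, 1, 8).unsqueeze(1)              # few labeled points
y_data = torch.exp(-x_data)                                # exact solution u(x) = exp(-x)
x_col = torch.rand(128, 1, requires_grad=True)             # collocation points

for step in range(2000):
    # primal step: minimize data loss + lam * physics residual
    u = net(x_col)
    du = torch.autograd.grad(u.sum(), x_col, create_graph=True)[0]
    physics = ((du + u) ** 2).mean()                       # residual of u' + u = 0
    data = ((net(x_data) - y_data) ** 2).mean()
    loss = data + lam * physics
    opt.zero_grad(); loss.backward(); opt.step()

    # dual step: increase lam while the physics constraint is still violated
    with torch.no_grad():
        lam = torch.clamp(lam + eta_dual * physics.detach(), min=0.0)
```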
Authors:Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal
Abstract:
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
中文摘要:本研究提出具有差分隐私保证的私有引导对齐算法(PSA),通过激活编辑在保护私有数据集的同时对齐大语言模型,在七个基准测试中以最小性能损失实现有效对齐。
English Summary: This study introduces the Private Steering for LLM Alignment (PSA) algorithm, which uses differentially private activation editing to align large language models with private datasets while minimizing performance loss across seven benchmarks.
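To ground the setting, here is how a steering direction can be computed from private demonstration pairs with the standard Gaussian mechanism: per-example contributions are norm-clipped and noise calibrated to the clip bound is added. This sketches the problem, not the PSA algorithm itself; the clip bound and noise multiplier are placeholder values.

```python
# Illustrative differentially private steering vector via clip-and-noise (not PSA).
import torch

def dp_steering_vector(pos_acts, neg_acts, clip: float = 1.0, sigma: float = 2.0):
    """pos_acts/neg_acts: (n, d) activations from positive/negative demonstrations."""
    diffs = pos_acts - neg_acts                                  # per-pair contribution
    norms = diffs.norm(dim=1, keepdim=True).clamp(min=1e-8)
    clipped = diffs * torch.clamp(clip / norms, max=1.0)         # bound each pair's influence
    noise = torch.randn(diffs.shape[1]) * sigma * clip / len(diffs)
    return clipped.mean(dim=0) + noise                           # noisy mean direction

d = 256
pos = torch.randn(64, d) + 0.5      # stand-ins for activations on truthful demonstrations
neg = torch.randn(64, d)            # stand-ins for activations on hallucinated demonstrations
v = dp_steering_vector(pos, neg)
# at inference, the vector would be added to the residual stream of chosen layers:
# hidden_state = hidden_state + alpha * v
```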
Authors:Benjamin Feuer, Chinmay Hegde
Abstract:
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
中文:WILDCHAT-50M作为最大公开聊天数据集的推出,支持对语言模型后训练技术进行广泛比较分析,并证明新SFT混合方法能以更少样本实现更优性能。
English: The introduction of WILDCHAT-50M, the largest public chat dataset, enables extensive comparative analysis of language model post-training techniques and demonstrates superior performance with a new SFT mixture using significantly fewer samples.
Authors:Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
Abstract:
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.
Chinese: 本文提出GuardReasoner,一种基于推理的LLM安全防护方法,通过专门训练显著提升模型性能与可解释性,在多项基准测试中表现卓越。
English: This paper introduces GuardReasoner, a reasoning-based safeguard for LLMs that enhances safety through specialized training, achieving superior performance and explainability across multiple benchmarks.
Authors:Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
Abstract:
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA
中文:MedXpertQA是一个包含4,460道跨17个专科医学难题的权威评测基准,通过严格筛选和多模态临床数据,专门评估超越传统问答的高级推理能力。
English: MedXpertQA is a challenging medical benchmark featuring 4,460 expert-level questions across 17 specialties, incorporating rigorous filtering and multimodal clinical data to evaluate advanced reasoning beyond traditional QA pairs.
Authors:Jinlu Wang, Yanfeng Sun, Jiapu Wang, Junbin Gao, Shaofan Wang, Jipeng Guo
Abstract:
Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in various graph representation learning tasks. However, most existing GNNs focus primarily on capturing local information through explicit graph convolution, often neglecting global message-passing. This limitation hinders the establishment of a collaborative interaction between global and local information, which is crucial for comprehensively understanding graph data. To address these challenges, we propose a novel framework called Comprehensive Graph Representation Learning (ComGRL). ComGRL integrates local information into global information to derive powerful representations. It achieves this by implicitly smoothing local information through flexible graph contrastive learning, ensuring reliable representations for subsequent global exploration. Then ComGRL transfers the locally derived representations to a multi-head self-attention module, enhancing their discriminative ability by uncovering diverse and rich global correlations. To further optimize local information dynamically under the self-supervision of pseudo-labels, ComGRL employs a triple sampling strategy to construct mixed node pairs and applies reliable Mixup augmentation across attributes and structure for local contrastive learning. This approach broadens the receptive field and facilitates coordination between local and global representation learning, enabling them to reinforce each other. Experimental results across six widely used graph datasets demonstrate that ComGRL achieves excellent performance in node classification tasks. The code could be available at https://github.com/JinluWang1002/ComGRL.
中文:提出的ComGRL框架通过图对比学习和多头自注意力机制,将局部信息融入全局上下文以增强图表示学习,在多个数据集的节点分类任务中取得了优异性能。
English: The proposed ComGRL framework enhances graph representation learning by integrating local information into global contexts through graph contrastive learning and multi-head self-attention, achieving superior performance in node classification tasks across multiple datasets.
Authors:Haoxiong Liu, Jiacheng Sun, Zhenguo Li, Andrew C Yao
Abstract:
The synergy between deep learning models and traditional automation tools, such as built-in tactics of the proof assistant and off-the-shelf automated theorem provers, plays a crucial role in developing robust and efficient neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when explicitly invoked by the model or at a single granularity level, failing to fully exploit their power. To solve this issue, we propose ProofAug, a procedure that equips LLMs with automation methods at various granularities through fine-grained structure analysis of model-generated proof proposals. ProofAug also serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. The superiority of our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version) with 2100 queries to the model per problem (In contrast, the previous SOTA in Isabelle, Subgoal-XL, only achieves 56.1% using 16384 queries per problem). We also implement a Lean 4 version of ProofAug that can improve the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4% on miniF2F-test. Our code is available at https://github.com/haoxiongliu/ProofAug.
中文摘要:本文提出ProofAug方法,通过多粒度自动化工具增强大语言模型在神经定理证明中的能力,在miniF2F基准测试中以更少计算查询量实现了最先进的性能表现。
English Summary: This paper introduces ProofAug, a method that enhances large language models with multi-granular automation tools for neural theorem proving, achieving state-of-the-art performance on the miniF2F benchmark with significantly fewer computational queries.
Authors:Amitay Sicherman, Kira Radinsky
Abstract:
The challenge in computational biology and drug discovery lies in creating comprehensive representations of proteins and molecules that capture their intrinsic properties and interactions. Traditional methods often focus on unimodal data, such as protein sequences or molecular structures, limiting their ability to capture complex biochemical relationships. This work enhances these representations by integrating biochemical reactions encompassing interactions between molecules and proteins. By leveraging reaction data alongside pre-trained embeddings from state-of-the-art protein and molecule models, we develop ReactEmbed, a novel method that creates a unified embedding space through contrastive learning. We evaluate ReactEmbed across diverse tasks, including drug-target interaction, protein-protein interaction, protein property prediction, and molecular property prediction, consistently surpassing all current state-of-the-art models. Notably, we showcase ReactEmbed's practical utility through successful implementation in lipid nanoparticle-based drug delivery, enabling zero-shot prediction of blood-brain barrier permeability for protein-nanoparticle complexes. The code and comprehensive database of reaction pairs are available for open use at \href{https://github.com/amitaysicherman/ReactEmbed}{GitHub}.
中文: 本研究提出ReactEmbed方法,通过整合生化反应与预训练的蛋白质和分子嵌入来创建统一表征,在多项任务中均超越现有最优模型,并成功应用于药物递送领域展示了其实用价值。
English: This work introduces ReactEmbed, a novel method that integrates biochemical reactions with pre-trained protein and molecule embeddings to create a unified representation, consistently outperforming state-of-the-art models across various tasks and demonstrating practical utility in drug delivery applications.
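A minimal sketch of contrastive alignment over reaction pairs: frozen protein and molecule embeddings are projected into a shared space and trained with a symmetric InfoNCE loss. The dimensions and random placeholder embeddings are assumptions; this is not the ReactEmbed code.

```python
# Sketch: align pre-trained protein and molecule embeddings with InfoNCE over reaction pairs.
import torch
import torch.nn.functional as F

d_prot, d_mol, d_shared = 1280, 768, 256
proj_p = torch.nn.Linear(d_prot, d_shared)
proj_m = torch.nn.Linear(d_mol, d_shared)
opt = torch.optim.Adam(list(proj_p.parameters()) + list(proj_m.parameters()), lr=1e-4)

def info_nce(z1, z2, tau=0.07):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                       # (batch, batch) similarity matrix
    labels = torch.arange(len(z1))                 # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# placeholders for frozen embeddings of proteins/molecules that co-occur in reactions
prot_emb = torch.randn(32, d_prot)
mol_emb = torch.randn(32, d_mol)
loss = info_nce(proj_p(prot_emb), proj_m(mol_emb))
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```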
Authors:Adarsh Kappiyath, Abhra Chaudhuri, Ajay Jaiswal, Ziquan Liu, Yunpeng Li, Xiatian Zhu, Lu Yin
Abstract:
Ranking samples by fine-grained estimates of spuriosity (the degree to which spurious cues are present) has recently been shown to significantly benefit bias mitigation, over the traditional binary biased-\textit{vs}-unbiased partitioning of train sets. However, this spuriosity ranking comes with the requirement of human supervision. In this paper, we propose a debiasing framework based on our novel \ul{Se}lf-Guided \ul{B}ias \ul{Ra}nking (\emph{Sebra}), that mitigates biases (spurious correlations) via an automatic ranking of data points by spuriosity within their respective classes. Sebra leverages a key local symmetry in Empirical Risk Minimization (ERM) training -- the ease of learning a sample via ERM inversely correlates with its spuriousity; the fewer spurious correlations a sample exhibits, the harder it is to learn, and vice versa. However, globally across iterations, ERM tends to deviate from this symmetry. Sebra dynamically steers ERM to correct this deviation, facilitating the sequential learning of attributes in increasing order of difficulty, \ie, decreasing order of spuriosity. As a result, the sequence in which Sebra learns samples naturally provides spuriousity rankings. We use the resulting fine-grained bias characterization in a contrastive learning framework to mitigate biases from multiple sources. Extensive experiments show that Sebra consistently outperforms previous state-of-the-art unsupervised debiasing techniques across multiple standard benchmarks, including UrbanCars, BAR, CelebA, and ImageNet-1K. Code, pre-trained models, and training logs are available at https://kadarsh22.github.io/sebra_iclr25/.
Authors:Qingxiang Liu, Chenghao Liu, Sheng Sun, Di Yao, Yuxuan Liang
Abstract:
Unsupervised anomaly detection of multivariate time series is a challenging task, given the requirements of deriving a compact detection criterion without accessing the anomaly points. The existing methods are mainly based on reconstruction error or association divergence, which are both confined to isolated subsequences with limited horizons, hardly promising unified series-level criterion. In this paper, we propose the Global Dictionary-enhanced Transformer (GDformer) with a renovated dictionary-based cross attention mechanism to cultivate the global representations shared by all normal points in the entire series. Accordingly, the cross-attention maps reflect the correlation weights between the point and global representations, which naturally leads to the representation-wise similarity-based detection criterion. To foster more compact detection boundary, prototypes are introduced to capture the distribution of normal point-global correlation weights. GDformer consistently achieves state-of-the-art unsupervised anomaly detection performance on five real-world benchmark datasets. Further experiments validate the global dictionary has great transferability among various datasets. The code is available at https://github.com/yuppielqx/GDformer.
中文摘要:本文提出GDformer方法,通过全局字典和交叉注意力机制构建统一检测标准,在多变量时间序列的无监督异常检测中实现了最优性能。
English Summary: The paper introduces GDformer, a novel unsupervised anomaly detection method for multivariate time series that uses a global dictionary and cross-attention mechanism to establish a unified detection criterion, achieving state-of-the-art performance across multiple datasets.
Authors:Siyuan Jiang, Yihan Hu, Wenjie Li, Pengcheng Zeng
Abstract:
Functional data - observations in the form of curves or trajectories - arise in diverse domains such as biomedical sensing, motion capture, and handwriting recognition. A core challenge in functional data analysis (FDA) is accounting for phase variability, where misaligned temporal patterns hinder accurate inference. We introduce DeepFRC, an end-to-end deep learning framework for joint functional registration and classification. Unlike conventional approaches that decouple alignment and prediction, DeepFRC integrates class-aware elastic warping and a learnable basis representation into a unified architecture. This design enables temporal alignment and dimensionality reduction to be jointly optimized with classification, improving both interpretability and accuracy. We establish the first theoretical connection between alignment quality and generalization error, and validate our model on synthetic and real-world benchmarks. DeepFRC consistently outperforms state-of-the-art methods, especially in scenarios with complex temporal misalignment. Code is available at: https://github.com/Drivergo-93589/DeepFRC.
Chinese: DeepFRC 是一个端到端的深度学习框架,通过整合类别感知的弹性扭曲和可学习基表示,联合优化功能配准与分类,在处理时间错位时提高了准确性和可解释性,并优于现有先进方法。
English: DeepFRC is an end-to-end deep learning framework that jointly optimizes functional registration and classification by integrating class-aware elastic warping and a learnable basis representation, improving accuracy and interpretability while outperforming state-of-the-art methods in handling temporal misalignment.
Authors:Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which are scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
中文: MILS是一种无需训练的方法,通过迭代生成和评分候选答案来增强LLM的多模态能力,在图像、视频和音频描述等任务中实现了最先进的性能。
English: MILS is a training-free method that enhances multimodal capabilities in LLMs through iterative candidate generation and scoring, achieving state-of-the-art performance in tasks like captioning and media generation.
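Schematically, the loop amounts to proposing candidates, scoring them with an off-the-shelf scorer, and feeding the scored list back into the prompt. The `llm_propose` and `score` callables below are hypothetical interfaces, not the MILS codebase.

```python
# Schematic generate-score-feedback loop (hypothetical interfaces, not the MILS code).
def mils_solve(task_prompt, llm_propose, score, n_candidates=8, n_rounds=5):
    """llm_propose(prompt, n) -> list[str]; score(candidate) -> float (e.g. a CLIP similarity)."""
    best_score, best_candidate, feedback = float("-inf"), None, ""
    for _ in range(n_rounds):
        candidates = llm_propose(task_prompt + feedback, n=n_candidates)
        scored = sorted(((score(c), c) for c in candidates), key=lambda sc: sc[0], reverse=True)
        if scored and scored[0][0] > best_score:
            best_score, best_candidate = scored[0]
        # feed the scored candidates back so the next round can refine the best ones
        feedback = "\nPrevious attempts (score, text):\n" + "\n".join(
            f"{s:.3f}\t{c}" for s, c in scored)
    return best_candidate
```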
Authors:Da Chang, Yu Li, Ganzhao Yuan
Abstract:
In the training of large language models (LLMs), updating parameters more efficiently and stably has always been an important challenge. To achieve efficient parameter updates, existing methods usually achieve performance comparable to full parameter updates through techniques such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLMs from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask strength strategy to ensure efficient optimization and theoretical convergence guarantees, which is also applicable to most momentum-based optimizers. Extensive experiments show that AlphaAdam outperforms state-of-the-art methods such as AdamW in terms of convergence speed and computational efficiency across tasks, including GPT-2 pre-trained and fine-tuned RoBERTa and Llama-7B. Our AlphaAdam implements an optimizer enhancement framework for LLMs through intra-layer asynchronous masked adaptive updates. Our code is available at https://github.com/MaeChd/AlphaAdam.
中文: AlphaAdam是一种通过层内异步掩码自适应更新来优化大语言模型训练的框架,相比AdamW等方法,在收敛速度和计算效率上表现更优。
English: AlphaAdam is an optimization framework that enhances large language model training by enabling intra-layer asynchronous masked adaptive updates, improving convergence speed and computational efficiency compared to methods like AdamW.
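A sketch of the masking idea described in the abstract: parameters are masked by sign agreement between the historical momentum and the current gradient, and an Adam-style update is applied with the mask scaled by a strength factor. The exact scoring, scheduling, and hyperparameters are assumptions; this is not the AlphaAdam optimizer.

```python
# Sketch of a momentum/gradient sign-agreement mask inside an Adam-style step (illustrative).
import torch

def masked_adam_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, strength=1.0):
    m, v, t = state["m"], state["v"], state["t"] + 1
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])           # first moment (momentum)
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1]) # second moment

    # mask: update only where historical momentum and current gradient agree in sign
    agree = (torch.sign(m) == torch.sign(grad)).float()
    mask = strength * agree + (1 - strength)                  # adaptive mask strength

    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    p.data.add_(-lr * mask * m_hat / (v_hat.sqrt() + eps))
    state["t"] = t

p = torch.nn.Parameter(torch.randn(10))
state = {"m": torch.zeros(10), "v": torch.zeros(10), "t": 0}
for _ in range(100):
    grad = 2 * (p.data - 1.0)          # closed-form gradient of ||p - 1||^2
    masked_adam_step(p, grad, state)
print(p.data)                           # moves toward 1.0 where updates were unmasked
```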
Authors:Akinori F. Ebihara, Taiki Miyagawa, Kazuyuki Sakurai, Hitoshi Imaoka
Abstract:
Time-sensitive machine learning benefits from Sequential Probability Ratio Test (SPRT), which provides an optimal stopping time for early classification of time series. However, in finite horizon scenarios, where input lengths are finite, determining the optimal stopping rule becomes computationally intensive due to the need for backward induction, limiting practical applicability. We thus introduce FIRMBOUND, an SPRT-based framework that efficiently estimates the solution to backward induction from training data, bridging the gap between optimal stopping theory and real-world deployment. It employs density ratio estimation and convex function learning to provide statistically consistent estimators for sufficient statistic and conditional expectation, both essential for solving backward induction; consequently, FIRMBOUND minimizes Bayes risk to reach optimality. Additionally, we present a faster alternative using Gaussian process regression, which significantly reduces training time while retaining low deployment overhead, albeit with potential compromise in statistical consistency. Experiments across independent and identically distributed (i.i.d.), non-i.i.d., binary, multiclass, synthetic, and real-world datasets show that FIRMBOUND achieves optimalities in the sense of Bayes risk and speed-accuracy tradeoff. Furthermore, it advances the tradeoff boundary toward optimality when possible and reduces decision-time variance, ensuring reliable decision-making. Code is publicly available at https://github.com/Akinori-F-Ebihara/FIRMBOUND
Chinese: FIRMBOUND是一种基于SPRT的高效框架,通过密度比估计和凸函数学习来估计反向归纳解,克服了有限时间序列分类中的计算瓶颈,在多种数据集上实现了最优贝叶斯风险并提升了速度-精度权衡性能。
English: FIRMBOUND is an efficient SPRT-based framework that overcomes computational limitations in finite horizon time series classification by estimating backward induction solutions through density ratio estimation and convex function learning, achieving optimal Bayes risk and improved speed-accuracy tradeoffs across diverse datasets.
Authors:Bartosz Cywiński, Kamil Deja
Abstract:
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concepts and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at https://github.com/cywinski/SAeUron.
中文:SAeUron是一种创新方法,利用稀疏自编码器精确移除文生图扩散模型中的不良概念,在安全性和可解释性上优于现有方法,同时保持模型性能。
English: SAeUron is a novel method that uses sparse autoencoders to precisely remove unwanted concepts in text-to-image diffusion models, outperforming existing approaches in safety and interpretability while maintaining model performance.
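The intervention itself is simple to picture: encode activations with a sparse autoencoder, suppress the latent features attributed to the unwanted concept, and decode back. The SAE weights and feature indices below are untrained placeholders, not the trained SAE or feature-selection method from the paper.

```python
# Toy sketch of feature-level intervention with a sparse autoencoder (illustrative only).
import torch

d_model, d_sae = 512, 4096
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5

def sae_ablate(acts: torch.Tensor, concept_features: torch.Tensor, scale: float = 0.0):
    """acts: (..., d_model); concept_features: indices of SAE latents to suppress."""
    latents = torch.relu(acts @ W_enc + b_enc)        # sparse feature activations
    latents[..., concept_features] *= scale           # block (or dampen) the concept
    return latents @ W_dec                            # edited activations fed back to the model

acts = torch.randn(8, 77, d_model)                    # e.g. per-token activations at one timestep
concept_features = torch.tensor([3, 1021, 2048])      # hypothetical indices chosen by selection
edited = sae_ablate(acts, concept_features)
print(edited.shape)
```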
Authors:Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
Abstract:
Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model $m_{t}$. We first introduce a straightforward target-only rigging strategy that focuses on new battles involving $m_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about $1\%$ of new battles will involve $m_{t}$. To overcome this, we propose omnipresent rigging strategies, exploiting the Elo rating mechanism of Chatbot Arena that any new vote on a battle can influence the ranking of the target model $m_{t}$, even if $m_{t}$ is not directly involved in the battle. We conduct experiments on around $1.7$ million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at https://github.com/sail-sg/Rigging-ChatbotArena.
中文: Chatbot Arena的众包投票系统易受操纵策略影响,通过利用Elo评分机制,即使少量投票也能操控模型排名,凸显了加强防御的必要性。
English: Chatbot Arena's crowdsourced voting system is vulnerable to rigging strategies that can manipulate model rankings by exploiting the Elo rating mechanism, even with minimal vote interference, highlighting the need for stronger defenses.
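The leverage point can be seen directly from an online Elo update: a vote on a battle between two other models still moves their ratings, and therefore the target model's relative rank. The sketch below uses the textbook Elo rule with an assumed K-factor, not the exact rating computation Chatbot Arena runs.

```python
# Minimal online Elo update (illustration of the mechanism the attack exploits).
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 4.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

ratings = {"target": 1200.0, "rival": 1210.0, "filler": 1190.0}
# a vote on a battle that does not involve "target" still changes the leaderboard:
ratings["rival"], ratings["filler"] = elo_update(ratings["rival"], ratings["filler"], a_wins=False)
print(ratings)   # the rival's rating drops, improving the target's relative rank
```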
Authors:Aude Vuilliomenet, Santiago Martínez Balvanera, Oisin Mac Aodha, Kate E. Jones, Duncan Wilson
Abstract:
1. Passive acoustic monitoring (PAM) coupled with artificial intelligence (AI) is becoming an essential tool for biodiversity monitoring. Traditional PAM systems require manual data offloading and impose substantial demands on storage and computing infrastructure. The combination of on-device AI-based processing and network connectivity enables local data analysis and transmission of only relevant information, greatly reducing storage needs. However, programming these devices for robust operation is challenging, requiring expertise in embedded systems and software engineering. Despite the increase in AI-based models for bioacoustics, their full potential remains unrealized without accessible tools to deploy them on custom hardware and tailor device behaviour to specific monitoring goals. 2. To address this challenge, we develop acoupi, an open-source Python framework that simplifies the creation and deployment of smart bioacoustic devices. acoupi integrates audio recording, AI-based data processing, data management, and real-time wireless messaging into a unified and configurable framework. By modularising key elements of the bioacoustic monitoring workflow, acoupi allows users to easily customise, extend, or select specific components to fit their unique monitoring needs. 3. We demonstrate the flexibility of acoupi by integrating two bioacoustic classifiers: BirdNET, for the classification of bird species, and BatDetect2, for the classification of UK bat species. We test the reliability of acoupi over a month-long deployment of two acoupi-powered devices in a UK urban park. 4. acoupi can be deployed on low-cost hardware such as the Raspberry Pi and can be customised for various applications. acoupi standardised framework and simplified tools facilitate the adoption of AI-powered PAM systems for researchers and conservationists. acoupi is on GitHub at https://github.com/acoupi/acoupi.
中文摘要:acoupi开源框架通过集成AI处理和无线通信技术,简化了智能生物声学监测设备的开发,使研究人员能够利用低成本硬件实现可定制的生物多样性监测方案。
English Summary: The acoupi framework simplifies the development of smart bioacoustic monitoring devices by integrating AI processing and wireless communication, enabling customizable and efficient biodiversity tracking using low-cost hardware.
Authors:Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca
Abstract:
We propose a novel Two-Stage framework for Structured Pruning (\textsc{2SSP}) of Large Language Models (LLMs), which combines two different strategies of pruning, namely Width and Depth Pruning. The first stage (Width Pruning) removes entire neurons, hence their corresponding rows and columns, aiming to preserve the connectivity among the pruned structures in the intermediate state of the Feed-Forward Networks in each Transformer block. This is done based on an importance score measuring the impact of each neuron on the output magnitude. The second stage (Depth Pruning), instead, removes entire Attention submodules. This is done by applying an iterative process that removes the Attention with the minimum impact on a given metric of interest (in our case, perplexity). We also propose a novel mechanism to balance the sparsity rate of the two stages w.r.t. the desired global sparsity. We test \textsc{2SSP} on four LLM families and three sparsity rates (25\%, 37.5\%, and 50\%), measuring the resulting perplexity over three language modeling datasets as well as the performance over six downstream tasks. Our method consistently outperforms five state-of-the-art competitors over three language modeling and six downstream tasks, with an up to two-order-of-magnitude gain in terms of pruning time. The code is available at https://github.com/FabrizioSandri/2SSP.
中文: 我们提出了一种新颖的两阶段结构化剪枝框架(2SSP),通过结合宽度剪枝和深度剪枝来缩减大语言模型的规模,在保持性能的同时显著优于现有方法并大幅提升剪枝效率。
English: We introduce a novel Two-Stage Structured Pruning (2SSP) framework for Large Language Models that combines width and depth pruning to reduce model size while maintaining performance, outperforming existing methods with significant efficiency gains.
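A toy sketch of the first (width-pruning) stage on a single FFN block: score each intermediate neuron by the magnitude it contributes to the block output on calibration data, then drop the weakest rows/columns. The scoring formula here is an illustrative assumption; see the 2SSP repository for the exact criterion.

```python
# Sketch of width pruning on a Transformer FFN block (illustrative scoring, not the 2SSP code).
import torch

d_model, d_ff, keep_ratio = 512, 2048, 0.75
W_up = torch.randn(d_model, d_ff) * 0.02      # up projection (columns = intermediate neurons)
W_down = torch.randn(d_ff, d_model) * 0.02    # down projection (rows = intermediate neurons)

calib = torch.randn(256, d_model)             # calibration activations
hidden = torch.relu(calib @ W_up)             # (256, d_ff) intermediate activations

# importance of neuron j: average activation magnitude times its output-row norm
importance = hidden.abs().mean(dim=0) * W_down.norm(dim=1)
keep = importance.topk(int(keep_ratio * d_ff)).indices.sort().values

W_up_pruned, W_down_pruned = W_up[:, keep], W_down[keep, :]
print(W_up_pruned.shape, W_down_pruned.shape)   # (512, 1536) (1536, 512)
```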
Authors:Derui Wang, Kristen Moore, Diksha Goel, Minjune Kim, Gang Li, Yang Li, Robin Doss, Minhui Xue, Bo Li, Seyit Camtepe, Liming Zhu
Abstract:
Deep reinforcement learning (DRL) has gained widespread adoption in control and decision-making tasks due to its strong performance in dynamic environments. However, DRL agents are vulnerable to noisy observations and adversarial attacks, and concerns about the adversarial robustness of DRL systems have emerged. Recent efforts have focused on addressing these robustness issues by establishing rigorous theoretical guarantees for the returns achieved by DRL agents in adversarial settings. Among these approaches, policy smoothing has proven to be an effective and scalable method for certifying the robustness of DRL agents. Nevertheless, existing certifiably robust DRL relies on policies trained with simple Gaussian augmentations, resulting in a suboptimal trade-off between certified robustness and certified return. To address this issue, we introduce a novel paradigm dubbed \texttt{C}ertified-r\texttt{A}dius-\texttt{M}aximizing \texttt{P}olicy (\texttt{CAMP}) training. \texttt{CAMP} is designed to enhance DRL policies, achieving better utility without compromising provable robustness. By leveraging the insight that the global certified radius can be derived from local certified radii based on training-time statistics, \texttt{CAMP} formulates a surrogate loss related to the local certified radius and optimizes the policy guided by this surrogate loss. We also introduce \textit{policy imitation} as a novel technique to stabilize \texttt{CAMP} training. Experimental results demonstrate that \texttt{CAMP} significantly improves the robustness-return trade-off across various tasks. Based on the results, \texttt{CAMP} can achieve up to twice the certified expected return compared to that of baselines. Our code is available at https://github.com/NeuralSec/camp-robust-rl.
中文: 提出的CAMP训练范式通过基于局部认证半径的替代损失优化深度强化学习策略,在保持可证明鲁棒性的同时,将认证期望回报提升至基线方法的两倍。
English: The proposed CAMP training paradigm enhances deep reinforcement learning policies by optimizing a surrogate loss based on local certified radii, achieving superior certified robustness and up to double the certified expected return compared to baseline methods.
Authors:Daesoo Lee, Sara Malacarne, Erlend Aune
Abstract:
In this paper, we introduce Neural Mapper for Vector Quantized Time Series Generator (NM-VQTSG), a novel method aimed at addressing fidelity challenges in vector quantized (VQ) time series generation. VQ-based methods, such as TimeVQVAE, have demonstrated success in generating time series but are hindered by two critical bottlenecks: information loss during compression into discrete latent spaces and deviations in the learned prior distribution from the ground truth distribution. These challenges result in synthetic time series with compromised fidelity and distributional accuracy. To overcome these limitations, NM-VQTSG leverages a U-Net-based neural mapping model to bridge the distributional gap between synthetic and ground truth time series. To be more specific, the model refines synthetic data by addressing artifacts introduced during generation, effectively aligning the distributions of synthetic and real data. Importantly, NM-VQTSG can be used for synthetic time series generated by any VQ-based generative method. We evaluate NM-VQTSG across diverse datasets from the UCR Time Series Classification archive, demonstrating its capability to consistently enhance fidelity in both unconditional and conditional generation tasks. The improvements are evidenced by significant improvements in FID, IS, and conditional FID, additionally backed up by visual inspection in a data space and a latent space. Our findings establish NM-VQTSG as a new method to improve the quality of synthetic time series. Our implementation is available on \url{https://github.com/ML4ITS/TimeVQVAE}.
中文: 本文提出NM-VQTSG方法,通过U-Net神经网络映射模型修正矢量量化时间序列生成中的分布偏差和伪影,在多个数据集上显著提升了生成数据的保真度和分布准确性。
English: This paper presents NM-VQTSG, a U-Net-based neural mapping model that enhances the fidelity of vector quantized time series generation by correcting distributional gaps and artifacts in synthetic data, demonstrating consistent improvements across multiple datasets and metrics.
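The abstract specifies only that the mapping model is "U-Net-based"; the following is a minimal one-level 1D U-Net sketch of the kind of refinement network assumed, taking a synthetic series and returning a residually corrected one (depth, widths, and the residual formulation are illustrative assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyUNet1D(nn.Module):
        """Minimal one-level 1D U-Net that refines a synthetic series residually."""
        def __init__(self, channels=1, width=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Conv1d(channels, width, 3, padding=1), nn.ReLU())
            self.down = nn.Conv1d(width, width * 2, 4, stride=2, padding=1)
            self.mid = nn.Sequential(nn.Conv1d(width * 2, width * 2, 3, padding=1), nn.ReLU())
            self.up = nn.ConvTranspose1d(width * 2, width, 4, stride=2, padding=1)
            self.dec = nn.Conv1d(width * 2, channels, 3, padding=1)

        def forward(self, x):                                  # x: (batch, channels, length)
            e = self.enc(x)
            u = self.up(self.mid(self.down(e)))
            u = F.pad(u, (0, e.shape[-1] - u.shape[-1]))       # align odd-length inputs
            return x + self.dec(torch.cat([u, e], dim=1))      # residual correction

    # mapper = TinyUNet1D(); refined = mapper(synthetic_batch)  # trained to match real series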
Authors:Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Abstract:
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail with up to 100\% leakage ratio, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that: \textbf{it is reckless to treat guardrail moderation as a clutch-at-straws defense against harmful fine-tuning attacks}, as it cannot solve the inherent safety issue of the pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
Chinese: 本研究表明,仅依赖护栏审核过滤有害数据不可靠,因为提出的Virus攻击方法能通过微调有害样本轻松绕过防护,暴露了预训练大语言模型固有的安全隐患。
English: This study reveals that relying solely on guardrail moderation to filter harmful data is unreliable, as the proposed Virus attack method can bypass it by subtly modifying harmful samples, exposing the inherent safety vulnerabilities of pre-trained large language models.
Authors:David Salinas, Omar Swelam, Frank Hutter
Abstract:
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which allows us to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at this repository: https://github.com/geoalgo/judgetuning.
Chinese: 本文提出一种系统性优化方法,通过多目标多保真度优化调整超参数,使基于大语言模型的评估器在采用开源权重模型时,实现了更高的准确率、成本效益及可复现性。
English: This paper introduces a systematic method to optimize LLM-based judges by tuning hyperparameters using multi-objective multi-fidelity optimization, achieving higher accuracy and cost-efficiency with open-weight models for better accessibility.
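The search itself uses multi-objective, multi-fidelity optimization; as a minimal illustration of the accuracy/cost trade-off it targets, here is a plain Pareto-front filter over already-evaluated judge configurations (the 'accuracy' and 'cost' fields are hypothetical placeholders):

    def pareto_front(configs):
        """Keep judge configurations not dominated on (accuracy up, cost down)."""
        front = []
        for c in configs:
            dominated = any(
                o["accuracy"] >= c["accuracy"] and o["cost"] <= c["cost"]
                and (o["accuracy"] > c["accuracy"] or o["cost"] < c["cost"])
                for o in configs
            )
            if not dominated:
                front.append(c)
        return sorted(front, key=lambda c: c["cost"])

    # Example: pick the cheapest judge on the front that meets an accuracy floor.
    # best = next(c for c in pareto_front(results) if c["accuracy"] >= 0.8)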
Authors:Hossein Mirzaei, Ali Ansari, Bahar Dibaei Nia, Mojtaba Nafez, Moein Madadi, Sepehr Rezaee, Zeinab Sadat Taghavi, Arad Maleki, Kian Shamsaie, Mahdi Hajialilue, Jafar Habibi, Mohammad Sabokrou, Mohammad Hossein Rohban
Abstract:
Scanning for trojans (backdoors) in deep neural networks is crucial due to their significant real-world applications. There has been an increasing focus on developing effective general trojan scanning methods across various trojan attacks. Despite advancements, there remains a shortage of methods that perform effectively without preconceived assumptions about the backdoor attack method. Additionally, we have observed that current methods struggle to identify classifiers trojaned using adversarial training. Motivated by these challenges, our study introduces a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples). TRODO leverages the concept of "blind spots"--regions where trojaned classifiers erroneously identify out-of-distribution (OOD) samples as in-distribution (ID). We scan for these blind spots by adversarially shifting OOD samples towards in-distribution. The increased likelihood of perturbed OOD samples being classified as ID serves as a signature for trojan detection. TRODO is both trojan and label mapping agnostic, effective even against adversarially trained trojaned classifiers. It is applicable even in scenarios where training data is absent, demonstrating high accuracy and adaptability across various scenarios and datasets, highlighting its potential as a robust trojan scanning strategy.
中文摘要:本研究提出了一种名为TRODO的新型木马扫描方法,通过检测分布外样本中的对抗性偏移来识别深度神经网络中的后门,无需训练数据即可有效应对各类攻击场景。
English Summary: The study introduces TRODO, a novel trojan scanning method that detects backdoors in deep neural networks by identifying adversarial shifts in out-of-distribution samples, proving effective across various attacks and scenarios without requiring training data.
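A minimal sketch of the scanning signal described above: adversarially push OOD samples toward looking in-distribution and use how easily they flip as the trojan signature. The attack budget, loss, and the 0.9 confidence threshold below are illustrative choices, not the paper's exact recipe:

    import torch
    import torch.nn.functional as F

    def trodo_like_score(model, ood_loader, eps=8/255, alpha=2/255, steps=10, device="cpu"):
        """Fraction of OOD samples that become confidently 'ID' after a small
        adversarial shift; a high fraction suggests a blind spot (possible trojan)."""
        model.eval().to(device)
        flipped, total = 0, 0
        for x, _ in ood_loader:
            x = x.to(device)
            x_adv = x.clone().detach()
            for _ in range(steps):
                x_adv.requires_grad_(True)
                msp = F.softmax(model(x_adv), dim=1).max(dim=1).values
                grad = torch.autograd.grad(msp.sum(), x_adv)[0]   # push toward ID
                x_adv = (x_adv + alpha * grad.sign()).detach()
                x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
            with torch.no_grad():
                msp_adv = F.softmax(model(x_adv), dim=1).max(dim=1).values
            flipped += (msp_adv > 0.9).sum().item()               # illustrative ID threshold
            total += x.size(0)
        return flipped / max(total, 1)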
Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Abstract:
Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
Chinese Summary: 本文提出Mamba-Shedder方法,通过压缩选择性结构化状态空间模型来减少计算开销和模型规模,同时保持精度,实现了高达1.4倍的推理加速。
English Summary: This paper introduces Mamba-Shedder, a method for compressing selective structured state space models to reduce computational overhead and model size while preserving accuracy, achieving up to 1.4x inference speedup.
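The core sensitivity probe can be sketched generically: temporarily bypass one candidate module with an identity on the residual stream, measure the change in a calibration loss, and remove the modules with the smallest impact. The hook-based bypass below assumes each candidate module returns a single tensor of the same shape as its first input; module selection and the loss function are placeholders, not the released tooling:

    import torch

    @torch.no_grad()
    def rank_blocks_by_impact(model, blocks, eval_loss, calib_batches):
        """Score each (name, module) pair by the calibration-loss increase when skipped."""
        base = eval_loss(model, calib_batches)
        impact = {}
        for name, module in blocks:
            # Forward hook returning a value replaces the module's output.
            handle = module.register_forward_hook(lambda m, inp, out: inp[0])
            impact[name] = eval_loss(model, calib_batches) - base
            handle.remove()
        # Lowest-impact blocks are the cheapest candidates to remove.
        return sorted(impact.items(), key=lambda kv: kv[1])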
Authors:Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Ali Ansari, Sepehr Ghobadi, Masoud Hadi, Arshia Soltani Moakhar, Mohammad Azizmalayeri, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Abstract:
In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates ``diverse'' and ``near-distribution'' outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.
中文: 近年来图像异常检测在对抗性场景下表现不佳,为此我们提出RODEO方法,通过文本到图像模型生成多样且接近正常分布的异常样本,有效提升了检测器在对抗环境下的性能。
English: Recent advances in image outlier detection struggle with adversarial scenarios due to insufficient training exposure, prompting the introduction of RODEO, a data-centric method using text-to-image models to generate diverse and near-distribution outliers that significantly boost detector robustness.
Authors:Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, David Rügamer
Abstract:
Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context -- without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://github.com/ArikReuter/ICL_for_Full_Bayesian_Inference.
中文: 本文证明Transformer能够通过上下文学习对统计模型执行完整的贝叶斯推断,无需额外训练即可生成与传统方法质量相当的后验样本。
English: This paper demonstrates that transformers can perform full Bayesian inference for statistical models through in-context learning, producing posterior samples comparable to traditional methods without requiring additional training.
Authors:Yinfeng Gao, Qichao Zhang, Da-wei Ding, Dongbin Zhao
Abstract:
Producing reactive driving behaviors in complex urban environments remains challenging, as road users' intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by constructing a world model that can provide informative states and imagination training. However, a critical limitation of related research lies in scene-level reconstruction representation learning, which may overlook key interactive vehicles and struggles to model the interactive features among vehicles and their long-term intentions. Therefore, this paper presents a novel MBRL method with a predictive individual world model (PIWM) for autonomous driving. PIWM describes the driving environment from an individual-level perspective and captures vehicles' interactive relations and intentions via a trajectory prediction task. Meanwhile, a behavior policy is learned jointly with PIWM. It is trained in PIWM's imagination and effectively navigates urban driving scenes by leveraging intention-aware latent states. The proposed method is trained and evaluated on simulation environments built upon real-world challenging interactive scenarios. Compared with popular model-free and state-of-the-art model-based reinforcement learning methods, experimental results show that the proposed method achieves the best performance in terms of safety and efficiency.
中文摘要:本文提出了一种新颖的基于模型强化学习的方法,通过预测个体世界模型(PIWM)从个体层面捕捉车辆交互关系和意图,在复杂城市驾驶场景中实现了最佳的安全性和效率表现。
English Summary: This paper introduces a novel model-based reinforcement learning method with a Predictive Individual World Model (PIWM) that enhances autonomous driving by capturing vehicle interactions and intentions through trajectory prediction, achieving superior safety and efficiency in urban environments.
Authors:Robert O'Shea, Bipin Rajendran
Abstract:
State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This novel randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Target values for pre-activation membrane potentials are generated layer-wise via nonlinear projections of pre-synaptic inputs and the labels. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets. In few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving significant speed up for training. Interpretation functions defined on local neuronal activity in FP-based models successfully identified clinically salient features for diagnosis in two biomedical datasets. Forward Projection is a computationally efficient machine learning approach that yields interpretable neural network models without retrograde communication of neuronal activity during training.
中文: 前向投影是一种无需反向传播的新型训练方法,通过单次前向传播和闭式回归构建可解释神经网络,在保持竞争力的泛化能力同时显著提升计算效率。
English: Forward Projection is a novel, backpropagation-free training method that uses single-pass forward propagation and closed-form regression to create interpretable neural networks, achieving competitive generalization with significant computational efficiency.
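A minimal numpy sketch of the layer-wise closed-form fit described above: hidden targets come from a fixed nonlinear projection of the pre-synaptic inputs and the labels, and each layer's weights come from a ridge regression onto those targets, with no feedback from downstream layers. The projection scale, activation, and regularization are illustrative assumptions:

    import numpy as np

    def forward_projection_fit(X, Y, widths, lam=1e-2, seed=0):
        """Single-pass, backprop-free layer-wise fit (sketch).
        X: (n, d) inputs, Y: (n, c) one-hot labels."""
        rng = np.random.default_rng(seed)
        weights, H = [], X
        for w in widths:
            R = rng.standard_normal((H.shape[1], w)) / np.sqrt(H.shape[1])
            S = rng.standard_normal((Y.shape[1], w)) / np.sqrt(Y.shape[1])
            T = np.tanh(H @ R + Y @ S)                            # label-informed targets
            W = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ T)
            weights.append(W)
            H = np.tanh(H @ W)                                    # forward to the next layer
        W_out = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)
        weights.append(W_out)
        return weights

    def forward_projection_predict(X, weights):
        H = X
        for W in weights[:-1]:
            H = np.tanh(H @ W)
        return H @ weights[-1]                                    # label predictions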
Authors:Simon Dahan, Gabriel Bénédict, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Robert Leech, Emma C. Robinson
Abstract:
Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of variability of inter-subject cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics by encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice-versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function. Code & pre-trained models will be made available at https://github.com/metrics-lab/sim, processed data for training will be available upon request at https://gin.g-node.org/Sdahan30/sim.
中文摘要:现有脑解码AI模型难以跨个体泛化,本研究通过表面视觉变换器结合多模态对比学习,实现了从未见过的受试者和电影中仅凭大脑活动准确识别刺激内容。
English Summary: Current AI models for brain decoding struggle with generalization across individuals, but this study introduces a surface vision transformer combined with multimodal contrastive learning to enable accurate stimulus retrieval from brain activity, even for unseen participants and movies.
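The tri-modal alignment reduces to pairwise CLIP-style contrastive losses between the fMRI, video, and audio embeddings; a minimal symmetric InfoNCE for one pair is sketched below (the temperature and the upstream projection heads are assumed, not taken from the released code):

    import torch
    import torch.nn.functional as F

    def clip_loss(za, zb, temperature=0.07):
        """Symmetric InfoNCE between two batches of paired embeddings of shape (n, d)."""
        za = F.normalize(za, dim=-1)
        zb = F.normalize(zb, dim=-1)
        logits = za @ zb.t() / temperature
        targets = torch.arange(za.size(0), device=za.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Tri-modal objective over encoder outputs (one plausible combination):
    # loss = clip_loss(z_fmri, z_video) + clip_loss(z_fmri, z_audio) + clip_loss(z_video, z_audio)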
Authors:George Wright, Slawomir Michniewski, Eleanor Jameson, Fayyaz ul Amir Afsar Minhas
Abstract:
Background: Phage therapy shows promise for treating antibiotic-resistant Klebsiella infections. Identifying phage depolymerases that target Klebsiella capsular polysaccharides is crucial, as these capsules contribute to biofilm formation and virulence. However, homology-based searches have limitations in novel depolymerase discovery.
Objective: To develop a machine learning model for identifying and ranking potential phage depolymerases targeting Klebsiella.
Methods: We developed DepoRanker, a machine learning algorithm to rank proteins by their likelihood of being depolymerases. The model was experimentally validated on 5 newly characterized proteins and compared to BLAST.
Results: DepoRanker demonstrated superior performance to BLAST in identifying potential depolymerases. Experimental validation confirmed its predictive ability on novel proteins.
Conclusions: DepoRanker provides an accurate and functional tool to expedite depolymerase discovery for phage therapy against Klebsiella. It is available as a webserver and open-source software.
Availability: Webserver: https://deporanker.dcs.warwick.ac.uk/ Source code: https://github.com/wgrgwrght/deporanker
Chinese: 研究人员开发了DepoRanker机器学习工具,在识别靶向克雷伯菌的噬菌体解聚酶方面优于BLAST方法,为针对耐药感染的噬菌体治疗加速了解聚酶的发现进程。
English: Researchers developed DepoRanker, a machine learning tool that outperforms BLAST in identifying phage depolymerases targeting Klebsiella, accelerating discovery for phage therapy against antibiotic-resistant infections.
Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Abstract:
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
中文: 本文探讨了将低秩适配器与神经架构搜索相结合的方法,以高效微调和压缩大型语言模型,使其在资源受限环境中实现内存减少和推理加速的实际部署。
English: This paper explores combining low-rank adapters with Neural Architecture Search to efficiently fine-tune and compress Large Language Models, enabling their practical deployment in resource-limited settings with reduced memory and faster inference.
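A minimal sketch of the kind of building block that combines low-rank adapters with weight-sharing search: a LoRA-style module whose rank can be sliced at run time, so a super-network can evaluate sub-configurations without retraining. The class name and the elastic-rank mechanism are illustrative assumptions, not the released implementation:

    import torch
    import torch.nn as nn

    class ElasticLoRALinear(nn.Module):
        """Frozen base linear layer plus a low-rank adapter with a sliceable rank."""
        def __init__(self, base: nn.Linear, max_rank=16, alpha=16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                     # only the adapter trains
            self.A = nn.Parameter(torch.randn(max_rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, max_rank))
            self.scale = alpha / max_rank
            self.active_rank = max_rank                     # set by the search controller

        def forward(self, x):
            r = self.active_rank
            delta = x @ self.A[:r].t() @ self.B[:, :r].t()  # rank-r low-rank update
            return self.base(x) + self.scale * delta

    # layer = ElasticLoRALinear(nn.Linear(768, 768)); layer.active_rank = 4  # evaluate a sub-network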
Authors:Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba
中文: Mixture-of-Mamba通过模态感知稀疏性改进了状态空间模型,在文本、图像和语音的多模态预训练中,能以显著降低的计算成本实现同等性能。
English: Mixture-of-Mamba introduces modality-aware sparsity into State Space Models, enabling efficient multi-modal pretraining by achieving comparable performance with significantly reduced computational costs across text, image, and speech tasks.
Authors:Jacopo Di Ventura, Dylan R. Ashley, Vincent Herrmann, Francesco Faccio, Jürgen Schmidhuber
Abstract:
Upside Down Reinforcement Learning (UDRL) is a promising framework for solving reinforcement learning problems which focuses on learning command-conditioned policies. In this work, we extend UDRL to the task of learning a command-conditioned generator of deep neural network policies. We accomplish this using Hypernetworks - a variant of Fast Weight Programmers, which learn to decode input commands representing a desired expected return into command-specific weight matrices. Our method, dubbed Upside Down Reinforcement Learning with Policy Generators (UDRLPG), streamlines comparable techniques by removing the need for an evaluator or critic to update the weights of the generator. To counteract the increased variance in last returns caused by not having an evaluator, we decouple the sampling probability of the buffer from the absolute number of policies in it, which, together with a simple weighting strategy, improves the empirical convergence of the algorithm. Compared with existing algorithms, UDRLPG achieves competitive performance and high returns, sometimes outperforming more complex architectures. Our experiments show that a trained generator can generalize to create policies that achieve unseen returns zero-shot. The proposed method appears to be effective in mitigating some of the challenges associated with learning highly multimodal functions. Altogether, we believe that UDRLPG represents a promising step forward in achieving greater empirical sample efficiency in RL. A full implementation of UDRLPG is publicly available at https://github.com/JacopoD/udrlpg_
Chinese: UDRLPG通过使用超网络扩展了倒置强化学习框架,能够根据指令生成特定神经网络策略,无需评估器即可实现优异性能,并通过解耦采样和加权策略提高了算法收敛性。
English: UDRLPG extends Upside Down Reinforcement Learning by employing Hypernetworks to generate command-specific neural network policies, eliminating the need for an evaluator and achieving competitive performance with improved convergence through decoupled sampling and weighting strategies.
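The central component is a hypernetwork that decodes a command (a desired expected return) into the weights of a policy network; a minimal version for a one-hidden-layer policy is sketched below, with sizes and the embedding network chosen for illustration only:

    from math import prod
    import torch
    import torch.nn as nn

    class PolicyHypernet(nn.Module):
        """Map a scalar command (desired return) to the weights of a small MLP policy."""
        def __init__(self, obs_dim, act_dim, hidden=64, embed=128):
            super().__init__()
            self.shapes = [(hidden, obs_dim), (hidden,), (act_dim, hidden), (act_dim,)]
            n_params = sum(prod(s) for s in self.shapes)
            self.net = nn.Sequential(nn.Linear(1, embed), nn.ReLU(),
                                     nn.Linear(embed, n_params))

        def forward(self, command):                  # command: tensor with one element
            flat = self.net(command.view(1, 1)).squeeze(0)
            params, i = [], 0
            for s in self.shapes:
                n = prod(s)
                params.append(flat[i:i + n].view(*s))
                i += n
            return params                            # [W1, b1, W2, b2]

    def generated_policy(obs, params):
        W1, b1, W2, b2 = params
        return torch.tanh(obs @ W1.t() + b1) @ W2.t() + b2   # action outputs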
Authors:Wenxuan Xie, Fanpu Cao
Abstract:
In recent work on time-series prediction, Transformers and even large language models have garnered significant attention due to their strong capabilities in sequence modeling. However, in practical deployments, time-series prediction often requires operation in resource-constrained environments, such as edge devices, which are unable to handle the computational overhead of large models. To address such scenarios, some lightweight models have been proposed, but they exhibit poor performance on non-stationary sequences. In this paper, we propose $\textit{SWIFT}$, a lightweight model that is not only powerful, but also efficient in deployment and inference for Long-term Time Series Forecasting (LTSF). Our model is based on three key points: (i) Utilizing wavelet transform to perform lossless downsampling of time series. (ii) Achieving cross-band information fusion with a learnable filter. (iii) Using only one shared linear layer or one shallow MLP for sub-series' mapping. We conduct comprehensive experiments, and the results show that $\textit{SWIFT}$ achieves state-of-the-art (SOTA) performance on multiple datasets, offering a promising method for edge computing and deployment in this task. Moreover, it is noteworthy that the number of parameters in $\textit{SWIFT-Linear}$ is only 25\% of what it would be with a single-layer linear model for time-domain prediction. Our code is available at https://github.com/LancelotXWX/SWIFT.
中文: 提出的轻量级模型SWIFT通过小波变换和跨频段融合技术,在长期时间序列预测中实现了最先进的性能,同时仅需极少计算资源,非常适合边缘设备部署。
English: The proposed lightweight model SWIFT achieves state-of-the-art performance for long-term time series forecasting by employing wavelet transform and cross-band fusion while requiring only minimal computational resources, making it ideal for edge deployment.
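A minimal sketch of SWIFT's pipeline on a single variable: a one-level Haar transform splits the input window into approximation and detail bands (the downsampling step), a small learnable filter mixes the bands, and one shared linear layer maps the fused sub-series to the forecast horizon. The band-fusion and normalization choices here are simplifying assumptions:

    import torch
    import torch.nn as nn

    class SwiftLikeForecaster(nn.Module):
        def __init__(self, input_len, pred_len):
            super().__init__()
            assert input_len % 2 == 0, "one-level Haar split assumes an even window"
            half = input_len // 2
            self.band_mix = nn.Linear(2, 2)                 # learnable cross-band filter
            self.head = nn.Linear(half, pred_len)           # single shared mapping layer

        def forward(self, x):                               # x: (batch, input_len)
            even, odd = x[:, 0::2], x[:, 1::2]
            approx = (even + odd) / 2.0                     # Haar approximation band
            detail = (even - odd) / 2.0                     # Haar detail band
            bands = torch.stack([approx, detail], dim=-1)   # (batch, half, 2)
            bands = self.band_mix(bands)                    # fuse information across bands
            fused = bands.mean(dim=-1)                      # (batch, half)
            return self.head(fused)                         # (batch, pred_len)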
Authors:Michael Birsak, John Femiani, Biao Zhang, Peter Wonka
Abstract:
Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static images is challenging because the PBR representation captures the dynamic appearance of materials under varying viewing angles, shapes, and lighting conditions. By extending an Alpha-CLIP-based model on material renderings across diverse shapes and lighting, and encoding multiple viewing conditions for PBR materials, our approach generates descriptors that bridge the domains of PBR representations with photographs or renderings, including LDM outputs. This enables consistent material assignments without requiring explicit knowledge of material relationships between different parts of an object. MatCLIP achieves a top-1 classification accuracy of 76.6%, outperforming state-of-the-art methods such as PhotoShape and MatAtlas by over 15 percentage points on publicly available datasets. Our method can be used to construct material assignments for 3D shape datasets such as ShapeNet, 3DCoMPaT++, and Objaverse. All code and data will be released.
Authors:Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su
Abstract:
Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we characterize the imbalance biases that foundation models inherit and carry into downstream tasks as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, we find that parameter imbalance cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits, during training, unlike data imbalance. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as the confounder, which introduces spurious correlations between input samples and labels. To resolve its negative effects, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about $1.67\%$ on each dataset. Code is available: https://github.com/JiahaoChen1/Pre-train-Imbalance
中文摘要:本文研究了预训练数据不平衡对长尾下游任务的影响,识别出参数不平衡和数据不平衡是关键问题,并提出了一种基于因果学习的后门调整方法,在多个数据集上平均性能提升约1.67%。
English Summary: This paper investigates how imbalances in pre-training data affect long-tailed downstream tasks, identifying parameter and data imbalances as key issues, and proposes a novel causal learning-based backdoor adjustment method that achieves an average performance improvement of about 1.67% across datasets.
Authors:Chengting Yu, Xiaochen Zhao, Lei Liu, Shu Yang, Gaoang Wang, Erping Li, Aili Wang
Abstract:
Spiking Neural Networks (SNNs) are emerging as a brain-inspired alternative to traditional Artificial Neural Networks (ANNs), prized for their potential energy efficiency on neuromorphic hardware. Despite this, SNNs often suffer from accuracy degradation compared to ANNs and face deployment challenges due to fixed inference timesteps, which require retraining for adjustments, limiting operational flexibility. To address these issues, our work considers the spatio-temporal property inherent in SNNs, and proposes a novel distillation framework for deep SNNs that optimizes performance across full-range timesteps without specific retraining, enhancing both efficacy and deployment adaptability. We provide both theoretical analysis and empirical validations to illustrate that training guarantees the convergence of all implicit models across full-range timesteps. Experimental results on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate state-of-the-art performance among distillation-based SNNs training methods. Our code is available at https://github.com/Intelli-Chip-Lab/snn\_temporal\_decoupling\_distillation.
中文: 本研究提出了一种新型的脉冲神经网络蒸馏框架,无需重新训练即可优化全时间步的性能,通过理论和实验验证在多个基准测试中取得了领先水平。
English: This work introduces a novel distillation framework for Spiking Neural Networks (SNNs) that optimizes performance across all timesteps without retraining, achieving state-of-the-art results on multiple benchmarks through theoretical and empirical validation.
Authors:Adil Kaan Akan, Yucel Yemez
Abstract:
We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models, while avoiding their text-centric conditioning bias. We also incorporate an additional guidance loss into our architecture to align cross-attention from adapter layers with slot attention. This enhances the alignment of our model with the objects in the input image without using external supervision. Experimental results show that our method outperforms state-of-the-art techniques in object discovery and image generation tasks across multiple datasets, including those with real images. Furthermore, we demonstrate through experiments that our method performs remarkably well on complex real-world images for compositional generation, in contrast to other slot-based generative methods in the literature. The project page can be found at https://kaanakan.github.io/SlotAdapt/.
Authors:Edoardo Cetin, Tianyu Zhao, Yujin Tang
Abstract:
We propose a new finetuning method to provide pre-trained large language models (LMs) the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is universally applicable to any foundation model pre-trained with a cross-entropy loss and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method is more effective and fully compatible with traditional finetuning approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.
中文: 本研究提出一种微调方法,使预训练大语言模型能够通过扩散框架扩展测试时计算量,在不改变原始模型权重的前提下提升准确性和任务表现。
English: This study introduces a finetuning method that enables pre-trained large language models to scale test-time compute using the diffusion framework, enhancing accuracy and task performance without altering original model weights.
Authors:Muhammad Maaz, Timothy C. Y. Chan
Abstract:
We introduce the problem of formally verifying properties of Markov processes where the parameters are given by the output of machine learning models. For a broad class of machine learning models, including linear models, tree-based models, and neural networks, verifying properties of Markov chains like reachability, hitting time, and total reward can be formulated as a bilinear program. We develop a decomposition and bound propagation scheme for solving the bilinear program and show through computational experiments that our method solves the problem to global optimality up to 100x faster than state-of-the-art solvers. To demonstrate the practical utility of our approach, we apply it to a real-world healthcare case study. Along with the paper, we release markovml, an open-source tool for building Markov processes, integrating pretrained machine learning models, and verifying their properties, available at https://github.com/mmaaz-git/markovml.
Chinese: 本研究提出了一种方法,用于形式化验证由机器学习模型输出参数的马可夫过程属性,将验证问题构建为双线性规划并以比现有求解器快100倍的速度求解,通过医疗案例展示了实用性并发布了开源工具。
English: This study presents a method for formally verifying properties of Markov processes with parameters derived from machine learning models, formulating the verification as a bilinear program and solving it up to 100 times faster than existing solvers, with practical application demonstrated in a healthcare case study and an open-source tool released.
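A hedged sketch of the quantity being verified: the expected discounted total reward of a Markov chain whose transition parameters come from an ML model's output. The verification itself bounds this quantity over an input set via the bilinear program, which is not reproduced here; the toy chain and reward vector below are illustrative:

    import numpy as np

    def expected_total_reward(P, r, gamma=0.95):
        """Expected discounted total reward v solving v = r + gamma * P @ v."""
        n = P.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * P, r)

    # Toy 2-state chain whose 'stay' probability p would come from an ML model,
    # e.g. p = model.predict_proba(x)[0, 1], and range over an input set during verification.
    p = 0.8
    P = np.array([[p, 1 - p],
                  [0.0, 1.0]])                # state 1 is absorbing
    r = np.array([1.0, 0.0])                  # reward collected while in state 0
    print(expected_total_reward(P, r))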
Authors:Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Abstract:
The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and can not efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
Chinese: 本研究提出一种新颖框架,通过多头注意力张量化和塔克分解对权重进行结构化去噪与压缩,无需额外数据或训练即可实现高达250倍的压缩,并显著提升大语言模型的推理能力。
English: This study introduces a novel framework that enhances Large Language Models' reasoning by structurally denoising and compressing Multi-head Attention weights through tensorization and Tucker decomposition, achieving up to 250x compression and improved performance without extra data or training.
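A minimal TensorLy sketch of the core operation: stack per-head attention weight matrices into a higher-order tensor, take a truncated Tucker decomposition (the shared higher-dimensional subspace across heads), and reconstruct the denoised/compressed weights. The stacking layout and the ranks are illustrative, not the paper's exact tensorisation:

    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import tucker

    def denoise_mha_weights(head_weights, ranks=(4, 16, 64)):
        """head_weights: list of per-head matrices, each of shape (d_head, d_model).
        Returns the Tucker-reconstructed stack (structured denoising + compression)."""
        W = tl.tensor(np.stack(head_weights, axis=0))     # (heads, d_head, d_model)
        core, factors = tucker(W, rank=list(ranks))
        return tl.tucker_to_tensor((core, factors))

    # Example with random stand-in weights for 8 heads of a d_model=256 layer:
    heads = [np.random.randn(32, 256) for _ in range(8)]
    W_denoised = denoise_mha_weights(heads, ranks=(4, 16, 64))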
Authors:Vaclav Knapp, Matyas Bohacek
Abstract:
Recent pose-transfer methods aim to generate temporally consistent and fully controllable videos of human action where the motion from a reference video is reenacted by a new identity. We evaluate three state-of-the-art pose-transfer methods -- AnimateAnyone, MagicAnimate, and ExAvatar -- by generating videos with actions and identities outside the training distribution and conducting a participant study about the quality of these videos. In a controlled environment of 20 distinct human actions, we find that participants, presented with the pose-transferred videos, correctly identify the desired action only 42.92% of the time. Moreover, the participants find the actions in the generated videos consistent with the reference (source) videos only 36.46% of the time. These results vary by method: participants find the splatting-based ExAvatar more consistent and photorealistic than the diffusion-based AnimateAnyone and MagicAnimate.
Chinese: 一项针对三种姿态迁移方法的研究显示,参与者仅能识别42.92%的目标动作,对生成视频与参考视频动作一致性的认可度仅为36.46%,其中基于点云渲染的ExAvatar在真实感方面优于扩散模型方法。
English: A study evaluating three pose-transfer methods found that participants correctly identified target actions only 42.92% of the time and perceived motion consistency with reference videos in just 36.46% of cases, with ExAvatar outperforming diffusion-based alternatives in realism.
Authors:Yang Ji, Ying Sun, Yuting Zhang, Zhigaoyuan Wang, Yuanxin Zhuang, Zheng Gong, Dazhong Shen, Chuan Qin, Hengshu Zhu, Hui Xiong
Abstract:
Neural networks have achieved remarkable success across various fields. However, the lack of interpretability limits their practical use, particularly in critical decision-making scenarios. Post-hoc interpretability, which provides explanations for pre-trained models, often falls short in robustness and fidelity. This has inspired a rising interest in self-interpretable neural networks, which inherently reveal the prediction rationale through the model structures. Although there exist surveys on post-hoc interpretability, a comprehensive and systematic survey of self-interpretable neural networks is still missing. To address this gap, we first collect and review existing works on self-interpretable neural networks and provide a structured summary of their methodologies from five key perspectives: attribution-based, function-based, concept-based, prototype-based, and rule-based self-interpretation. We also present concrete, visualized examples of model explanations and discuss their applicability across diverse scenarios, including image, text, graph data, and deep reinforcement learning. Additionally, we summarize existing evaluation metrics for self-interpretability and identify open challenges in this field, offering insights for future research. To support ongoing developments, we present a publicly accessible resource to track advancements in this domain: https://github.com/yangji721/Awesome-Self-Interpretable-Neural-Network.
中文: 本综述系统梳理了自解释神经网络,从五种方法视角分类总结其应用,并评估不同场景下的适用性,同时指出现有挑战与未来研究方向。
English: This survey systematically reviews self-interpretable neural networks, categorizing them into five methodological approaches and evaluating their applications across various data types while identifying current challenges and future research directions.
Authors:Ali Khodabandeh Yalabadi, Mehdi Yazdani-Jahromi, Ozlem Ozmen Garibay
Abstract:
Structure-based drug design (SBDD) leverages the 3D structure of biomolecular targets to guide the creation of new therapeutic agents. Recent advances in generative models, including diffusion models and geometric deep learning, have demonstrated promise in optimizing ligand generation. However, the scarcity of high-quality protein-ligand complex data and the inherent challenges in aligning generated ligands with target proteins limit the effectiveness of these methods. We propose BoKDiff, a novel framework that enhances ligand generation by combining multi-objective optimization and Best-of-K alignment methodologies. Built upon the DecompDiff model, BoKDiff generates diverse candidates and ranks them using a weighted evaluation of molecular properties such as QED, SA, and docking scores. To address alignment challenges, we introduce a method that relocates the center of mass of generated ligands to their docking poses, enabling accurate sub-component extraction. Additionally, we integrate a Best-of-N (BoN) sampling approach, which selects the optimal ligand from multiple generated candidates without requiring fine-tuning. BoN achieves exceptional results, with QED values exceeding 0.6, SA scores above 0.75, and a success rate surpassing 35%, demonstrating its efficiency and practicality. BoKDiff achieves state-of-the-art results on the CrossDocked2020 dataset, including a -8.58 average Vina docking score and a 26% success rate in molecule generation. This study is the first to apply Best-of-K alignment and Best-of-N sampling to SBDD, highlighting their potential to bridge generative modeling with practical drug discovery requirements. The code is provided at https://github.com/khodabandeh-ali/BoKDiff.git.
Chinese: BoKDiff是一种创新框架,它通过结合多目标优化与Best-of-K对齐及Best-of-N采样方法,显著提升了基于结构的药物设计效果,在配体生成中实现了最先进的性能,并展现出优异的分子属性优化效率。
English: BoKDiff is a novel framework that enhances structure-based drug design by combining multi-objective optimization with Best-of-K alignment and Best-of-N sampling, achieving state-of-the-art results in ligand generation and demonstrating high efficiency in molecular property optimization.
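A minimal sketch of the Best-of-N selection step: score each generated candidate with a weighted combination of property evaluators and keep the best, with no fine-tuning. The scorer callables (`qed_fn`, `sa_fn`, `dock_fn`) and the weights are hypothetical placeholders for whatever property oracles are available:

    def best_of_n(candidates, qed_fn, sa_fn, dock_fn, w=(1.0, 1.0, 1.0)):
        """Return the candidate ligand with the best weighted property score.
        qed_fn/sa_fn: higher is better; dock_fn: docking score, lower (more negative)
        is better, so it enters the score with a negative sign."""
        def score(mol):
            return w[0] * qed_fn(mol) + w[1] * sa_fn(mol) - w[2] * dock_fn(mol)
        return max(candidates, key=score)

    # best = best_of_n(generated_ligands, qed_fn=qed, sa_fn=sa_score, dock_fn=vina_score)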
Authors:Soheil Gharatappeh, Salimeh Yasaei Sekeh
Abstract:
Iterative magnitude pruning methods (IMPs), proven to be successful in reducing the number of insignificant nodes in over-parameterized deep neural networks (DNNs), have been getting an enormous amount of attention with the rapid deployment of DNNs into cutting-edge technologies with computation and memory constraints. Despite IMPs' popularity in pruning networks, a fundamental limitation of existing IMP algorithms is the significant training time required for each pruning iteration. Our paper introduces a novel \textit{stopping criterion} for IMPs that monitors information and gradient flows between network layers and minimizes the training time. Information Consistent Pruning (InfCoP) eliminates the need to retrain the network to its original performance during intermediate steps while maintaining overall performance at the end of the pruning process. Through our experiments, we demonstrate that our algorithm is more efficient than current IMPs across multiple dataset-DNN combinations. We also provide theoretical insights into the core idea of our algorithm alongside mathematical explanations of flow-based IMP. Our code is available at \url{https://github.com/Sekeh-Lab/InfCoP}.
中文: 本文提出信息一致性剪枝方法,通过监控网络层间的信息和梯度流来减少训练时间,在剪枝过程中无需完全重训练即可保持性能,显著提升了迭代幅度剪枝的效率。
English: The paper introduces Information Consistent Pruning, a novel stopping criterion for iterative magnitude pruning that reduces training time by monitoring information and gradient flows, maintaining performance without full retraining between steps.
Authors:Romeo Sommerfeld, Christian Helms, Ralf Herbrich
Abstract:
Bayesian neural networks (BNNs) offer the potential for reliable uncertainty quantification and interpretability, which are critical for trustworthy AI in high-stakes domains. However, existing methods often struggle with issues such as overconfidence, hyperparameter sensitivity, and posterior collapse, leaving room for alternative approaches. In this work, we advance message passing (MP) for BNNs and present a novel framework that models the predictive posterior as a factor graph. To the best of our knowledge, our framework is the first MP method that handles convolutional neural networks and avoids double-counting training data, a limitation of previous MP methods that causes overconfidence. We evaluate our approach on CIFAR-10 with a convolutional neural network of roughly 890k parameters and find that it can compete with the SOTA baselines AdamW and IVON, even having an edge in terms of calibration. On synthetic data, we validate the uncertainty estimates and observe a strong correlation (0.9) between posterior credible intervals and their probability of covering the true data-generating function outside the training range. While our method scales to an MLP with 5.6 million parameters, further improvements are necessary to match the scale and performance of state-of-the-art variational inference methods.
中文: 本研究为贝叶斯神经网络提出了一种新颖的消息传递框架,该框架能有效处理卷积网络并避免数据重复计算,在CIFAR-10数据集上展现出与先进基线相当的校准性能,并在合成数据上验证了可靠的不确定性量化能力,同时承认其在扩展性方面仍需改进。
English: This study introduces a novel message passing framework for Bayesian neural networks that effectively handles convolutional networks and prevents data double-counting, achieving competitive calibration on CIFAR-10 and demonstrating reliable uncertainty quantification on synthetic data while acknowledging scalability limitations compared to state-of-the-art methods.
Authors:Oubo Ma, Linkang Du, Yang Dai, Chunyi Zhou, Qingming Li, Yuwen Pu, Shouling Ji
Abstract:
Deep reinforcement learning (DRL) is widely applied to safety-critical decision-making scenarios. However, DRL is vulnerable to backdoor attacks, especially action-level backdoors, which pose significant threats through precise manipulation and flexible activation, risking outcomes like vehicle collisions or drone crashes. The key distinction of action-level backdoors lies in the utilization of the backdoor reward function to associate triggers with target actions. Nevertheless, existing studies typically rely on backdoor reward functions with fixed values or conditional flipping, which lack universality across diverse DRL tasks and backdoor designs, resulting in fluctuations or even failure in practice.
This paper proposes the first universal action-level backdoor attack framework, called UNIDOOR, which enables adaptive exploration of backdoor reward functions through performance monitoring, eliminating the reliance on expert knowledge and grid search. We highlight that action tampering serves as a crucial component of action-level backdoor attacks in continuous action scenarios, as it addresses attack failures caused by low-frequency target actions. Extensive evaluations demonstrate that UNIDOOR significantly enhances the attack performance of action-level backdoors, showcasing its universality across diverse attack scenarios, including single/multiple agents, single/multiple backdoors, discrete/continuous action spaces, and sparse/dense reward signals. Furthermore, visualization results encompassing state distribution, neuron activation, and animations demonstrate the stealthiness of UNIDOOR. The source code of UNIDOOR can be found at https://github.com/maoubo/UNIDOOR.
中文: 本文提出首个通用动作级后门攻击框架UNIDOOR,通过性能监控自适应探索后门奖励函数,在多种攻击场景下显著提升攻击性能并保持隐蔽性。
English: This paper introduces UNIDOOR, the first universal action-level backdoor attack framework for deep reinforcement learning, which adaptively explores backdoor reward functions through performance monitoring and demonstrates superior attack performance across diverse scenarios while maintaining stealthiness.
Authors:Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu
Abstract:
Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free. Code will be available at https://github.com/thu-ml/GFT.
Chinese: 提出的无引导训练(GFT)方法通过从头开始训练视觉模型,无需引导采样即可达到分类器自由引导的性能,同时在多个生成领域中保持相似的多样性-保真度权衡,并将计算成本减半。
English: The proposed Guidance-Free Training (GFT) method eliminates the need for guided sampling by training visual models directly from scratch, matching the performance of Classifier-Free Guidance while halving computational costs and maintaining similar diversity-fidelity trade-offs across multiple generative domains.
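For reference, the standard classifier-free guidance combination that GFT aims to fold into a single conditional model is sketched below; only the two-model sampling cost it removes is shown, not GFT's training parameterization:

    def cfg_noise_prediction(eps_cond, eps_uncond, guidance_scale=3.0):
        """Standard CFG: extrapolate from the unconditional toward the conditional
        prediction. This needs two forward passes per sampling step; GFT instead
        trains one model whose single prediction plays this role directly."""
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)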
Authors:Junan Zhang, Jing Yang, Zihao Fang, Yuancheng Wang, Zehua Zhang, Zhuo Wang, Fan Fan, Zhizheng Wu
Abstract:
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance/.
中文:AnyEnhance是一个统一的生成模型,通过提示引导的上下文学习和迭代自优化机制,能够同时处理语音和歌声的多种增强任务,在各项测试中均优于现有方法。
English: AnyEnhance is a unified generative model that simultaneously handles multiple voice enhancement tasks for both speech and singing voices through prompt-guided in-context learning and iterative self-refinement, outperforming existing methods across various benchmarks.
Authors:Hao Shu, Jicheng Li, Yu Jin, Hailin Wang
Abstract:
In recent years, the prediction of multidimensional time series data has become increasingly important due to its wide-ranging applications. Tensor-based prediction methods have gained attention for their ability to preserve the inherent structure of such data. However, existing approaches, such as tensor autoregression and tensor decomposition, have consistently failed to provide clear assertions regarding the number of samples that can be exactly predicted. While matrix-based methods using nuclear norms address this limitation, their reliance on matrices limits accuracy and increases computational costs when handling multidimensional data. To overcome these challenges, we reformulate multidimensional time series prediction as a deterministic tensor completion problem and propose a novel theoretical framework. Specifically, we develop a deterministic tensor completion theory and introduce the Temporal Convolutional Tensor Nuclear Norm (TCTNN) model. By convolving the multidimensional time series along the temporal dimension and applying the tensor nuclear norm, our approach identifies the maximum forecast horizon for exact predictions. Additionally, TCTNN achieves superior performance in prediction accuracy and computational efficiency compared to existing methods across diverse real-world datasets, including climate temperature, network flow, and traffic ride data. Our implementation is publicly available at https://github.com/HaoShu2000/TCTNN.
中文摘要:本研究提出了一种新颖的张量补全框架,通过时序卷积张量核范数(TCTNN)确定精确预测范围,并在多维时间序列预测中实现了更优的精度和计算效率。
English Summary: This study introduces a novel tensor completion framework using Temporal Convolutional Tensor Nuclear Norm (TCTNN) to determine exact prediction horizons and achieve superior accuracy and efficiency in multidimensional time series forecasting.
Authors:Chuanyang Zheng
Abstract:
We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, \textit{i.e.}, ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4\% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.
中文: iFormer是一种新型移动混合视觉网络,通过结合卷积的局部处理能力和自注意力的全局建模优势,在移动设备上实现了卓越的精度与低延迟,在ImageNet分类和COCO目标检测等任务中超越了现有模型。
English: iFormer is a novel mobile hybrid vision network that integrates convolution's local processing with self-attention's global modeling to achieve superior accuracy and low latency on mobile devices, outperforming existing models in tasks like ImageNet classification and COCO object detection.
Authors:Zhikai Chen, Han Xie, Jian Zhang, Xiang song, Jiliang Tang, Huzefa Rangwala, George Karypis
Abstract:
Recent years have witnessed significant advancements in graph machine learning (GML), with its applications spanning numerous domains. However, the focus of GML has predominantly been on developing powerful models, often overlooking a crucial initial step: constructing suitable graphs from common data formats, such as tabular data. This construction process is fundamental to applying graph-based models, yet it remains largely understudied and lacks formalization. Our research aims to address this gap by formalizing the graph construction problem and proposing an effective solution. We identify two critical challenges to achieve this goal: 1. The absence of dedicated datasets to formalize and evaluate the effectiveness of graph construction methods, and 2. Existing automatic construction methods can only be applied to some specific cases, while tedious human engineering is required to generate high-quality graphs. To tackle these challenges, we present a two-fold contribution. First, we introduce a set of datasets to formalize and evaluate graph construction methods. Second, we propose an LLM-based solution, AutoG, automatically generating high-quality graph schemas without human intervention. The experimental results demonstrate that the quality of constructed graphs is critical to downstream task performance, and AutoG can generate high-quality graphs that rival those produced by human experts. Our code can be accessible from https://github.com/amazon-science/Automatic-Table-to-Graph-Generation.
Chinese: 近年来图机器学习虽发展迅速,却忽视了从表格数据构建图结构这一关键基础步骤;本研究通过提出AutoG这一基于大语言模型的解决方案,能够自动生成媲美人工设计的高质量图谱,有效填补了该领域的研究空白。
English: Recent graph machine learning advancements have overlooked the critical step of constructing graphs from tabular data, prompting this research to formalize the problem and introduce AutoG, an LLM-based solution that automatically generates high-quality graphs rivaling human expertise.
Authors:Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Zeinab Sadat Taghavi, Mohammad Sabokrou, Mohammad Hossein Rohban
Abstract:
Novelty Detection (ND) plays a crucial role in machine learning by identifying new or unseen data during model inference. This capability is especially important for the safe and reliable operation of automated systems. Despite advances in this field, existing techniques often fail to maintain their performance when subject to adversarial attacks. Our research addresses this gap by marrying the merits of nearest-neighbor algorithms with robust features obtained from models pretrained on ImageNet. We focus on enhancing the robustness and performance of ND algorithms. Experimental results demonstrate that our approach significantly outperforms current state-of-the-art methods across various benchmarks, particularly under adversarial conditions. By incorporating robust pretrained features into the k-NN algorithm, we establish a new standard for performance and robustness in the field of robust ND. This work opens up new avenues for research aimed at fortifying machine learning systems against adversarial vulnerabilities. Our implementation is publicly available at https://github.com/rohban-lab/ZARND.
中文摘要:本研究通过将鲁棒的ImageNet预训练特征与k近邻算法相结合,显著提升了机器学习中新颖性检测在对抗攻击下的性能,为鲁棒性设立了新标准。
English Summary: Our research enhances novelty detection in machine learning by integrating robust ImageNet-pretrained features with k-NN algorithms, significantly outperforming existing methods under adversarial attacks and setting a new benchmark for robustness.
Authors:Pauline Bourigault, Danilo P. Mandic
Abstract:
We present a novel approach to anomaly detection by integrating Generalized Hyperbolic (GH) processes into kernel-based methods. The GH distribution, known for its flexibility in modeling skewness, heavy tails, and kurtosis, helps to capture complex patterns in data that deviate from Gaussian assumptions. We propose a GH-based kernel function and utilize it within Kernel Density Estimation (KDE) and One-Class Support Vector Machines (OCSVM) to develop anomaly detection frameworks. Theoretical results confirmed the positive semi-definiteness and consistency of the GH-based kernel, ensuring its suitability for machine learning applications. Empirical evaluation on synthetic and real-world datasets showed that our method improves detection performance in scenarios involving heavy-tailed and asymmetric or imbalanced distributions. https://github.com/paulinebourigault/GHKernelAnomalyDetect
中文: 本研究提出了一种新颖的异常检测方法,通过将广义双曲过程融入基于核的技术,在处理重尾和不对称数据分布时显著提升了检测性能。
English: This study introduces a novel anomaly detection method that integrates Generalized Hyperbolic processes into kernel-based approaches, enhancing performance in handling heavy-tailed and asymmetric data distributions.
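The exact form of the GH-based kernel is not reproduced here. The sketch below only illustrates the mechanics of plugging a custom kernel into scikit-learn's One-Class SVM via a precomputed Gram matrix, which is the same interface a GH kernel would use; the heavy-tailed Laplacian stand-in kernel and all parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def heavy_tailed_kernel(X, Y, gamma=0.5):
    """Stand-in heavy-tailed kernel (Laplacian); a GH-based kernel would replace this."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise L2 distances
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X_train = rng.standard_t(df=3, size=(200, 2))             # heavy-tailed "normal" data
X_test = np.vstack([rng.standard_t(df=3, size=(20, 2)),
                    rng.uniform(8, 10, size=(5, 2))])      # last 5 rows are anomalies

K_train = heavy_tailed_kernel(X_train, X_train)
K_test = heavy_tailed_kernel(X_test, X_train)

ocsvm = OneClassSVM(kernel="precomputed", nu=0.05).fit(K_train)
scores = ocsvm.decision_function(K_test)                   # lower scores -> more anomalous
print(scores.round(3))
```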
Authors:Aitor Sánchez-Ferrera, Borja Calvo, Jose A. Lozano
Abstract:
Time series anomaly detection presents various challenges due to the sequential and dynamic nature of time-dependent data. Traditional unsupervised methods frequently encounter difficulties in generalization, often overfitting to known normal patterns observed during training and struggling to adapt to unseen normality. In response to this limitation, self-supervised techniques for time series have garnered attention as a potential solution to overcome this obstacle and enhance the performance of anomaly detectors. This paper presents a comprehensive review of the recent methods that make use of self-supervised learning for time series anomaly detection. A taxonomy is proposed to categorize these methods based on their primary characteristics, facilitating a clear understanding of their diversity within this field. The information contained in this survey, along with additional details that will be periodically updated, is available on the following GitHub repository: https://github.com/Aitorzan3/Awesome-Self-Supervised-Time-Series-Anomaly-Detection.
中文: 本文综述了自监督学习在时间序列异常检测中的应用,提出了分类法以系统归类现有方法,旨在克服传统无监督方法的泛化不足问题。
English: This paper reviews self-supervised learning methods for time series anomaly detection, proposing a taxonomy to categorize them and addressing the limitations of traditional unsupervised approaches.
Authors:Zhihao Yao, Jixuan Yin, Bo Li
Abstract:
Short text clustering has gained significant attention in the data mining community. However, the limited valuable information contained in short texts often leads to low-discriminative representations, increasing the difficulty of clustering. This paper proposes a novel short text clustering framework, called Reliable \textbf{P}seudo-labeling via \textbf{O}ptimal \textbf{T}ransport with \textbf{A}ttention for Short Text Clustering (\textbf{POTA}), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, \textbf{POTA} first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a semantic consistency regularization term into an optimal transport problem. By solving this OT problem, we can yield reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT can adaptively estimate cluster distributions, making \textbf{POTA} well-suited for datasets with varying degrees of imbalance. Then, we utilize the pseudo-labels to guide contrastive learning to generate discriminative representations and achieve efficient clustering. Extensive experiments demonstrate that \textbf{POTA} outperforms state-of-the-art methods. The code is available at: \href{https://github.com/YZH0905/POTA-STC/tree/main}{https://github.com/YZH0905/POTA-STC/tree/main}.
中文: 本文提出POTA短文本聚类框架,通过结合注意力机制与最优传输生成可靠伪标签,提升表征区分度,在非平衡数据集上表现优异。
English: This paper introduces POTA, a novel short text clustering framework that uses optimal transport with attention to generate reliable pseudo-labels, enhancing discriminative representation learning and achieving superior performance on imbalanced datasets.
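POTA additionally injects an attention-based semantic-consistency regularizer into its transport problem, which is omitted here. The minimal sketch below shows only the core mechanism it builds on: turning sample-to-cluster similarities into balanced soft pseudo-labels by solving an entropically regularized optimal-transport problem with Sinkhorn iterations. The uniform cluster marginal and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def sinkhorn_pseudo_labels(logits, cluster_prior=None, eps=0.05, n_iters=50):
    """Soft pseudo-labels from sample-to-cluster logits via entropic OT (Sinkhorn)."""
    n, k = logits.shape
    K = np.exp(logits / eps)                        # Gibbs kernel of the cost/similarity
    a = np.full(n, 1.0 / n)                         # uniform marginal over samples
    b = cluster_prior if cluster_prior is not None else np.full(k, 1.0 / k)
    u, v = np.ones(n), np.ones(k)
    for _ in range(n_iters):                        # alternate scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                 # transport plan; row i sums to 1/n
    return P / P.sum(axis=1, keepdims=True)         # normalize rows -> per-sample labels

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 3))                    # 8 samples, 3 clusters
print(sinkhorn_pseudo_labels(logits).round(2))
```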
Authors:Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, Yuxin Peng
Abstract:
Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR (object information extraction, category knowledge reserve, and object-category alignment) and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives, naturally bringing representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
Chinese: 本研究提出了Finedefics多模态大语言模型,通过引入物体属性描述并采用包含困难负样本的对比学习,显著提升了细粒度视觉识别能力,在多个数据集上展现出优越性能。
English: The study introduces Finedefics, a multi-modal large language model that enhances fine-grained visual recognition by incorporating object attribute descriptions and using contrastive learning with hard negatives, achieving superior performance on multiple datasets.
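Finedefics' full training recipe is not shown in the abstract. As a minimal sketch of the contrastive component it describes, the snippet below computes an InfoNCE-style loss that pulls an object embedding toward its matching attribute/category text embedding and pushes it away from embeddings of similar-but-incorrect categories used as hard negatives; shapes, the temperature, and the function name are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(obj_emb, pos_emb, hard_neg_emb, temperature=0.07):
    """
    obj_emb:      (B, D)    visual object embeddings
    pos_emb:      (B, D)    matching attribute/category text embeddings
    hard_neg_emb: (B, M, D) embeddings of similar-but-incorrect categories
    """
    obj = F.normalize(obj_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(hard_neg_emb, dim=-1)

    pos_sim = (obj * pos).sum(-1, keepdim=True)               # (B, 1)
    neg_sim = torch.einsum("bd,bmd->bm", obj, neg)            # (B, M)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    targets = torch.zeros(obj.size(0), dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits, targets)

loss = contrastive_with_hard_negatives(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 5, 64))
print(loss.item())
```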
Authors:Ziqi Liu
Abstract:
Long-term time series forecasting is essential in areas like finance and weather prediction. Besides traditional methods that operate in the time domain, many recent models transform time series data into the frequency domain to better capture complex patterns. However, these methods often use filtering techniques to remove certain frequency signals as noise, which may unintentionally discard important information and reduce prediction accuracy. To address this, we propose the Frequency Decomposition Mixture-of-Experts (FreqMoE) model, which dynamically decomposes time series data into frequency bands, each processed by a specialized expert. A gating mechanism adjusts the importance of each expert's output based on frequency characteristics, and the aggregated results are fed into a prediction module that iteratively refines the forecast using residual connections. Our experiments demonstrate that FreqMoE outperforms state-of-the-art models, achieving the best performance on 51 out of 70 metrics across all tested datasets, while significantly reducing the number of required parameters to under 50k, providing notable efficiency advantages. Code is available at: https://github.com/sunbus100/FreqMoE-main
中文: FreqMoE模型通过动态分解时序数据至频段并由专家处理,在多数指标上超越现有最优模型,同时显著提升了预测精度与参数效率。
English: The FreqMoE model dynamically decomposes time series into frequency bands processed by specialized experts, outperforming state-of-the-art methods with superior accuracy and efficiency across multiple datasets.
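The released FreqMoE implementation is not reproduced here. The snippet is a minimal sketch, under simplifying assumptions, of the idea the abstract describes: decompose a series into frequency bands with an FFT, forecast each band with a small expert, and mix the experts' outputs with a learned gate (the residual refinement module is omitted).

```python
import torch
import torch.nn as nn

class TinyFreqMoE(nn.Module):
    def __init__(self, seq_len, pred_len, n_bands=4):
        super().__init__()
        self.n_freq = seq_len // 2 + 1                        # length of the rFFT spectrum
        self.bands = torch.chunk(torch.arange(self.n_freq), n_bands)
        self.experts = nn.ModuleList([nn.Linear(seq_len, pred_len) for _ in range(len(self.bands))])
        self.gate = nn.Linear(seq_len, len(self.bands))

    def forward(self, x):                                     # x: (B, seq_len)
        spec = torch.fft.rfft(x, dim=-1)                      # complex spectrum, (B, n_freq)
        outs = []
        for idx, expert in zip(self.bands, self.experts):
            masked = torch.zeros_like(spec)
            masked[:, idx] = spec[:, idx]                     # keep only this frequency band
            band_signal = torch.fft.irfft(masked, n=x.size(-1), dim=-1)
            outs.append(expert(band_signal))                  # per-band forecast, (B, pred_len)
        weights = torch.softmax(self.gate(x), dim=-1)         # (B, n_bands) gate over experts
        return torch.einsum("bk,bkp->bp", weights, torch.stack(outs, dim=1))

model = TinyFreqMoE(seq_len=96, pred_len=24)
print(model(torch.randn(8, 96)).shape)                        # torch.Size([8, 24])
```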
Authors:Zihang Li, Yangdong Ruan, Wenjun Liu, Zhengyang Wang, Tong Yang
Abstract:
Although retrieval-augmented generation (RAG) significantly improves generation quality by retrieving external knowledge bases and integrating generated content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks that involve hierarchical structures, as in Tree-RAG. This paper proposes a Tree-RAG acceleration method based on the improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG effectively organizes entities through the introduction of a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. The experimental results demonstrate that our method is much faster than naive Tree-RAG while maintaining high levels of generative quality. When the number of trees is large, our method is hundreds of times faster than naive Tree-RAG. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.
中文: 本文提出了一种基于改进布谷鸟过滤器的Tree-RAG加速方法,通过优化实体定位显著提升检索效率,在保持生成质量的同时,处理大规模树结构时速度可提升数百倍。
English: This paper introduces an acceleration method for Tree-RAG using an enhanced Cuckoo Filter, which significantly boosts retrieval efficiency by optimizing entity localization while preserving generation quality, achieving up to hundreds of times faster performance with large tree structures.
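The paper's improved Cuckoo Filter includes Tree-RAG-specific modifications not detailed in the abstract. The sketch below is a plain partial-key cuckoo filter, enough to show why membership queries stay fast: each entity fingerprint can live in only one of two buckets, so a lookup inspects at most two small buckets regardless of how many entities are stored. Bucket count, fingerprint size, and hash choices are illustrative.

```python
import hashlib
import random

class CuckooFilter:
    """Simplified partial-key cuckoo filter (num_buckets must be a power of two)."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        return hashlib.sha1(item.encode()).hexdigest()[:4]    # short fingerprint

    def _index(self, item):
        return int(hashlib.md5(item.encode()).hexdigest(), 16) % self.num_buckets

    def _alt_index(self, index, fp):
        # XOR trick: with a power-of-two table, alt(alt(i)) == i for the same fingerprint.
        return (index ^ int(hashlib.md5(fp.encode()).hexdigest(), 16)) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))                            # both buckets full: evict
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt_index(i, fp)                         # relocate victim to its other bucket
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                                           # table is effectively full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

cf = CuckooFilter()
for entity in ["Alan Turing", "Ada Lovelace", "Kurt Goedel"]:
    cf.insert(entity)
print(cf.contains("Ada Lovelace"), cf.contains("Grace Hopper"))  # True False (with high probability)
```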
Authors:Bao Duong, Sunil Gupta, Thin Nguyen
Abstract:
Existing score-based methods for directed acyclic graph (DAG) learning from observational data struggle to recover the causal graph accurately and sample-efficiently. To overcome this, in this study, we propose DrBO (DAG recovery via Bayesian Optimization), a novel DAG learning framework leveraging Bayesian optimization (BO) to find high-scoring DAGs. We show that, by carefully choosing which promising DAGs to explore, we can find higher-scoring ones much more efficiently. To address the scalability issues of conventional BO in DAG learning, we replace Gaussian Processes commonly employed in BO with dropout neural networks, trained in a continual manner, which allows for (i) flexibly modeling the DAG scores without overfitting, (ii) incorporation of uncertainty into the estimated scores, and (iii) scaling with the number of evaluations. As a result, DrBO is computationally efficient and can find the accurate DAG in fewer trials and less time than existing state-of-the-art methods. This is demonstrated through an extensive set of empirical evaluations on many challenging settings with both synthetic and real data. Our implementation is available at https://github.com/baosws/DrBO.
中文: 本研究提出DrBO,一种利用贝叶斯优化和dropout神经网络的新型有向无环图学习框架,能高效准确地从观测数据中恢复因果图,在速度和精度上均优于现有方法。
English: This study introduces DrBO, a novel DAG learning framework that uses Bayesian optimization with dropout neural networks to efficiently and accurately recover causal graphs from observational data, outperforming existing methods in both speed and accuracy.
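DrBO's surrogate scores candidate DAGs; the sketch below illustrates only the dropout-network trick the abstract highlights, independently of any DAG encoding: keep dropout active at prediction time, average several stochastic forward passes to obtain a mean score and an uncertainty estimate, and combine them with an acquisition rule. The simple UCB rule, network shape, and candidate encoding here are assumptions, not the paper's exact choices.

```python
import torch
import torch.nn as nn

class DropoutSurrogate(nn.Module):
    def __init__(self, dim, hidden=64, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

@torch.no_grad()
def mc_dropout_ucb(model, candidates, n_samples=32, beta=2.0):
    model.train()                                      # keep dropout active at prediction time
    preds = torch.stack([model(candidates) for _ in range(n_samples)])
    mean, std = preds.mean(dim=0), preds.std(dim=0)    # predictive mean and uncertainty
    return mean + beta * std                           # upper-confidence-bound acquisition

surrogate = DropoutSurrogate(dim=16)
candidates = torch.randn(100, 16)                      # e.g. feature encodings of candidate DAGs
acq = mc_dropout_ucb(surrogate, candidates)
print("next candidate to evaluate:", acq.argmax().item())
```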
Authors:Hongbo Zheng, Suyuan Wang, Neeraj Gangwar, Nickvash Kani
Abstract:
Vector representations have been pivotal in advancing natural language processing (NLP), with prior research focusing on embedding techniques for mathematical expressions using mathematically equivalent formulations. While effective, these approaches are constrained by the size and diversity of training data. In this work, we address these limitations by introducing E-Gen, a novel e-graph-based dataset generation scheme that synthesizes large and diverse mathematical expression datasets, surpassing prior methods in size and operator variety. Leveraging this dataset, we train embedding models using two strategies: (1) generating mathematically equivalent expressions, and (2) contrastive learning to explicitly group equivalent expressions. We evaluate these embeddings on both in-distribution and out-of-distribution mathematical language processing tasks, comparing them against prior methods. Finally, we demonstrate that our embedding-based approach outperforms state-of-the-art large language models (LLMs) on several tasks, underscoring the necessity of optimizing embedding methods for the mathematical data modality. The source code and datasets are available at https://github.com/MLPgroup/E-Gen.
中文: 本研究提出E-Gen,一种基于e-图的数据生成方法,能创建大规模多样化的数学表达式数据集,用于训练嵌入模型,在多项数学处理任务中超越了现有技术和大型语言模型的性能。
English: This study introduces E-Gen, an e-graph-based method that generates large, diverse mathematical expression datasets to train embedding models, which outperform existing techniques and large language models on various mathematical processing tasks.
Authors:Juan Ramirez, Ignacio Hounie, Juan Elenter, Jose Gallego-Posada, Meraj Hashemizadeh, Alejandro Ribeiro, Simon Lacoste-Julien
Abstract:
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance on every individual data point. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
Chinese: 可行学习(FL)是一种以样本为中心的学习范式,通过限制每个数据点的损失来训练模型,相较于经验风险最小化,它在保持平均性能的同时显著改善了尾部表现。
English: Feasible Learning (FL) is a sample-centric paradigm that trains models by bounding the loss for each individual data point, leading to improved tail performance with minimal impact on average results compared to Empirical Risk Minimization.
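As a minimal sketch of the primal-dual dynamics the abstract describes (not the authors' implementation), the snippet below keeps one non-negative multiplier per training sample, raises it whenever that sample's loss exceeds the prescribed bound, and uses the multipliers to re-weight the primal update. The model, bound, and learning rates are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

epsilon = 0.5                               # per-sample loss bound (illustrative)
dual_lr = 0.1
lmbda = torch.zeros(len(X))                 # one non-negative multiplier per sample

for step in range(200):
    per_sample_loss = ((model(X) - y) ** 2).squeeze(-1)        # (N,)
    # Primal step: descend on the multiplier-weighted per-sample losses.
    primal_loss = (lmbda * per_sample_loss).mean()
    opt.zero_grad()
    primal_loss.backward()
    opt.step()
    # Dual ascent: raise lambda_i when sample i violates its bound, then project to >= 0
    # (uses the losses measured before the primal step, which suffices for a sketch).
    with torch.no_grad():
        lmbda = (lmbda + dual_lr * (per_sample_loss - epsilon)).clamp_min(0.0)

with torch.no_grad():
    final_loss = ((model(X) - y) ** 2).squeeze(-1)
print(f"samples above the bound: {(final_loss > epsilon).sum().item()} / {len(X)}")
```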
Authors:Michael K. Chen, Xikun Zhang, Dacheng Tao
Abstract:
Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic
中文摘要:JustLogic基准通过提供更高的复杂性、独立于先验知识以及深入错误分析能力,解决了现有大语言模型演绎推理评估的不足,实验表明推理型大语言模型仅达到人类平均水平,远未达到人类最佳水平。
English Summary: The JustLogic benchmark addresses limitations in existing deductive reasoning evaluations for LLMs by offering enhanced complexity, independence from prior knowledge, and detailed error analysis capabilities, revealing that while reasoning LLMs match average human performance, they fall short of peak human ability.
Authors:Libo Wang
Abstract:
To address the gap in current large language models' ability to share memory across dialogues, this research proposes a wormhole memory module (WMM) that treats memory as a Rubik's cube whose contents can be arbitrarily retrieved across different dialogues. Through simulation experiments, the researcher built an experimental framework in a Python environment and used memory barriers to simulate the current situation in which memories are difficult to share between LLM dialogues. The CoQA development set was imported into the experiment, the feasibility of WMM's cross-dialogue memory retrieval based on nonlinear indexing and dynamic retrieval was verified, and a comparative analysis was conducted against the Titans and MemGPT memory modules. Experimental results show that WMM retrieved memory across dialogues and kept its quantitative indicators stable over eight experiments. This work contributes new technical approaches to optimizing LLM memory management and provides experience for future practical applications.
Chinese: 本研究提出了一种虫洞记忆模块(WMM),实现了大语言模型跨对话记忆检索功能,实验验证了其可行性和稳定性,为优化记忆管理提供了新的技术途径。
English: This research introduces a wormhole memory module (WMM) that enables cross-dialogue memory retrieval in large language models, demonstrating its feasibility and stability through experiments and offering new technical approaches for memory management optimization.
Authors:Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen
Abstract:
Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 300 patient-specific clinically-derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase. The environment uses the standard APIs and communication infrastructure used in modern EMR systems, so it can be easily migrated into live EMR systems. MedAgentBench presents an unsaturated agent-oriented benchmark that current state-of-the-art LLMs exhibit some ability to succeed at. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%. However, there is still substantial space for improvement which gives the community a next direction to optimize. Furthermore, there is significant variation in performance across task categories. MedAgentBench establishes this and is publicly available at https://github.com/stanfordmlgroup/MedAgentBench , offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.
中文: 近期大语言模型已从聊天机器人发展为智能体,但缺乏医疗领域的标准化评测基准,为此我们推出MedAgentBench——一个包含临床任务和真实患者数据的综合评估平台,旨在衡量并提升大语言模型在医疗环境中的智能体能力。
English: Recent large language models have advanced beyond chatbots to act as agents, yet lack standardized medical benchmarks, prompting the introduction of MedAgentBench—a comprehensive evaluation suite with clinically derived tasks and realistic patient data to assess and improve LLM agent capabilities in healthcare.
Authors:Jiazhen Zhang, Yuexi Du, Nicha C. Dvornek, John A. Onofrey
Abstract:
Automated segmentation plays a pivotal role in medical image analysis and computer-assisted interventions. Despite the promising performance of existing methods based on convolutional neural networks (CNNs), they neglect useful equivariant properties for images, such as rotational and reflection equivariance. This limitation can decrease performance and lead to inconsistent predictions, especially in applications like vessel segmentation where explicit orientation is absent. While existing equivariant learning approaches attempt to mitigate these issues, they substantially increase learning cost, model size, or both. To overcome these challenges, we propose a novel application of an efficient symmetric rotation-equivariant (SRE) convolutional (SRE-Conv) kernel implementation to the U-Net architecture, to learn rotation and reflection-equivariant features, while also reducing the model size dramatically. We validate the effectiveness of our method through improved segmentation performance on retina vessel fundus imaging. Our proposed SRE U-Net not only significantly surpasses standard U-Net in handling rotated images, but also outperforms existing equivariant learning methods and does so with a reduced number of trainable parameters and smaller memory cost. The code is available at https://github.com/OnofreyLab/sre_conv_segm_isbi2025.
中文摘要:本研究提出的SRE U-Net在U-Net架构中引入了高效的对称旋转等变卷积核,不仅显著提升了医学图像旋转分割性能,还大幅减少了模型参数量和计算成本,优于现有等变学习方法。
English Summary: The proposed SRE U-Net introduces an efficient symmetric rotation-equivariant convolutional kernel to the U-Net architecture, achieving superior segmentation performance on rotated medical images while significantly reducing model size and computational costs compared to existing methods.
Authors:Panisara Meehinkong, Donlapark Ponnoprat
Abstract:
Conformal prediction provides a framework for uncertainty quantification, specifically in the forms of prediction intervals and sets with distribution-free guaranteed coverage. While recent cross-conformal techniques such as CV+ and Jackknife+-after-bootstrap achieve better data efficiency than traditional split conformal methods, they incur substantial computational costs due to required pairwise comparisons between training and test samples' out-of-bag scores. Observing that these methods naturally extend from ensemble models, particularly random forests, we leverage existing optimized random forest implementations to enable efficient cross-conformal predictions.
We present coverforest, a Python package that implements efficient conformal prediction methods specifically optimized for random forests. coverforest supports both regression and classification tasks through various conformal prediction methods, including split conformal, CV+, Jackknife+-after-bootstrap, and adaptive prediction sets. Our package leverages parallel computing and Cython optimizations to speed up out-of-bag calculations. Our experiments demonstrate that coverforest's predictions achieve the desired level of coverage. In addition, its training and prediction times can be faster than an existing implementation by 2--9 times. The source code for the coverforest is hosted on GitHub at https://github.com/donlapark/coverforest.
中文: coverforest Python 包针对随机森林优化实现了高效共形预测方法,通过并行计算和Cython优化,在保证覆盖精度的同时将训练预测速度提升2-9倍。
English: The coverforest Python package efficiently implements conformal prediction methods optimized for random forests, achieving desired coverage with 2-9 times faster performance through parallel computing and Cython optimizations.
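coverforest's CV+ and Jackknife+-after-bootstrap methods reuse random-forest out-of-bag scores to avoid refitting; as a smaller, self-contained illustration of the family it implements, the sketch below performs split conformal regression with a plain scikit-learn random forest, which already yields the distribution-free coverage guarantee (targeting 90% here). The dataset and forest settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_fit, y_fit)

alpha = 0.1                                                       # 1 - alpha = target coverage
scores = np.abs(y_cal - rf.predict(X_cal))                        # calibration residuals
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)       # finite-sample conformal quantile

preds = rf.predict(X_test)
lower, upper = preds - q, preds + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}, interval half-width: {q:.2f}")
```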
Authors:Haifeng Wen, Hong Xing, Osvaldo Simeone
Abstract:
Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies. The code of our work is released on: https://github.com/HaifengWen/Distributed-Conformal-Prediction.
Chinese: 本研究提出了Q-DCP和H-DCP两种分布式共形预测方法,使数据有限的设备能通过任意图拓扑与邻居通信协作,实现具有统计保证的可靠推理。
English: This study introduces two distributed conformal prediction methods, Q-DCP and H-DCP, which enable devices with limited calibration data to collaboratively achieve reliable inference with statistical guarantees through neighbor communication over arbitrary graph topologies.
Authors:Wenzhang Liu, Lianjun Jin, Lu Ren, Chaoxu Mu, Changyin Sun
Abstract:
Intelligent decision-making within large and redundant action spaces remains challenging in deep reinforcement learning. Considering similar but ineffective actions at each step can lead to repetitive and unproductive trials. Existing methods attempt to improve agent exploration by reducing or penalizing redundant actions, yet they fail to provide quantitative and reliable evidence to determine redundancy. In this paper, we propose a method to improve exploration efficiency by estimating the causal effects of actions. Unlike prior methods, our approach offers quantitative results regarding the causality of actions for one-step transitions. We first pre-train an inverse dynamics model to serve as prior knowledge of the environment. Subsequently, we classify actions across the entire action space at each time step and estimate the causal effect of each action to suppress redundant actions during exploration. We provide a theoretical analysis to demonstrate the effectiveness of our method and present empirical results from simulations in environments with redundant actions to evaluate its performance. Our implementation is available at https://github.com/agi-brain/cee.git.
Chinese: 本文提出了一种通过估计动作的因果效应来提高深度强化学习中探索效率的方法,该方法能定量识别并抑制冗余动作,从而解决大动作空间中的决策挑战。
English: This paper introduces a method to enhance exploration efficiency in deep reinforcement learning by estimating the causal effects of actions, which quantitatively identifies and suppresses redundant actions to address challenges in large action spaces.
Authors:Fanxing Li, Fangyu Sun, Tianbao Zhang, Danping Zou
Abstract:
Quadrotor control policies can be trained with high performance using the exact gradients of the rewards to directly optimize policy parameters via backpropagation-through-time (BPTT). However, designing a fully differentiable reward architecture is often challenging. Partially differentiable rewards will result in biased gradient propagation that degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function. Additionally, it adopts entropy regularization and state initialization mechanisms to encourage exploration during training. We evaluate ABPT on four representative quadrotor flight tasks in both the real world and simulation. Experimental results demonstrate that ABPT converges significantly faster and achieves higher ultimate rewards than existing learning algorithms, particularly in tasks involving partially differentiable rewards. The code will be released at http://github.com/Fanxing-LI/ABPT.
中文: 提出的修正时间反向传播(ABPT)方法通过结合0步与N步回报及熵正则化,有效缓解了四旋翼控制训练中的梯度偏差,在部分可微奖励场景下比现有算法收敛更快且获得更高回报。
English: The proposed Amended Backpropagation-through-Time (ABPT) method effectively mitigates gradient bias in quadrotor control training by combining 0-step and N-step returns with entropy regularization, achieving faster convergence and higher rewards than existing algorithms, especially with partially differentiable rewards.
Authors:Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
Abstract:
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at \url{https://github.com/tangzhy/RealCritic}.
Chinese: 本文提出了一种闭环基准来评估大语言模型的批判能力,发现先进推理模型在自我批判和迭代场景中优于传统模型,而传统模型有时甚至表现低于基准水平。
English: This paper introduces a closed-loop benchmark to evaluate LLMs' critique capabilities, revealing that advanced reasoning models outperform classical LLMs in self-critique and iterative scenarios, with classical models sometimes regressing below baseline performance.
Authors:Xu Chu, Zhijie Tan, Hanlin Xue, Guanyu Wang, Tong Mo, Weiping Li
Abstract:
Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users' confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domain$o1$s, which enhances LLMs' reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models' explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s's leading performance and explainability. Our code is available at https://github.com/Hyalinesky/Domaino1s.
中文: 本文提出的Domain$o1$s方法通过监督微调和选择性树搜索增强大语言模型在专业领域的推理能力,结合新建评估指标验证了其在金融投资和法律问答任务中具备领先性能与可解释性。
English: This paper introduces Domain$o1$s, a method that enhances LLMs' reasoning and explainability in high-stakes domains through supervised fine-tuning with specialized datasets and selective tree exploration, validated by a new evaluation metric and experiments showing superior performance.
Authors:Lingwei Zhu, Han Wang, Yukie Nagai
Abstract:
Sparse continuous policies are distributions that can choose some actions at random yet keep strictly zero probability for the other actions, which are radically different from the Gaussian. They have important real-world implications, e.g. in modeling safety-critical tasks like medicine. The combination of offline reinforcement learning and sparse policies provides a novel paradigm that enables learning completely from logged datasets a safety-aware sparse policy. However, sparse policies can cause difficulty with the existing offline algorithms which require evaluating actions that fall outside of the current support. In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO). Specifically, we maintain a fat (heavy-tailed) proposal policy that effectively learns from the dataset and injects knowledge to a thin (sparse) policy, which is responsible for interacting with the environment. We instantiate FtTPO with the general $q$-Gaussian family that encompasses both heavy-tailed and sparse policies and verify that it performs favorably in a safety-critical treatment simulation and the standard MuJoCo suite. Our code is available at \url{https://github.com/lingweizhu/fat2thin}.
中文: 本文提出了Fat-to-Thin策略优化算法(FtTPO),这是首个能够通过从重尾提议策略向稀疏目标策略转移知识,从而有效从记录数据集中学习安全感知稀疏策略的离线算法,在安全关键仿真和标准测试中表现优异。
English: This paper introduces Fat-to-Thin Policy Optimization (FtTPO), the first offline algorithm that effectively learns safety-aware sparse policies from logged datasets by transferring knowledge from a heavy-tailed proposal policy to a sparse target policy, demonstrating strong performance in safety-critical simulations and standard benchmarks.
Authors:Xinyu Ma, Yifeng Xu, Yang Lin, Tianlong Wang, Xu Chu, Xin Gao, Junfeng Zhao, Yasha Wang
Abstract:
We introduce DRESS, a novel approach for generating stylized large language model (LLM) responses through representation editing. Existing methods like prompting and fine-tuning are either insufficient for complex style adaptation or computationally expensive, particularly in tasks like NPC creation or character role-playing. Our approach leverages the over-parameterized nature of LLMs to disentangle a style-relevant subspace within the model's representation space to conduct representation editing, ensuring a minimal impact on the original semantics. By applying adaptive editing strengths, we dynamically adjust the steering vectors in the style subspace to maintain both stylistic fidelity and semantic integrity. We develop two stylized QA benchmark datasets to validate the effectiveness of DRESS, and the results demonstrate significant improvements compared to baseline methods such as prompting and ITI. In short, DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it particularly useful for developing stylized conversational agents. Codes and benchmark datasets are available at https://github.com/ArthurLeoM/DRESS-LLM.
Chinese: DRESS是一种轻量级、无需训练的新方法,通过在大语言模型的表示空间中编辑风格子空间,在保持语义完整性的同时实现灵活有效的风格控制,经新基准数据集验证优于现有基线方法。
English: DRESS is a lightweight, train-free method that enhances large language models by editing representations in a style subspace, achieving superior style control without compromising semantics, as validated on new benchmark datasets.
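DRESS learns its style subspace and adaptive editing strengths from data, which is not reproduced here. The sketch below only shows the generic PyTorch mechanism representation editing relies on: a forward hook that shifts one layer's hidden states along a steering direction scaled by a strength alpha at inference time. The toy model, the chosen layer, and the random direction are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 32
model = nn.Sequential(nn.Linear(16, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 16))

# A style direction in the hidden space (in DRESS this comes from a learned style subspace).
style_dir = torch.randn(hidden_dim)
style_dir = style_dir / style_dir.norm()
alpha = 2.0                                   # editing strength (adaptive in the paper)

def steer_hook(module, inputs, output):
    # Shift the layer's activations along the style direction.
    return output + alpha * style_dir

handle = model[0].register_forward_hook(steer_hook)

x = torch.randn(4, 16)
edited = model(x)                             # outputs with representation editing applied
handle.remove()                               # disable editing
plain = model(x)
print((edited - plain).abs().mean().item())   # nonzero: the outputs were steered
```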
Authors:Taha Emre, Teresa Araújo, Marzieh Oghbaie, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
Abstract:
Neovascular age-related macular degeneration (nAMD) is a leading cause of vision loss among older adults, where disease activity detection and progression prediction are critical for nAMD management in terms of timely drug administration and improving patient outcomes. Recent advancements in deep learning offer a promising solution for predicting changes in AMD from optical coherence tomography (OCT) retinal volumes. In this work, we proposed deep learning models for the two tasks of the public MARIO Challenge at MICCAI 2024, designed to detect and forecast changes in nAMD severity with longitudinal retinal OCT. For the first task, we employ a Vision Transformer (ViT) based Siamese Network to detect changes in AMD severity by comparing scan embeddings of a patient from different time points. To train a model to forecast the change after 3 months, we exploit, for the first time, an Earth Mover (Wasserstein) Distance-based loss to harness the ordinal relation within the severity change classes. Both models ranked high on the preliminary leaderboard, demonstrating that their predictive capabilities could facilitate nAMD treatment management.
中文: 本研究提出基于视觉变换器的孪生网络模型,结合Wasserstein距离损失函数,能够通过纵向OCT扫描准确检测和预测新生血管性年龄相关性黄斑变性的严重程度变化,为临床治疗管理提供有效支持。
English: This study introduces deep learning models using Vision Transformers and Wasserstein Distance loss to effectively detect and predict neovascular AMD severity changes from longitudinal OCT scans, showing strong potential for improving treatment management.
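The submission's exact loss is not given in the abstract beyond being Earth Mover (Wasserstein) Distance-based. A common way to realize such an ordinal-aware objective is the squared EMD between the predicted class distribution and a one-hot target, sketched below, which penalizes predictions that land far from the true severity-change class more heavily than cross-entropy would; the class layout in the example is an assumption.

```python
import torch
import torch.nn.functional as F

def squared_emd_loss(logits, targets):
    """
    logits:  (B, C) raw scores over C ordered severity-change classes
    targets: (B,)   integer class labels
    """
    probs = torch.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    cdf_pred = torch.cumsum(probs, dim=-1)          # cumulative predicted distribution
    cdf_true = torch.cumsum(one_hot, dim=-1)        # cumulative target distribution
    return ((cdf_pred - cdf_true) ** 2).sum(dim=-1).mean()

logits = torch.randn(4, 3, requires_grad=True)      # e.g. classes: decreased / stable / increased
targets = torch.tensor([0, 2, 1, 1])
loss = squared_emd_loss(logits, targets)
loss.backward()
print(loss.item())
```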
Authors:Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
Abstract:
Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain. Our benchmark and code are available at https://github.com/DSL-Lab/aops
中文: 本文提出一种利用解题艺术论坛资源的自动化流程,构建了包含60多万高质量数学问答对的AoPS-Instruct数据集,实验表明该数据集能提升大语言模型的推理能力,同时通过带时间戳的动态基准揭示模型性能随时间下降的现象。
English: This paper introduces an automated pipeline using the Art of Problem Solving forum to create AoPS-Instruct, a large-scale dataset of over 600,000 high-quality math QA pairs, which enhances LLMs' reasoning abilities and provides a contamination-resistant benchmark revealing their performance decline over time.
Authors:Mitch Kosieradzki, Seongjin Choi
Abstract:
For intelligent transportation systems and autonomous vehicles to operate safely and efficiently, they must reliably predict the future motion and trajectory of surrounding agents within complex traffic environments. At the same time, the motion of these agents is inherently uncertain, making accurate prediction difficult. In this paper, we propose \textbf{TrajFlow}, a generative framework for estimating the occupancy density of dynamic agents. Our framework utilizes a causal encoder to extract semantically meaningful embeddings of the observed trajectory, as well as a normalizing flow to decode these embeddings and determine the most likely future location of an agent at some time point in the future. Our formulation differs from existing approaches because we model the marginal distribution of spatial locations instead of the joint distribution of unobserved trajectories. The advantages of a marginal formulation are numerous. First, we demonstrate that the marginal formulation produces higher accuracy on challenging trajectory forecasting benchmarks. Second, the marginal formulation allows for fully continuous sampling of future locations. Finally, marginal densities are better suited for downstream tasks as they allow for the computation of per-agent motion trajectories and occupancy grids, the two most commonly used representations for motion forecasting. We present a novel architecture based entirely on neural differential equations as an implementation of this framework and provide ablations to demonstrate the advantages of a continuous implementation over a more traditional discrete neural network based approach. The code is available at https://github.com/UMN-Choi-Lab/TrajFlow.
中文: 提出的TrajFlow框架通过因果编码器和归一化流建模动态智能体空间位置的边际分布,在复杂交通环境中实现了更高精度的运动预测和连续采样能力。
English: The proposed TrajFlow framework uses a causal encoder and normalizing flow to model the marginal distribution of dynamic agents' spatial locations, achieving higher accuracy and continuous sampling for motion prediction in complex traffic environments.
Authors:Yiyun Zhou, Wenkang Han, Jingyuan Chen
Abstract:
Knowledge Tracing (KT) is a fundamental component of Intelligent Tutoring Systems (ITS), enabling the modeling of students' knowledge states to predict future performance. The introduction of Deep Knowledge Tracing (DKT), the first deep learning-based KT (DLKT) model, has brought significant advantages in terms of applicability and comprehensiveness. However, recent DLKT models, such as Attentive Knowledge Tracing (AKT), have often prioritized predictive performance at the expense of these benefits. While deep sequential models like DKT have shown potential, they face challenges related to parallel computing, storage decision modification, and limited storage capacity. To address these limitations, we propose DKT2, a novel KT model that leverages the recently developed xLSTM architecture. DKT2 enhances applicable input representation using the Rasch model and incorporates Item Response Theory (IRT) for output interpretability, allowing for the decomposition of learned knowledge into familiar and unfamiliar knowledge. By integrating this knowledge with predicted questions, DKT2 generates comprehensive knowledge states. Extensive experiments conducted across three large-scale datasets demonstrate that DKT2 consistently outperforms 18 baseline models in various prediction tasks, underscoring its potential for real-world educational applications. This work bridges the gap between theoretical advancements and practical implementation in KT. Our code and datasets are fully available at https://github.com/zyy-2001/DKT2.
中文: DKT2是一种新颖的知识追踪模型,它采用xLSTM架构,结合拉什模型和项目反应理论来增强输入表示和输出可解释性,并在多种预测任务中持续优于18个基线模型,显示出在实际教育应用中的巨大潜力。
English: DKT2 is a novel knowledge tracing model that utilizes the xLSTM architecture, integrates the Rasch model and Item Response Theory for enhanced input representation and output interpretability, and consistently outperforms 18 baseline models across various prediction tasks, demonstrating strong potential for real-world educational applications.
Authors:Joshua Davis, Thomas Sounack, Kate Sciacca, Jessie M Brain, Brigitte N Durieux, Nicole D Agaronnik, Charlotta Lindvall
Abstract:
Extracting sections from clinical notes is crucial for downstream analysis but is challenging due to variability in formatting and the labor-intensive nature of manual sectioning. While proprietary large language models (LLMs) have shown promise, privacy concerns limit their accessibility. This study develops a pipeline for automated note sectioning using open-source LLMs, focusing on three sections: History of Present Illness, Interval History, and Assessment and Plan. We fine-tuned three open-source LLMs to extract sections using a curated dataset of 487 progress notes, comparing results relative to proprietary models (GPT-4o, GPT-4o mini). Internal and external validity were assessed via precision, recall and F1 score. Fine-tuned Llama 3.1 8B outperformed GPT-4o (F1=0.92). On the external validity test set, performance remained high (F1=0.85). Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning, offering advantages in cost, performance, and accessibility.
中文: 本研究开发了一种基于微调开源大语言模型的临床笔记自动分段流程,在超越专有模型性能的同时,有效解决了隐私保护和可访问性问题。
English: This study develops a pipeline using fine-tuned open-source LLMs to automate clinical note sectioning, demonstrating superior performance over proprietary models while addressing privacy and accessibility concerns.
Authors:Sneh Pandya, Purvik Patel, Brian D. Nord, Mike Walmsley, Aleksandra Ćiprijanović
Abstract:
Modern neural networks (NNs) often do not generalize well in the presence of a "covariate shift"; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels remains unchanged. In such cases, NN generalization can be reduced to a problem of learning more domain-invariant features. Domain adaptation (DA) methods include a range of techniques aimed at achieving this; however, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observations. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs). We find that SIDDA enhances the generalization capabilities of NNs, achieving up to a $\approx40\%$ improvement in classification accuracy on unlabeled target data. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group $D_N$, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA enhances model calibration on both source and target data--achieving over an order of magnitude improvement in the ECE and Brier score. SIDDA's versatility, combined with its automated approach to domain alignment, has the potential to advance multi-dataset studies by enabling the development of highly generalizable models.
中文: SIDDA是一种基于Sinkhorn散度的高效域自适应算法,能以最少的超参数调优实现域对齐,显著提升神经网络在不同数据集上的泛化能力和模型校准效果。
English: SIDDA is an efficient domain adaptation algorithm that uses Sinkhorn divergence to achieve domain alignment with minimal hyperparameter tuning, significantly improving neural network generalization and model calibration across various datasets.
Authors:Ioannis Nasios
Abstract:
Kelp forests, as foundation species, are vital to marine ecosystems, providing essential food and habitat for numerous organisms. This study explores the integration of crowdsourced labels with advanced artificial intelligence models to develop a fast and accurate kelp canopy detection pipeline using Landsat images. Building on the success of a machine learning competition, where this approach ranked third and performed consistently well on both local validation and public and private leaderboards, the research highlights the effectiveness of combining Mixed Vision Transformers (MIT) with ConvNeXt models. Training these models on various image sizes significantly enhanced the accuracy of the ensemble results. U-Net emerged as the best segmentation architecture, with UpperNet also contributing to the final ensemble. Key Landsat bands, such as ShortWave InfraRed (SWIR1) and Near-InfraRed (NIR), were crucial, while altitude data was used in postprocessing to eliminate false positives on land. The methodology achieved a high detection rate, accurately identifying about three out of four pixels containing kelp canopy while keeping false positives low. Despite the medium resolution of Landsat satellites, their extensive historical coverage makes them effective for studying kelp forests. This work also underscores the potential of combining machine learning models with crowdsourced data for effective and scalable environmental monitoring. All running code for training all models and inference can be found at https://github.com/IoannisNasios/Kelp_Forests.
中文摘要:本研究通过整合众包标签与混合视觉转换器和ConvNeXt等人工智能模型,利用Landsat图像开发出高效的海藻冠层检测方法,在中等卫星分辨率下仍能实现精确识别。
English Summary: This study demonstrates a highly effective method for detecting kelp canopies using Landsat imagery by integrating crowdsourced labels with AI models like Mixed Vision Transformers and ConvNeXt, achieving accurate results despite medium satellite resolution.
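As a minimal sketch of the post-processing step mentioned above (masking out kelp predictions on land with altitude data), the snippet below zeroes any pixel whose elevation lies above sea level; the array names and thresholds are assumptions.

```python
import numpy as np

def remove_land_false_positives(kelp_prob, elevation, threshold=0.5, land_elev=0.0):
    """Binarize kelp-canopy probabilities and zero out pixels that lie on land."""
    kelp_mask = (kelp_prob >= threshold).astype(np.uint8)
    kelp_mask[elevation > land_elev] = 0       # kelp canopy cannot occur above sea level
    return kelp_mask

# Toy 4x4 tile whose right half is "land".
prob = np.random.default_rng(0).uniform(size=(4, 4))
elev = np.zeros((4, 4))
elev[:, 2:] = 10.0
print(remove_land_false_positives(prob, elev))
```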
Authors:Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, Sifan Zhou
Abstract:
Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5\% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32\% on the LLaMA-3-8B model compared to state-of-the-art methods. \href{https://github.com/BrotherHappy/OSTQuant}{https://github.com/BrotherHappy/OSTQuant}.
中文: 本文提出OSTQuant方法,通过正交变换和缩放变换优化量化空间中的数据分布,在大语言模型压缩中实现了卓越性能,同时保持高精度。
English: This paper introduces OSTQuant, a novel quantization method that optimizes data distribution across the quantization space using orthogonal and scaling transformations, achieving superior performance in compressing Large Language Models while maintaining high accuracy.
Authors:Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Abstract:
Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the $k$-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.
Chinese Summary: 本文提出基于属性的视觉重编程方法(AttrVR),通过利用描述性和区分性属性动态优化输入模式,有效降低类内差异并提升CLIP模型在12个下游任务中的分类性能。
English Summary: This paper introduces Attribute-based Visual Reprogramming (AttrVR) for CLIP, which leverages descriptive and distinctive attributes to dynamically optimize input patterns, reducing intra-class variance and improving classification performance across 12 downstream tasks.
Authors:Hao Dong, Eleni Chatzi, Olga Fink
Abstract:
Test-time adaptation (TTA) has demonstrated significant potential in addressing distribution shifts between training and testing data. Open-set test-time adaptation (OSTTA) aims to adapt a source pre-trained model online to an unlabeled target domain that contains unknown classes. This task becomes more challenging when multiple modalities are involved. Existing methods have primarily focused on unimodal OSTTA, often filtering out low-confidence samples without addressing the complexities of multimodal data. In this work, we present Adaptive Entropy-aware Optimization (AEO), a novel framework specifically designed to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA) for the first time. Our analysis shows that the entropy difference between known and unknown samples in the target domain strongly correlates with MM-OSTTA performance. To leverage this, we propose two key components: Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP). These components enhance the model's ability to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. To thoroughly evaluate our proposed methods in the MM-OSTTA setting, we establish a new benchmark derived from existing datasets. This benchmark includes two downstream tasks and incorporates five modalities. Extensive experiments across various domain shift situations demonstrate the efficacy and versatility of the AEO framework. Additionally, we highlight the strong performance of AEO in long-term and continual MM-OSTTA settings, both of which are challenging and highly relevant to real-world applications. Our source code is available at https://github.com/donghao51/AEO.
中文: AEO框架通过自适应熵优化增强多模态开放集测试时适应能力,利用已知与未知样本间的熵差异提升模型对未知类别的识别性能。
English: The AEO framework introduces adaptive entropy optimization to enhance multimodal open-set test-time adaptation by distinguishing unknown classes through amplified entropy differences between known and unknown samples.
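As a rough illustration of the entropy-gap idea, here is a minimal sketch of a test-time objective that sharpens low-entropy (likely known) predictions and flattens high-entropy (likely unknown) ones; the thresholding rule and loss form are assumptions for exposition, not the AEO implementation.

```python
# Illustrative entropy-gap objective for open-set test-time adaptation (assumed form).
import torch

def shannon_entropy(probs, eps=1e-8):
    return -(probs * (probs + eps).log()).sum(dim=-1)     # per-sample entropy

def entropy_gap_loss(logits, threshold):
    """
    logits:    (B, C) fused multimodal predictions on an unlabeled test batch.
    threshold: entropy value separating likely-known from likely-unknown samples.
    """
    h = shannon_entropy(logits.softmax(dim=-1))
    known = h < threshold
    loss = logits.new_zeros(())
    if known.any():
        loss = loss + h[known].mean()                     # sharpen confident (known) samples
    if (~known).any():
        loss = loss - h[~known].mean()                    # push suspected unknowns to high entropy
    return loss
```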
Authors:Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Abstract:
With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed the IMAGINE-E benchmark and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
中文:随着FLUX.1和Ideogram2.0等扩散模型的突破,文本到图像模型在多领域展现出通用化潜力,但现有评估体系尚不完善,为此开发的IMAGINE-E基准通过五大关键维度对六款主流模型进行了系统评估,揭示了其作为基础AI工具的发展前景。
English: Recent advances in diffusion-based text-to-image models like FLUX.1 and Ideogram2.0 demonstrate expanding capabilities across multiple domains, though current evaluation frameworks remain inadequate for comprehensive assessment, prompting the development of the IMAGINE-E benchmark to systematically evaluate six leading models across five key domains.
Authors:Rui Li, Xiaohan Wang, Yuhui Zhang, Orr Zohar, Zeyu Wang, Serena Yeung-Levy
Abstract:
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.
Authors:Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Abstract:
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
中文摘要:本研究通过构建大规模人类偏好数据集、开发VideoReward奖励模型,并实施三种对齐算法,建立了一个利用人类反馈的系统化流程,显著提升了视频生成的流畅度和提示词对齐效果。
English Summary: This study introduces a systematic pipeline using human feedback to enhance video generation by creating a large-scale preference dataset, developing the VideoReward model, and implementing three alignment algorithms that significantly improve motion smoothness and prompt alignment.
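For reference, a generic DPO-style preference loss looks like the sketch below; in Flow-DPO the per-sample scores would come from flow-matching errors under the trained and frozen reference models, but here they are simply passed in as tensors, so treat this as an assumed, simplified form rather than the paper's exact objective.

```python
# Generic DPO-style preference loss (simplified sketch; score definitions are assumptions).
import torch
import torch.nn.functional as F

def dpo_loss(score_w, score_l, ref_score_w, ref_score_l, beta=0.1):
    """
    score_w / score_l:         model scores (log-likelihood surrogates) for preferred / rejected videos.
    ref_score_w / ref_score_l: the same scores under a frozen reference model.
    """
    margin = beta * ((score_w - ref_score_w) - (score_l - ref_score_l))
    return -F.logsigmoid(margin).mean()

# loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```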
Authors:Shiling Deng, Serge Belongie, Peter Ebert Christensen
Abstract:
Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that utilizes cross-modal embedding to enhance meme analysis, significantly improving retrieval performance. Our contributions include: (1) a novel dataset for large-scale meme study, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.
中文: 本研究通过推出大规模CM50表情包数据集、自动化标注流程及优化的CLIP检索模型,解决了深度表情包理解的研究空白,推动了规模化表情包分析的发展。
English: This study introduces CM50, a large-scale meme dataset with automated annotations and a fine-tuned CLIP model for meme-text retrieval, addressing gaps in deep meme comprehension and advancing scalable meme analysis.
Authors:Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Abstract:
Deep neural networks are increasingly employed in high-stakes medical applications, despite their tendency for shortcut learning in the presence of spurious correlations, which can have potentially fatal consequences in practice. Whereas a multitude of works address either the detection or mitigation of such shortcut behavior in isolation, the Reveal2Revise approach provides a comprehensive bias mitigation framework combining these steps. However, effectively addressing these biases often requires substantial labeling efforts from domain experts. In this work, we review the steps of the Reveal2Revise framework and enhance it with semi-automated interpretability-based bias annotation capabilities. This includes methods for the sample- and feature-level bias annotation, providing valuable information for bias mitigation methods to unlearn the undesired shortcut behavior. We show the applicability of the framework using four medical datasets across two modalities, featuring controlled and real-world spurious correlations caused by data artifacts. We successfully identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision Transformer models, ultimately increasing their robustness and applicability for real-world medical tasks. Our code is available at https://github.com/frederikpahde/medical-ai-safety.
中文:Reveal2Revise框架通过半自动化的基于可解释性的偏差标注功能得到增强,能有效识别和缓解医学AI模型中的捷径学习问题,从而提升了模型在多种数据集和架构中的鲁棒性。
English: The Reveal2Revise framework is enhanced with semi-automated interpretability-based bias annotation to effectively identify and mitigate shortcut learning in medical AI models, improving their robustness across various datasets and architectures.
Authors:Zhi Sheng, Daisy Yuan, Jingtao Ding, Yong Li
Abstract:
Accurate prediction of mobile traffic, i.e., network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel in capturing such complex temporal dynamics due to their ability to capture the inherent uncertainties. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into prior and residual components, with the prior derived from data dynamics, enhancing the model's ability to capture both regular and abrupt variations. NPDiff can seamlessly integrate with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement of over 30%, offering a new perspective on leveraging diffusion models in this domain. We provide code and data at https://github.com/tsinghua-fib-lab/NPDiff.
中文摘要:本文提出NPDiff框架,通过将噪声分解为先验和残差分量来改进移动流量预测,能更有效地捕捉规律模式和突发变化,实验表明其性能提升超过30%。
English Summary: This paper introduces NPDiff, a novel diffusion-based framework that enhances mobile traffic prediction by decomposing noise into prior and residual components, achieving over 30% performance improvement through better modeling of both regular patterns and abrupt variations.
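The decomposition could be pictured roughly as below: a data-driven "prior" component computed from recent dynamics is mixed with a random residual to form the noise used by the diffusion model. The choice of first-order differences as the prior and the mixing weight are purely illustrative assumptions; consult the linked repository for the actual construction.

```python
# Rough sketch of mixing a data-driven noise prior with a random residual (assumed recipe).
import torch

def composed_noise(history, alpha=0.5):
    """
    history: (B, T, N) recent traffic observations for B samples, T steps, N nodes.
    Returns noise of shape (B, N) combining a dynamics-based prior with a Gaussian residual.
    """
    delta = history[:, 1:] - history[:, :-1]                       # first-order dynamics
    prior = (delta - delta.mean(dim=1, keepdim=True)) / (delta.std(dim=1, keepdim=True) + 1e-6)
    prior = prior[:, -1]                                           # most recent dynamic pattern
    residual = torch.randn_like(prior)
    return alpha * prior + (1.0 - alpha) * residual
```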
Authors:Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang
Abstract:
This survey delves into the realm of Parameter-Efficient Fine-Tuning (PEFT) within the context of Foundation Models (FMs). PEFT, a cost-effective fine-tuning technique, minimizes parameters and computational complexity while striving for optimal downstream task performance. FMs, like ChatGPT, DALL-E, and LLaVA, specialize in language understanding, generative tasks, and multimodal tasks, trained on diverse datasets spanning text, images, and videos. The diversity of FMs guides various adaptation strategies for PEFT. Therefore, this survey aims to provide a comprehensive overview of PEFT techniques applied to diverse FMs and address critical gaps in understanding the techniques, trends, and applications. We start by providing a detailed development of FMs and PEFT. Subsequently, we systematically review the key categories and core mechanisms of PEFT across diverse FMs to offer a comprehensive understanding of trends. We also explore the most recent applications across various FMs to demonstrate the versatility of PEFT, shedding light on the integration of systematic PEFT methods with a range of FMs. Furthermore, we identify potential research and development directions for improving PEFTs in the future. This survey provides a valuable resource for both newcomers and experts seeking to understand and use the power of PEFT across FMs. All reviewed papers are listed at https://github.com/THUDM/Awesome-Parameter-Efficient-Fine-Tuning-for-Foundation-Models.
中文摘要:本综述全面探讨了基础模型的参数高效微调技术,系统分析了其核心机制、应用场景与发展趋势,为相关研究者提供了重要参考。
English Summary: This survey comprehensively reviews parameter-efficient fine-tuning techniques for foundation models, analyzing their mechanisms, applications, and future research directions to serve as a valuable resource for researchers.
Authors:Mingzhao Wang, You Zhou, Zhiguang Cao, Yubin Xiao, Xuan Wu, Wei Pang, Yuan Jiang, Hui Yang, Peng Zhao, Yuanshu Li
Abstract:
Recent advances in neural models have shown considerable promise in solving Traveling Salesman Problems (TSPs) without relying on much hand-crafted engineering. However, while non-autoregressive (NAR) approaches benefit from faster inference through parallelism, they typically deliver solutions of inferior quality compared to autoregressive ones. To enhance the solution quality while maintaining fast inference, we propose DEITSP, a diffusion model with efficient iterations tailored for TSP that operates in a NAR manner. Firstly, we introduce a one-step diffusion model that integrates the controlled discrete noise addition process with self-consistency enhancement, enabling optimal solution prediction through simultaneous denoising of multiple solutions. Secondly, we design a dual-modality graph transformer to bolster the extraction and fusion of features from node and edge modalities, while further accelerating the inference with fewer layers. Thirdly, we develop an efficient iterative strategy that alternates between adding and removing noise to improve exploration compared to previous diffusion methods. Additionally, we devise a scheduling framework to progressively refine the solution space by adjusting noise levels, facilitating a smooth search for optimal solutions. Extensive experiments on real-world and large-scale TSP instances demonstrate that DEITSP performs favorably against existing neural approaches in terms of solution quality, inference latency, and generalization ability. Our code is available at https://github.com/DEITSP/DEITSP.
中文: 提出的DEITSP模型通过非自回归扩散方法和高效迭代策略,在旅行商问题中提升了求解质量,并在质量、速度和泛化能力上优于现有方法。
English: The proposed DEITSP model enhances solution quality for Traveling Salesman Problems through a non-autoregressive diffusion approach with efficient iterations, outperforming existing methods in quality, speed, and generalization.
Authors:Abdulrahman Oladipupo Ibraheem
Abstract:
I introduce two novel loss functions for classification in deep learning. The two loss functions extend standard cross entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first of the two novel loss functions is termed mixed entropy loss (MIX-ENT for short), while the second one is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. However, it should be noted that the K-L divergence term here is different from that in the standard cross-entropy loss function, in the sense that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution which is output by the neural network. Experiments on the EMNIST-Letters dataset show that my implementation of MIX-ENT and MIN-ENT lets the VGG model climb from its previous 3rd position on the paperswithcode leaderboard to reach the 2nd position, outperforming the Spinal-VGG model in so doing. Specifically, using standard cross-entropy, VGG achieves 95.86% while Spinal-VGG achieves 95.88% classification accuracies, whereas using VGG (without Spinal-VGG) our MIN-ENT achieved 95.933%, while our MIX-ENT achieved 95.927% accuracies. The pre-trained models for both MIX-ENT and MIN-ENT are at https://github.com/rahmanoladi/minimum entropy project.
Chinese: 本文提出了两种新的损失函数MIX-ENT和MIN-ENT,通过引入最小熵和K-L散度正则化改进标准交叉熵,在EMNIST-Letters数据集上提升了VGG模型的分类准确率,使其在排行榜上超越原有排名。
English: This paper introduces two novel loss functions, MIX-ENT and MIN-ENT, which enhance standard cross-entropy by incorporating minimum entropy and K-L divergence regularization, leading to improved classification accuracy on the EMNIST-Letters dataset and advancing VGG's leaderboard position.
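The two objectives are simple enough to sketch directly. MIN-ENT below follows the abstract's description (cross-entropy plus an entropy penalty on the network's output distribution); for MIX-ENT the reverse-direction KL term is kept finite here with a label-smoothed target, which is an assumption of this sketch rather than the paper's formulation, and the weight `lam` is a hypothetical hyper-parameter.

```python
# Sketch of MIN-ENT and MIX-ENT style losses; `lam` and the label smoothing are assumptions.
import torch
import torch.nn.functional as F

def min_ent_loss(logits, targets, lam=0.1):
    """Standard cross-entropy plus a minimum-entropy regularizer on the predictions."""
    ce = F.cross_entropy(logits, targets)
    q = logits.softmax(dim=-1)
    h_q = -(q * q.clamp_min(1e-8).log()).sum(dim=-1).mean()
    return ce + lam * h_q

def mix_ent_loss(logits, targets, lam=0.1, smoothing=0.05):
    """Cross-entropy plus H(q) + KL(q || p), with a smoothed target p to keep the KL finite."""
    ce = F.cross_entropy(logits, targets)
    n_classes = logits.size(-1)
    q = logits.softmax(dim=-1)
    log_q = q.clamp_min(1e-8).log()
    p = F.one_hot(targets, n_classes).float() * (1 - smoothing) + smoothing / n_classes
    h_q = -(q * log_q).sum(dim=-1)
    kl_q_p = (q * (log_q - p.log())).sum(dim=-1)
    return ce + lam * (h_q + kl_q_p).mean()
```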
Authors:Olaya Pérez-Mon, Juan José del Coz, Pablo González
Abstract:
Quantification, or prevalence estimation, is the task of predicting the prevalence of each class within an unknown bag of examples. Most existing quantification methods in the literature rely on prior probability shift assumptions to create a quantification model that uses the predictions of an underlying classifier to make optimal prevalence estimates. In this work, we present an end-to-end neural network that uses Gaussian distributions in latent spaces to obtain invariant representations of bags of examples. This approach addresses the quantification problem using deep learning, enabling the optimization of specific loss functions relevant to the problem and avoiding the need for an intermediate classifier, tackling the quantification problem as a direct optimization problem. Our method achieves state-of-the-art results, both against traditional quantification methods and other deep learning approaches for quantification. The code needed to reproduce all our experiments is publicly available at https://github.com/AICGijon/gmnet.
Chinese: 本文提出了一种端到端的神经网络,利用潜在空间中的高斯分布直接优化量化任务,无需依赖中间分类器即可实现最先进的成果。
English: This paper introduces an end-to-end neural network that leverages Gaussian distributions in latent spaces to directly optimize quantification tasks, achieving state-of-the-art results without relying on intermediate classifiers.
Authors:Younes Yousef, Lukas Galke, Ansgar Scherp
Abstract:
Recent approaches in hierarchical text classification (HTC) rely on the capabilities of a pre-trained transformer model and exploit the label semantics and a graph encoder for the label hierarchy. In this paper, we introduce an effective hierarchical text classifier RADAr (Transformer-based Autoregressive Decoder Architecture) that is based only on an off-the-shelf RoBERTa transformer to process the input and a custom autoregressive decoder with two decoder layers for generating the classification output. Thus, unlike existing approaches for HTC, the encoder of RADAr has no explicit encoding of the label hierarchy and the decoder solely relies on the label sequences of the samples observed during training. We demonstrate on three benchmark datasets that RADAr achieves results competitive to the state of the art with less training and inference time. Our model consistently performs better when organizing the label sequences from children to parents versus the inverse, as done in existing HTC approaches. Our experiments show that neither the label semantics nor an explicit graph encoder for the hierarchy is needed. This has strong practical implications for HTC as the architecture has fewer requirements and provides a speed-up by a factor of 2 at inference time. Moreover, training a separate decoder from scratch in conjunction with fine-tuning the encoder allows future researchers and practitioners to exchange the encoder part as new models arise. The source code is available at https://github.com/yousef-younes/RADAr.
Chinese: RADAr模型提出了一种基于Transformer的层次文本分类器,采用标准RoBERTa编码器和自定义自回归解码器,无需显式标签层次编码即可达到先进水平,同时将推理时间减半。
English: The RADAr model introduces a transformer-based hierarchical text classifier that uses a standard RoBERTa encoder and a custom autoregressive decoder, achieving competitive results without explicit label hierarchy encoding while reducing inference time by half.
Authors:Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Abstract:
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training in large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.
Chinese Summary: 本文提出“1Prompt1Story”方法,通过合并提示词并采用奇异值重加权和身份保持交叉注意力技术,无需训练即可实现一致性文本到图像生成,在实验中优于现有方法。
English Summary: The paper introduces "1Prompt1Story," a training-free method that enhances consistent text-to-image generation by combining prompts and refining outputs with Singular-Value Reweighting and Identity-Preserving Cross-Attention, outperforming existing approaches.
Authors:Samer Attrah
Abstract:
Emotion estimation is a long-studied field with several existing machine-learning approaches. In this paper, we present an LSTM model that processes the blend-shapes produced by the MediaPipe library for a face detected in a live camera stream to estimate the main emotion from facial expressions. The model is trained on the FER2013 dataset and delivers 71% accuracy and a 62% F1-score, meeting the accuracy benchmark of the FER2013 dataset with significantly reduced computation costs. https://github.com/Samir-atra/Emotion_estimation_from_video_footage_with_LSTM_ML_algorithm
中文: 本文提出一种LSTM模型,通过处理实时视频中MediaPipe生成的面部混合形状来估算情绪,在FER2013数据集上达到71%准确率和62% F1分数,且显著降低了计算成本。
English: This paper introduces an LSTM model that analyzes MediaPipe-generated facial blend-shapes from live video to estimate emotions, achieving 71% accuracy and 62% F1-score on the FER2013 dataset with lower computational costs.
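A PyTorch model matching this description can be as small as the sketch below; the 52 blendshape channels and 7 emotion classes are assumptions based on common MediaPipe/FER setups, not values taken from the repository.

```python
# Minimal LSTM emotion classifier over per-frame blendshape vectors (dimensions are assumed).
import torch
import torch.nn as nn

class BlendshapeLSTM(nn.Module):
    def __init__(self, n_blendshapes=52, hidden=64, n_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(n_blendshapes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x):                     # x: (B, T, n_blendshapes) blendshape sequences
        _, (h, _) = self.lstm(x)              # h: (1, B, hidden) final hidden state
        return self.head(h[-1])               # (B, n_emotions) emotion logits

# logits = BlendshapeLSTM()(torch.randn(8, 30, 52))   # 8 clips of 30 frames each
```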
Authors:Yiming Tang, Abrar Anwar, Jesse Thomason
Abstract:
Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Social signals include body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Past work in multi-party interaction tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking to simultaneously process multiple social cues across multiple participants and their temporal interactions. We train and evaluate M3PT on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/.
中文摘要:本研究提出M3PT模型,通过处理多方交互中的多模态社交线索及其时序关联,统一预测社交信号,在HHCD数据集上实现了进食时机和说话状态预测性能的提升。
English Summary: This research introduces M3PT, a unified transformer model that predicts multimodal social signals in multi-party interactions by processing multiple cues across participants and time, showing improved performance in bite timing and speaking status prediction on the HHCD dataset.
Authors:Joshua Park, Yongfeng Zhang
Abstract:
Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for AgentRec recommendation system at https://github.com/joshprk/agentrec.
Chinese: 该研究提出了一种基于Sentence-BERT的新架构,通过自然语言提示推荐最适合任务的LLM代理,实现了92.2%的Top-1准确率,并通过强化学习提供了可解释性和适应性。
English: The study introduces a novel architecture using Sentence-BERT to recommend the most suitable LLM agent for tasks based on natural language prompts, achieving 92.2% top-1 accuracy efficiently and offering interpretability and adaptability through reinforcement learning.
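The nearest-neighbor recommendation step can be sketched with the sentence-transformers library as below; the model name, example prompts, and agent labels are placeholders, and the real system adds fine-tuning and RLHF alignment on top of this retrieval core.

```python
# Nearest-neighbor agent recommendation over Sentence-BERT embeddings (placeholder data).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

indexed_prompts = ["Write a Python function to parse CSV files",
                   "Summarize this legal contract",
                   "Plan a three-day trip to Kyoto"]
indexed_agents = ["code-agent", "legal-agent", "travel-agent"]
index_emb = model.encode(indexed_prompts, convert_to_tensor=True, normalize_embeddings=True)

def recommend(prompt: str) -> str:
    query = model.encode(prompt, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(query, index_emb)          # (1, N) cosine similarities
    return indexed_agents[int(sims.argmax())]

print(recommend("Refactor this function and add unit tests"))   # likely "code-agent"
```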
Authors:Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Abstract:
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: https://github.com/TonyBY/RAMQA
中文摘要:RAMQA框架通过结合排序学习方法和生成式重排技术,弥合了传统排序模型与大型语言模型之间的差距,在多模态问答基准测试中实现了显著性能提升。
English Summary: The proposed RAMQA framework integrates learning-to-rank methods with generative permutation techniques to bridge the gap between traditional ranking models and modern LLMs, achieving significant performance improvements on multi-modal QA benchmarks.
Authors:Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev
Abstract:
Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
Chinese: 共享循环记忆变换器(SRMT)通过汇集并全局广播各智能体的工作记忆,在多智能体路径规划任务中表现出色,尤其在稀疏奖励环境下优于基线方法,并在POGEMA基准测试中与先进算法性能相当。
English: The Shared Recurrent Memory Transformer (SRMT) enhances multi-agent coordination by pooling and broadcasting agents' working memories, achieving superior performance in navigation tasks and competitive results on POGEMA benchmarks compared to existing methods.
Authors:Yichen Wu, Hongming Piao, Long-Kai Huang, Renzhen Wang, Wanhua Li, Hanspeter Pfister, Deyu Meng, Kede Ma, Ying Wei
Abstract:
Continual Learning (CL) with foundation models has recently emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks. However, existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks, which poses significant scalability challenges as the number of tasks grows. To address these limitations, we propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal. Our empirical and theoretical analysis reveals that SD-LoRA tends to follow a low-loss trajectory and converges to an overlapping low-loss region for all learned tasks, resulting in an excellent stability-plasticity trade-off. Building upon these insights, we introduce two variants of SD-LoRA with further improved parameter efficiency. All parameters of SD-LoRAs can be end-to-end optimized for CL objectives. Meanwhile, they support efficient inference by allowing direct evaluation with the finally trained model, obviating the need for component selection. Extensive experiments across multiple CL benchmarks and foundation models consistently validate the effectiveness of SD-LoRA. The code is available at https://github.com/WuYichen-97/SD-Lora-CL.
中文:SD-LoRA通过解耦LoRA组件的幅度和方向学习,解决了持续学习中的扩展性难题,无需样本回放即可实现高效的类增量学习,并在多个基准测试中展现出优越的性能。
English: SD-LoRA addresses scalability challenges in continual learning by decoupling the optimization of LoRA components' magnitude and direction, enabling efficient, rehearsal-free class incremental learning with strong empirical performance and theoretical backing.
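A single-layer sketch of the magnitude/direction decoupling is given below for one linear layer and one task; the initialization, the Frobenius normalization, and the omission of the continual-learning bookkeeping are all simplifications assumed for illustration.

```python
# LoRA layer with decoupled magnitude and direction (simplified, single-task sketch).
import torch
import torch.nn as nn

class DecoupledLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base                                   # frozen pre-trained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.magnitude = nn.Parameter(torch.tensor(1.0))   # learned separately from direction

    def forward(self, x):
        delta = self.B @ self.A                            # (out, in) low-rank update
        direction = delta / (delta.norm() + 1e-8)          # unit Frobenius-norm direction
        return self.base(x) + x @ (self.magnitude * direction).t()
```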
Authors:Qiongyan Wang, Yutong Xia, Siru Zhong, Weichuang Li, Yuankai Wu, Shifen Cheng, Junbo Zhang, Yu Zheng, Yuxuan Liang
Abstract:
Monitoring real-time air quality is essential for safeguarding public health and fostering social progress. However, the widespread deployment of air quality monitoring stations is constrained by their significant costs. To address this limitation, we introduce AirRadar, a deep neural network designed to accurately infer real-time air quality in locations lacking monitoring stations by utilizing data from existing ones. By leveraging learnable mask tokens, AirRadar reconstructs air quality features in unmonitored regions. Specifically, it operates in two stages: first capturing spatial correlations and then adjusting for distribution shifts. We validate AirRadar's efficacy using a year-long dataset from 1,085 monitoring stations across China, demonstrating its superiority over multiple baselines, even with varying degrees of unobserved data. The source code can be accessed at https://github.com/CityMind-Lab/AirRadar.
中文: AirRadar是一种深度神经网络,通过利用现有监测站数据,采用捕捉空间相关性和调整分布偏移的两阶段方法,精准推断未监测区域的实时空气质量,在中国大规模数据集上验证了其优越性能。
English: AirRadar is a deep neural network that accurately infers real-time air quality in unmonitored areas by leveraging data from existing stations through a two-stage process of capturing spatial correlations and adjusting for distribution shifts, validated as superior on a large Chinese dataset.
Authors:Jiachen Lei, Julius Berner, Jiongxiao Wang, Zhongzhu Chen, Zhongjia Ba, Kui Ren, Jun Zhu, Anima Anandkumar
Abstract:
Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85$\times$ on average. Codes are available at: https://github.com/jiachenlei/rRCM.
中文: 本研究提出了一种新方法,将生成式扩散模型重构为潜在空间中的判别任务,显著提升了对抗扰动的鲁棒性,同时将推理成本降低85倍,并在多个数据集上实现了最先进的认证精度。
English: This study introduces a novel method that reformulates generative diffusion modeling as a discriminative task in latent space, significantly enhancing robustness against adversarial perturbations while reducing inference costs by 85 times and achieving state-of-the-art certified accuracy across various datasets.
Authors:Yifan Hu, Guibin Zhang, Peiyuan Liu, Disen Lan, Naiqi Li, Dawei Cheng, Tao Dai, Shu-Tao Xia, Shirui Pan
Abstract:
Time series forecasting methods generally fall into two main categories: Channel Independent (CI) and Channel Dependent (CD) strategies. While CI overlooks important covariate relationships, CD captures all dependencies without distinction, introducing noise and reducing generalization. Recent advances in Channel Clustering (CC) aim to refine dependency modeling by grouping channels with similar characteristics and applying tailored modeling techniques. However, coarse-grained clustering struggles to capture complex, time-varying interactions effectively. To address these challenges, we propose TimeFilter, a GNN-based framework for adaptive and fine-grained dependency modeling. After constructing the graph from the input sequence, TimeFilter refines the learned spatial-temporal dependencies by filtering out irrelevant correlations while preserving the most critical ones in a patch-specific manner. Extensive experiments on 13 real-world datasets from diverse application domains demonstrate the state-of-the-art performance of TimeFilter. The code is available at https://github.com/TROUBADOUR000/TimeFilter.
中文: TimeFilter是一种基于图神经网络的框架,通过自适应地过滤无关的通道间依赖关系并保留关键关联,提升了时间序列预测性能,在多个现实数据集中实现了领先水平。
English: TimeFilter is a GNN-based framework that enhances time series forecasting by adaptively filtering out irrelevant inter-channel dependencies while preserving critical ones, achieving state-of-the-art results across diverse datasets.
Authors:Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, Alexander Panchenko
Abstract:
Retrieval Augmented Generation (RAG) improves correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs' intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
中文摘要:对RAG系统中自适应检索方法的全面评估表明,不确定性估计技术在效率和自我认知方面常优于复杂流程,同时保持相当的问答性能。
English Summary: Adaptive retrieval methods in RAG systems are comprehensively evaluated, revealing that uncertainty estimation techniques often surpass complex pipelines in efficiency and self-knowledge while maintaining similar QA performance.
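A typical uncertainty-gated pipeline from this family can be sketched as follows; the `llm` and `retriever` callables and the entropy threshold are placeholders, and mean token entropy stands in for whichever uncertainty estimate a given method uses.

```python
# Uncertainty-gated retrieval sketch: retrieve only when the model looks unsure (placeholders).
import torch

def mean_token_entropy(step_logits):
    """step_logits: list of (1, V) logit tensors, one per generated token."""
    ents = []
    for logits in step_logits:
        p = logits.softmax(dim=-1)
        ents.append(-(p * p.clamp_min(1e-9).log()).sum(dim=-1))
    return torch.stack(ents).mean()

def answer(question, llm, retriever, tau=2.0):
    text, step_logits = llm(question)                 # parametric-only attempt
    if mean_token_entropy(step_logits) <= tau:
        return text                                   # confident: skip retrieval entirely
    context = retriever(question)                     # uncertain: fall back to RAG
    text, _ = llm(f"{context}\n\n{question}")
    return text
```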
Authors:Jesus Renero, Idoia Ochoa, Roberto Maestre
Abstract:
Explainability techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce REX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables.
Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that REX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, REX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taken together, these results showcase REX's effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, REX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures. REX is publicly available at https://github.com/renero/causalgraph.
Chinese: 本文提出REX这一新型因果发现方法,通过将机器学习与可解释性技术(如Shapley值)相结合来准确识别因果关系,在合成和真实数据集上均展现出优于现有方法的性能表现。
English: This paper introduces REX, a novel causal discovery method that integrates machine learning with explainability techniques like Shapley values to accurately identify causal relationships, demonstrating superior performance over existing methods in both synthetic and real-world datasets.
Authors:Haocheng Luo, Tuan Truong, Tung Pham, Mehrtash Harandi, Dinh Phung, Trung Le
Abstract:
Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM's training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness, and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of second- and third-order terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM's effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, limiting SAM's efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at https://github.com/RitianLuo/EigenSAM.
中文: 本研究分析了锐度感知最小化(SAM)方法,发现其有效性依赖于扰动向量与海森矩阵主特征向量的对齐,据此提出了Eigen-SAM算法,通过显式优化这种对齐来提升正则化效果和计算效率。
English: This study analyzes Sharpness-Aware Minimization (SAM) and reveals that its effectiveness depends on aligning the perturbation vector with the Hessian's top eigenvector, leading to the development of Eigen-SAM, which explicitly optimizes this alignment to improve regularization and efficiency.
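The quantity Eigen-SAM aligns the perturbation with, the leading Hessian eigenvector, can be estimated with power iteration on Hessian-vector products as sketched below; this is a generic estimator over a flat parameter list, not the paper's implementation, and the iteration count is an arbitrary choice.

```python
# Power-iteration estimate of the top Hessian eigenvector via Hessian-vector products (sketch).
import torch

def top_hessian_eigvec(loss_fn, params, n_iter=10):
    """params: list of tensors with requires_grad=True; loss_fn() recomputes the training loss."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(n_iter):
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params)           # Hessian-vector product H v
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv)) + 1e-12
        v = [h / norm for h in hv]
    return v                                           # approximate leading eigenvector direction

# Eigen-SAM would then align the SAM perturbation with this direction before the ascent step.
```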
Authors:Qiong Wu, Maoxin Ji, Pingyi Fan, Kezhi Wang, Nan Cheng, Wen Chen, Khaled B. Letaief
Abstract:
On-ramp merging presents a critical challenge in autonomous driving, as vehicles from merging lanes need to dynamically adjust their positions and speeds while monitoring traffic on the main road to prevent collisions. To address this challenge, we propose a novel merging control scheme based on reinforcement learning, which integrates lateral control mechanisms. This approach ensures the smooth integration of vehicles from the merging lane onto the main road, optimizing both fuel efficiency and passenger comfort. Furthermore, we recognize the impact of vehicle-to-vehicle (V2V) communication on control strategies and introduce an enhanced protocol leveraging Cellular Vehicle-to-Everything (C-V2X) Mode 4. This protocol aims to reduce the Age of Information (AoI) and improve communication reliability. In our simulations, we employ two AoI-based metrics to rigorously assess the protocol's effectiveness in autonomous driving scenarios. By combining the NS3 network simulator with Python, we simulate V2V communication and vehicle control simultaneously. The results demonstrate that the enhanced C-V2X Mode 4 outperforms the standard version, while the proposed control scheme ensures safe and reliable vehicle operation during on-ramp merging.
中文: 基于强化学习的匝道汇合控制方案与增强型C-V2X Mode 4协议,通过优化车辆协调和通信可靠性,显著提升了自动驾驶匝道汇入的安全性与效率。
English: The proposed reinforcement learning-based merging control scheme with enhanced C-V2X Mode 4 protocol significantly improves safety and efficiency in autonomous on-ramp merging by optimizing vehicle coordination and communication reliability.
Authors:Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng
Abstract:
We introduce ADEPT: Adaptive Data ExPloiTation, a simple yet powerful framework to enhance the data efficiency and generalization in deep reinforcement learning (RL). Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms, optimizing data utilization while mitigating overfitting. Moreover, ADEPT can significantly reduce the computational overhead and accelerate a wide range of RL algorithms. We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet. Extensive simulation demonstrates that ADEPT can achieve superior performance with remarkable computational efficiency, offering a practical solution to data-efficient RL. Our code is available at https://github.com/yuanmingqi/ADEPT.
Chinese: ADEPT是一种创新的强化学习框架,通过多臂老虎机算法自适应管理数据使用,显著提升数据效率和泛化能力,在降低计算开销的同时,于多个基准测试中实现了卓越性能。
English: ADEPT is a novel framework that improves data efficiency and generalization in deep reinforcement learning by adaptively managing data usage with multi-armed bandit algorithms, reducing computational costs while enhancing performance across various benchmarks.
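The bandit component can be pictured with a plain UCB1 controller choosing how many update epochs to run on the current batch of sampled data; the arm definitions and reward signal here are illustrative stand-ins, not ADEPT's actual configuration.

```python
# UCB1 bandit choosing a data-reuse level per update (arms and reward are illustrative).
import math

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self):
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm                              # play every arm once first
        t = sum(self.counts)
        scores = [v + math.sqrt(2 * math.log(t) / c)
                  for v, c in zip(self.values, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

epochs_per_arm = [1, 2, 4]                              # how many epochs to reuse the sampled data
bandit = UCB1(len(epochs_per_arm))
arm = bandit.select()
# ... run epochs_per_arm[arm] update epochs, measure a reward (e.g., return improvement) ...
bandit.update(arm, reward=0.0)                          # placeholder reward value
```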
Authors:Wei Tang, Yin-Fang Yang, Zhaofei Wang, Weijia Zhang, Min-Ling Zhang
Abstract:
Multi-instance partial-label learning (MIPL) is an emerging learning framework where each training sample is represented as a multi-instance bag associated with a candidate label set. Existing MIPL algorithms often overlook the margins for attention scores and predicted probabilities, leading to suboptimal generalization performance. A critical issue with these algorithms is that the highest prediction probability of the classifier may appear on a non-candidate label. In this paper, we propose an algorithm named MIPLMA, i.e., Multi-Instance Partial-Label learning with Margin Adjustment, which adjusts the margins for attention scores and predicted probabilities. We introduce a margin-aware attention mechanism to dynamically adjust the margins for attention scores and propose a margin distribution loss to constrain the margins between the predicted probabilities on candidate and non-candidate label sets. Experimental results demonstrate the superior performance of MIPLMA over existing MIPL algorithms, as well as other well-established multi-instance learning algorithms and partial-label learning algorithms.
中文: 提出的MIPLMA算法通过动态调整注意力分数和预测概率的边界,在多示例部分标记学习中取得优越性能,实验证明其超越现有方法。
English: The proposed MIPLMA algorithm improves multi-instance partial-label learning by dynamically adjusting margins for attention scores and predicted probabilities, outperforming existing methods in experimental evaluations.
Authors:Yongduo Sui, Jie Sun, Shuyao Wang, Zemin Liu, Qing Cui, Longfei Li, Xiang Wang
Abstract:
Invariant learning demonstrates substantial potential for enhancing the generalization of graph neural networks (GNNs) with out-of-distribution (OOD) data. It aims to recognize stable features in graph data for classification, based on the premise that these features causally determine the target label, and their influence is invariant to changes in distribution. Along this line, most studies have attempted to pinpoint these stable features by emphasizing explicit substructures in the graph, such as masked or attentive subgraphs, and primarily enforcing the invariance principle in the semantic space, i.e., graph representations. However, we argue that focusing only on the semantic space may not accurately identify these stable features. To address this, we introduce the Unified Invariant Learning (UIL) framework for graph classification. It provides a unified perspective on invariant graph learning, emphasizing both structural and semantic invariance principles to identify more robust stable features. In the graph space, UIL adheres to the structural invariance principle by reducing the distance between graphons over a set of stable features across different environments. Simultaneously, to confirm semantic invariance, UIL underscores that the acquired graph representations should demonstrate exemplary performance across diverse environments. We present both theoretical and empirical evidence to confirm our method's ability to recognize superior stable features. Moreover, through a series of comprehensive experiments complemented by in-depth analyses, we demonstrate that UIL considerably enhances OOD generalization, surpassing the performance of leading baseline methods. Our codes are available at https://github.com/yongduosui/UIL.
Chinese: 统一不变学习(UIL)框架通过同时强化结构和语义不变性原则,提升了图神经网络在分布外数据上的泛化能力,在多项实验中超越了现有基线方法的性能表现。
English: The Unified Invariant Learning (UIL) framework enhances graph neural network generalization by jointly enforcing structural and semantic invariance principles, outperforming existing methods in out-of-distribution scenarios through robust feature identification.
Authors:Kevin Ta, Patrick Foley, Mattson Thieme, Abhishek Pandey, Prashant Shah
Abstract:
Generating unique molecules with biochemically desired properties to serve as viable drug candidates is a difficult task that requires specialized domain expertise. In recent years, diffusion models have shown promising results in accelerating the drug design process through AI-driven molecular generation. However, training these models requires massive amounts of data, which are often isolated in proprietary silos. OpenFL is a federated learning framework that enables privacy-preserving collaborative training across these decentralized data sites. In this work, we present a federated discrete denoising diffusion model that was trained using OpenFL. The federated model achieves comparable performance with a model trained on centralized data when evaluating the uniqueness and validity of the generated molecules. This demonstrates the utility of federated learning in the drug design process.
OpenFL is available at: https://github.com/securefederatedai/openfl
中文:利用OpenFL的联邦学习实现了隐私保护的扩散模型协同训练,在生成独特且有效的药物候选分子方面达到了与集中式训练相当的性能。
English: Federated learning with OpenFL enables privacy-preserving collaborative training of diffusion models for molecular generation, achieving comparable performance to centralized training in producing unique and valid drug candidates.
Authors:Shanmin Wang, Chengguang Liu, Qingshan Liu
Abstract:
Multimodal sentiment analysis (MSA) identifies individuals' sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent modality heterogeneity limits the effective capture of interactive sentiment features across modalities. In this paper, by introducing a Multi-Modality Collaborative Learning (MMCL) framework, we facilitate cross-modal interactions and capture enhanced and complementary features from modality-common and modality-specific representations, respectively. Specifically, we design a parameter-free decoupling module and separate uni-modality into modality-common and modality-specific components through semantics assessment of cross-modal elements. For modality-specific representations, inspired by the act-reward mechanism in reinforcement learning, we design policy models to adaptively mine complementary sentiment features under the guidance of a joint reward. For modality-common representations, intra-modal attention is employed to highlight crucial components, playing enhanced roles among modalities. Experimental results, including superiority evaluations on four databases, effectiveness verification of each module, and assessment of complementary features, demonstrate that MMCL successfully learns collaborative features across modalities and significantly improves performance. The code can be available at https://github.com/smwanghhh/MMCL.
中文: 本文提出了一种多模态协同学习框架,通过分别从模态共有和模态特定表示中自适应地挖掘互补与增强特征,有效提升了多模态情感分析的性能,并在多个数据库上验证了其优越性。
English: This paper introduces a Multi-Modality Collaborative Learning framework that enhances multimodal sentiment analysis by adaptively capturing complementary and enhanced features from modality-specific and modality-common representations, achieving superior performance across multiple databases.
Authors:Yonghao Zhao, Changtao Li, Chi Shu, Qingbin Wu, Hong Li, Chuan Xu, Tianrui Li, Ziqiang Wang, Zhipeng Luo, Yazhou He
Abstract:
Survival prognosis is crucial for medical informatics. Practitioners often confront small-sized clinical data, especially cancer patient cases, which can be insufficient to induce useful patterns for survival predictions. This study deals with small sample survival analysis by leveraging transfer learning, a useful machine learning technique that can enhance the target analysis with related knowledge pre-learned from other data. We propose and develop various transfer learning methods designed for common survival models. For parametric models such as DeepSurv, Cox-CC (Cox-based neural networks), and DeepHit (end-to-end deep learning model), we apply standard transfer learning techniques like pretraining and fine-tuning. For non-parametric models such as Random Survival Forest, we propose a new transfer survival forest (TSF) model that transfers tree structures from source tasks and fine-tunes them with target data. We evaluated the transfer learning methods on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. When enhanced by transfer learning, Cox-CC's $C^{td}$ value was boosted from 0.7868 to 0.8111, DeepHit's from 0.8085 to 0.8135, DeepSurv's from 0.7722 to 0.8043, and RSF's from 0.7940 to 0.8297 (the highest performance). All models demonstrated even more significant improvement when trained on as few as 50 patients. Conclusions: The current survival models used for cancer prognosis can be enhanced and improved by properly designed transfer learning techniques. The source code used in this study is available at https://github.com/YonghaoZhao722/TSF.
中文: 本研究通过将迁移学习技术应用于参数化和非参数化生存模型,显著提升了小样本临床数据的生存预测性能,并在结直肠癌预后中验证了其有效性。
English: This study enhances survival prognosis for small clinical datasets by applying transfer learning techniques to both parametric and non-parametric survival models, demonstrating significant performance improvements in colorectal cancer prediction.
Authors:Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark
Abstract:
We aim to understand physics of skill learning, i.e., how skills are learned in neural networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially, and notably, some skills kick off learning right after others complete learning, similar to the sequential fall of domino cards. To understand the Domino effect and relevant behaviors of skill learning, we take physicists' approach of abstraction and simplification. We propose three models with varying complexities -- the Geometry model, the Resource model, and the Domino model, trading between reality and simplicity. The Domino effect can be reproduced in the Geometry model, whose resource interpretation inspires the Resource model, which can be further simplified to the Domino model. These models present different levels of abstraction and simplification; each is useful to study some aspects of skill learning. The Geometry model provides interesting insights into neural scaling laws and optimizers; the Resource model sheds light on the learning dynamics of compositional tasks; the Domino model reveals the benefits of modularity. These models are not only conceptually interesting -- e.g., we show how Chinchilla scaling laws can emerge from the Geometry model, but also are useful in practice by inspiring algorithmic development -- e.g., we show how simple algorithmic changes, motivated by these toy models, can speed up the training of deep learning models.
中文: 本研究通过几何模型、资源模型和骨牌模型三个抽象模型,探讨了神经网络中技能按顺序学习的“骨牌效应”,在简洁性与现实性间取得平衡,揭示了神经缩放规律、学习动态和模块化优势,并启发了加速训练的算法改进。
English: This study explores the sequential learning of skills in neural networks, termed the Domino effect, through three abstract models—Geometry, Resource, and Domino—that balance simplicity and reality to provide insights into neural scaling laws, learning dynamics, and modularity benefits, while also inspiring practical algorithmic improvements for faster training.
Authors:Kan Jen Cheng, Tingle Li, Gopala Anumanchipalli
Abstract:
Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the original sound and another illustrates the desired transformation. The model learns to apply the same transformation to new input, allowing for the manipulation of sound textures. We construct a quadruplet dataset representing various editing tasks, and train a latent diffusion model in a self-supervised manner. We show through quantitative evaluations and perceptual studies that our model outperforms text-conditioned baselines and generalizes to real-world, out-of-distribution, and non-speech scenarios. Project page: https://berkeley-speech-group.github.io/audio-texture-analogy/
Authors:Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu
Abstract:
We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.
Authors:Geonwoo Seo
Abstract:
Wakeword detection plays a critical role in enabling AI assistants to listen to user voices and interact effectively. However, for languages other than English, there is a significant lack of pre-trained wakeword models. Additionally, systems that merely determine the presence of a wakeword can pose serious privacy concerns. In this paper, we propose an end-to-end approach that trains wakewords for non-English languages, particularly Korean, and uses this to develop a voice authentication model to protect user privacy. Our implementation employs the open-source platform OpenWakeWord, which performs wakeword detection using an FCN (Fully-Connected Network) architecture. Once a wakeword is detected, our custom-developed code calculates cosine similarity for robust user authentication. Experimental results demonstrate the effectiveness of our approach, achieving Equal Error Rates (EER) of 16.79% for wakeword detection and 6.6% for voice authentication. These findings highlight the model's potential in providing secure and accurate wakeword detection and authentication for Korean users.
中文: 本文提出了一种端到端的方法,针对韩语等非英语语言训练唤醒词检测,并结合语音认证以解决隐私问题,在唤醒词检测和用户验证方面均取得了良好效果。
English: This paper introduces an end-to-end approach for training non-English wakeword detection, specifically for Korean, and integrates it with voice authentication to address privacy concerns, achieving effective performance in both wakeword detection and user verification.
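The authentication step described above boils down to comparing a stored speaker embedding with a query embedding via cosine similarity. A minimal sketch follows; the threshold value is an assumption and would in practice be tuned on held-out data (e.g., at the EER operating point).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def authenticate(enrolled_embedding, query_embedding, threshold=0.7):
    # threshold is illustrative; set it at the desired operating point on validation data
    return cosine_similarity(enrolled_embedding, query_embedding) >= threshold
```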
Authors:Sean Man, Guy Ohayon, Ron Raphaeli, Michael Elad
Abstract:
Real-world image restoration deals with the recovery of images suffering from an unknown degradation. This task is typically addressed while being given only degraded images, without their corresponding ground-truth versions. In this hard setting, designing and evaluating restoration algorithms becomes highly challenging. This paper offers a suite of tools that can serve both the design and assessment of real-world image restoration algorithms. Our work starts by proposing a trained model that predicts the chain of degradations a given real-world measured input has gone through. We show how this estimator can be used to approximate the consistency -- the match between the measurements and any proposed recovered image. We also use this estimator as a guiding force for the design of a simple and highly-effective plug-and-play real-world image restoration algorithm, leveraging a pre-trained diffusion-based image prior. Furthermore, this work proposes no-reference proxy measures of MSE and LPIPS, which, without access to the ground-truth images, allow ranking of real-world image restoration algorithms according to their (approximate) MSE and LPIPS. The proposed suite provides a versatile, first of its kind framework for evaluating and comparing blind image restoration algorithms in real-world scenarios.
Authors:Hamid Nasiri, Peter Garraghan
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and from differences between their learning patterns and those of full fine-tuning. To overcome these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation (EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude and directional components. By freezing low-rank matrices, initializing them by singular value decomposition, and introducing a small trainable matrix between them, EDoRA achieves a substantial reduction in trainable parameters while maintaining learning capacity. Experimental results on the GLUE benchmark demonstrate that EDoRA achieves competitive or superior performance compared to state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable parameters. This makes EDoRA a highly efficient solution for adapting LLMs to diverse tasks under memory-constrained settings. Code is available at https://github.com/Hamid-Nasiri/EDoRA .
Chinese: 提出的EDoRA方法通过分解权重并采用低秩适应,有效减少可训练参数,在性能上与LoRA等现有方法相当甚至更优,同时参数数量减少高达30倍。
English: The proposed EDoRA method efficiently reduces trainable parameters by decomposing weights and using low-rank adaptations, achieving competitive performance with up to 30x fewer parameters than existing methods like LoRA.
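The decomposition described in the abstract can be sketched as a linear layer whose frozen low-rank factors come from a truncated SVD of the pre-trained weight, with a tiny trainable core between them. This is one reading of the abstract under stated assumptions, not the released EDoRA implementation.

```python
import torch
import torch.nn as nn

class EDoRALinearSketch(nn.Module):
    """Sketch: split W into per-row magnitude and direction; update the direction with
    frozen SVD factors B, A and a small trainable r x r core. Details may differ from EDoRA."""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        weight = weight.detach()
        norm = weight.norm(dim=1, keepdim=True)
        self.magnitude = nn.Parameter(norm.clone())               # trainable magnitude
        self.register_buffer("direction", weight / norm)          # frozen direction
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("B", U[:, :rank] * S[:rank])         # frozen (out x r)
        self.register_buffer("A", Vh[:rank, :])                   # frozen (r x in)
        self.core = nn.Parameter(torch.zeros(rank, rank))         # tiny trainable matrix

    def forward(self, x):
        delta = self.B @ self.core @ self.A                       # low-rank directional update
        W = self.direction + delta
        W = self.magnitude * (W / W.norm(dim=1, keepdim=True))    # re-normalize, re-scale
        return x @ W.t()
```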
Authors:Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee
Abstract:
Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at \url{https://github.com/plnguyen2908/LASER_ASD}.
中文摘要:提出的LASER模型通过整合唇部关键点和辅助一致性损失,显著提升了主动说话人检测的鲁棒性,在多种数据集中尤其在视听不同步场景下表现卓越。
English Summary: The proposed LASER model enhances Active Speaker Detection by integrating lip landmarks and an auxiliary consistency loss, achieving superior robustness and performance in handling audio-visual desynchronization across diverse datasets.
Authors:Riqiang Gao, Mamadou Diallo, Han Liu, Anthony Magliari, Jonathan Sackett, Wilko Verbakel, Sandra Meyers, Rafe Mcbeth, Masoud Zarepisheh, Simon Arberet, Martin Kraus, Florin C. Ghesu, Ali Kamen
Abstract:
Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances in artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution for generating high-quality treatment plans. It is designed to produce substantial volumes of consistently high-quality plans, overcoming a key obstacle to AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software such as Varian Eclipse. Furthermore, we propose a novel approach for determining optimization parameters that reproduce 3D dose distributions, i.e., a method to convert dose predictions into deliverable treatment plans constrained by machine limitations. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans of the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.
中文: AIRTP系统利用人工智能自动化放疗规划,通过可扩展的流程生成高质量且临床可比的治疗方案,同时发布大规模公共数据集以解决数据稀缺问题,推动研究发展。
English: The AIRTP system automates radiotherapy planning using AI to generate high-quality, clinically comparable treatment plans efficiently, addressing data scarcity with a scalable pipeline and releasing a large public dataset for research.
Authors:Pouya Hamadanian, Sadjad Fouladi
Abstract:
We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of attention, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by $5.9\times$ and reduces the cost of generation by $2.8\times$, compared to paged attention baselines. For long sequence lengths, it achieves a $16.3\times$ throughput improvement at $2.4\times$ lower cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-focused applications such as batch processing. The prototype is publicly available at https://github.com/microsoft/glinthawk.
中文:Glinthawk是一种离线大语言模型推理架构,通过将注意力机制与模型权重分离,实现了可扩展的键值缓存管理和高效加速器利用,从而显著提升吞吐量并降低成本。
English: Glinthawk is an offline LLM inference architecture that enhances throughput and reduces costs by separating attention mechanisms from model weights, allowing scalable key-value cache management and efficient accelerator use.
Authors:Ron Raphaeli, Sean Man, Michael Elad
Abstract:
Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.
Authors:Saeid Asgari Taghanaki, Joao Monteiro
Abstract:
Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT's performance is predictive of MMLU-Pro's, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at https://github.com/asgsaeid/EQT.
中文: 大型语言模型能生成详细解释但缺乏真正理解,通过解释-查询-测试方法揭示其解释能力与推理能力之间存在差距。
English: Large language models can generate detailed explanations but lack true comprehension, as shown by the Explain-Query-Test method, which reveals a gap between their explanatory and reasoning abilities.
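The three-stage EQT loop lends itself to a short sketch. Here `llm` is a hypothetical text-in/text-out callable, the "Q:/A:" format and exact-match grading are simplifications of the paper's setup, and the released repository should be consulted for the actual pipeline.

```python
def parse_qa_pairs(text):
    # Naive parser assuming "Q: ..." / "A: ..." lines; real model outputs need sturdier parsing.
    pairs, question = [], None
    for line in (l.strip() for l in text.splitlines() if l.strip()):
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

def explain_query_test(llm, topic, n_questions=5):
    """Explain-Query-Test sketch: explain a topic, generate QA pairs, answer own questions."""
    excerpt = llm(f"Write a detailed, self-contained excerpt about: {topic}")
    qa_text = llm(
        f"From the excerpt below, write {n_questions} question-answer pairs "
        f"formatted as 'Q: ...' and 'A: ...'.\n\n{excerpt}"
    )
    pairs = parse_qa_pairs(qa_text)
    correct = sum(
        llm(f"Answer concisely: {q}").strip().lower() == a.strip().lower()  # simplified grading
        for q, a in pairs
    )
    return correct / max(len(pairs), 1)   # self-evaluation accuracy for this topic
```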
Authors:Akash Kundu
Abstract:
The Sachdev-Ye-Kitaev (SYK) model, known for its strong quantum correlations and chaotic behavior, serves as a key platform for quantum gravity studies. However, variationally preparing thermal states on near-term quantum processors for large systems ($N>12$, where $N$ is the number of Majorana fermions) presents a significant challenge due to the rapid growth in the complexity of parameterized quantum circuits. This paper addresses this challenge by integrating reinforcement learning (RL) with convolutional neural networks, employing an iterative approach to optimize the quantum circuit and its parameters. The refinement process is guided by a composite reward signal derived from entropy and the expectation values of the SYK Hamiltonian. This approach reduces the number of CNOT gates by two orders of magnitude for systems $N\geq12$ compared to traditional methods like first-order Trotterization. We demonstrate the effectiveness of the RL framework in both noiseless and noisy quantum hardware environments, maintaining high accuracy in thermal state preparation. This work advances a scalable, RL-based framework with applications for quantum gravity studies and out-of-time-ordered thermal correlators computation in quantum many-body systems on near-term quantum hardware. The code is available at https://github.com/Aqasch/solving_SYK_model_with_RL.
中文: 本文提出了一种结合强化学习与卷积神经网络的方法,用于在近期量子硬件上高效制备SYK模型的热态,大幅降低了电路复杂度并保持了高精度。
English: This paper introduces a reinforcement learning framework combined with convolutional neural networks to efficiently prepare thermal states of the Sachdev-Ye-Kitaev model on near-term quantum hardware, significantly reducing circuit complexity while maintaining high accuracy.
Authors:Jing Liu, Zhenchao Ma, Zepu Wang, Chenxuanyin Zou, Jiayang Ren, Zehua Wang, Liang Song, Bo Hu, Yang Liu, Victor C. M. Leung
Abstract:
Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we review recent advances in DMAD research. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at https://github.com/fdjingliu/DMAD.
中文摘要:扩散模型已成为跨领域异常检测的强大工具,本综述对其方法进行分类,并探讨了计算效率与模型可解释性等关键挑战。
English Summary: Diffusion models have become a powerful tool for anomaly detection across various domains, with this survey categorizing methods and exploring challenges like computational efficiency and model interpretability.
Authors:Chung-ju Huang, Yuanpeng He, Xiao Han, Wenpin Jiao, Zhi Jin, Leye Wang
Abstract:
Cross-hospital collaboration has the potential to address disparities in medical resources across different regions. However, strict privacy regulations prohibit the direct sharing of sensitive patient information between hospitals. Vertical federated learning (VFL) offers a novel privacy-preserving machine learning paradigm that maximizes data utility across multiple hospitals. Traditional VFL methods, however, primarily benefit patients with overlapping data, leaving vulnerable non-overlapping patients without guaranteed improvements in medical prediction services. While some knowledge transfer techniques can enhance the prediction performance for non-overlapping patients, they fall short in addressing scenarios where overlapping and non-overlapping patients belong to different domains, resulting in challenges such as feature heterogeneity and label heterogeneity. To address these issues, we propose a novel unified vertical federated knowledge transfer framework (Unitrans). Our framework consists of three key steps. First, we extract the federated representation of overlapping patients by employing an effective vertical federated representation learning method to model multi-party joint features online. Next, each hospital learns a local knowledge transfer module offline, enabling the transfer of knowledge from the federated representation of overlapping patients to the enriched representation of local non-overlapping patients in a domain-adaptive manner. Finally, hospitals utilize these enriched local representations to enhance performance across various downstream medical prediction tasks. Experiments on real-world medical datasets validate the framework's dual effectiveness in both intra-domain and cross-domain knowledge transfer. The code of Unitrans is available at \url{https://github.com/Chung-ju/Unitrans}.
中文: 提出的Unitrans框架通过联邦表征学习和领域自适应知识迁移,解决了垂直联邦学习中重叠与非重叠患者间的特征与标签异质性问题,并在真实医疗数据集上验证了其有效性。
English: The proposed Unitrans framework addresses feature and label heterogeneity in vertical federated learning by enabling domain-adaptive knowledge transfer from overlapping to non-overlapping patients through federated representation learning, validated on real medical datasets.
Authors:Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang
Abstract:
Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at \url{https://github.com/wpp13749/block_flow}.
中文摘要:本研究提出块匹配方法,通过将数据按标签分区并与参数化先验对齐来降低流匹配模型中的曲率,借助方差调节策略平衡生成多样性与数值误差,实现了优异性能。
English Summary: The study introduces block matching to reduce curvature in flow-matching models by partitioning data into labeled blocks and aligning them with a parameterized prior, achieving optimal generation through variance-controlled regularization.
Authors:Ahmad Mousavi, Ramin Zandvakili
Abstract:
Kernel-free quadratic surface support vector machines (QSVMs) have recently gained traction due to their flexibility in modeling nonlinear decision boundaries without relying on kernel functions. However, the introduction of a full quadratic classifier significantly increases the number of model parameters, scaling quadratically with data dimensionality, which often leads to overfitting and makes interpretation difficult. To address these challenges, we propose a sparse variant of the QSVM by enforcing a cardinality constraint on the model parameters. While enhancing generalization and promoting sparsity, leveraging the $\ell_0$-norm inevitably incurs additional computational complexity. To tackle this, we develop a penalty decomposition algorithm capable of producing solutions that provably satisfy the first-order Lu-Zhang optimality conditions. Our approach accommodates both hinge and quadratic loss functions. In both cases, we demonstrate that the subproblems arising within the algorithm either admit closed-form solutions or can be solved efficiently through dual formulations, which contributes to the method's overall effectiveness. We also analyze the convergence behavior of the algorithm under both loss settings. Finally, we validate our approach on several real-world datasets, demonstrating its ability to reduce overfitting while maintaining strong classification performance. The complete implementation and experimental code are publicly available at https://github.com/raminzandvakili/L0-QSVM.
中文: 作者提出一种基于ℓ₀范数约束的稀疏二次曲面支持向量机,通过惩罚分解算法有效降低过拟合并提升模型可解释性,在多个真实数据集上验证了该方法在保持分类性能的同时具有收敛性保证。
English: The authors propose a sparse quadratic surface support vector machine using an ℓ₀-norm constraint to reduce overfitting and improve interpretability, developing an efficient penalty decomposition algorithm with proven convergence that maintains strong classification performance on real datasets.
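The cardinality constraint at the heart of the method corresponds to a simple projection step: keep the k largest-magnitude coefficients and zero the rest. The sketch below shows only this generic projection, not the QSVM-specific subproblems of the penalty decomposition algorithm.

```python
import numpy as np

def hard_threshold(w: np.ndarray, k: int) -> np.ndarray:
    """Projection onto {w : ||w||_0 <= k}: keep the k largest-magnitude entries.
    This is the cardinality step a penalty-decomposition iteration alternates
    with the loss subproblem."""
    out = np.zeros_like(w)
    if k > 0:
        idx = np.argpartition(np.abs(w), -k)[-k:]   # indices of k largest magnitudes
        out[idx] = w[idx]
    return out
```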
Authors:Tal Zeevi, Lawrence H. Staib, John A. Onofrey
Abstract:
Monte-Carlo (MC) Dropout provides a practical solution for estimating predictive distributions in deterministic neural networks. Traditional dropout, applied within the signal space, may fail to account for frequency-related noise common in medical imaging, leading to biased predictive estimates. We propose a novel approach that extends Dropout to the frequency domain, allowing stochastic attenuation of signal frequencies during inference. This creates diverse global textural variations in feature maps while preserving structural integrity -- a factor we hypothesize, and empirically show, contributes to accurately estimating uncertainties in semantic segmentation. We evaluated traditional MC-Dropout and MC-frequency Dropout in three segmentation tasks involving different imaging modalities: (i) prostate zones in biparametric MRI, (ii) liver tumors in contrast-enhanced CT, and (iii) lungs in chest X-ray scans. Our results show that MC-Frequency Dropout improves calibration, convergence, and semantic uncertainty, thereby improving prediction scrutiny and boundary delineation, and has the potential to enhance medical decision-making.
中文:MC频率Dropout将dropout扩展至频域,通过增强校准、收敛和边界划分,在多种医学影像分割任务中改进了不确定性估计。
English: MC-Frequency Dropout extends dropout to the frequency domain, improving uncertainty estimation in medical image segmentation by enhancing calibration, convergence, and boundary delineation across various imaging modalities.
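A minimal sketch of dropout in the frequency domain is shown below: zero random frequency components of a feature map and transform back. Where exactly it is applied in the network and whether a 1/(1-p) rescaling is used are assumptions, not details taken from the paper.

```python
import torch

def frequency_dropout(feature_map: torch.Tensor, drop_prob: float = 0.1) -> torch.Tensor:
    """Zero random frequency components of a (B, C, H, W) feature map and invert the FFT."""
    spec = torch.fft.rfft2(feature_map)                              # complex spectrum
    mask = (torch.rand_like(spec.real) > drop_prob).to(spec.dtype)   # Bernoulli mask per frequency
    return torch.fft.irfft2(spec * mask, s=feature_map.shape[-2:])

# Monte-Carlo estimate: keep the stochastic masking active at inference, run T passes,
# e.g. preds = torch.stack([model_with_freq_dropout(x) for _ in range(T)]),
# then use preds.mean(0) as the prediction and preds.var(0) as a per-voxel uncertainty map.
```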
Authors:Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma
Abstract:
The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performances across 11 datasets in the standard few-shot adaptation benchmark.
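The kernel view underlying this analysis is ordinary kernel ridge regression, which has a closed form. The sketch below shows the generic estimator only; it is not ProKeR's proximal RKHS regularizer with CLIP as the base learner.

```python
import numpy as np

def kernel_ridge_fit_predict(K_train, Y_train, K_test_train, lam=1.0):
    """Closed-form kernel ridge regression: alpha = (K + lam*I)^{-1} Y."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * np.eye(n), Y_train)   # (n, C) dual coefficients
    return K_test_train @ alpha                                    # (m, C) class scores
```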
Authors:Elad Levi, Ilan Kadar
Abstract:
Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
中文摘要:IntellAgent是一个开源的多智能体框架,通过模拟真实的多策略场景并生成精细化诊断,自动创建合成基准来全面评估对话式人工智能系统。
English Summary: IntellAgent is an open-source multi-agent framework that automates the creation of synthetic benchmarks to comprehensively evaluate conversational AI systems by simulating realistic multi-policy scenarios and providing fine-grained diagnostics.
Authors:Qi Cheems Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji
Abstract:
Foundation models have revolutionized general-purpose problem-solving, offering rapid task adaptation through pretraining, meta-training, and finetuning. Recent crucial advances in these paradigms reveal the importance of challenging task prioritized sampling to enhance adaptation robustness under distribution shifts. However, ranking task difficulties over iteration as a preliminary step typically requires exhaustive task evaluation, which is practically unaffordable in computation and data-annotation. This study provides a novel perspective to illuminate the possibility of leveraging the dual importance of adaptation robustness and learning efficiency, particularly in scenarios where task evaluation is risky or costly, such as iterative agent-environment interactions for robotic policy evaluation or computationally intensive inference steps for finetuning foundation models. Firstly, we introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk landscape, providing a theoretical foundation for robust active task sampling. MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference. The resulting risk learner amortizes the costly evaluation of task adaptation performance and provably approximates task difficulty rankings. MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings. Empirically, we conduct extensive experiments in pattern recognition using foundation models and sequential decision-making. Our results demonstrate that MPTS significantly enhances adaptation robustness for tail or out-of-distribution (OOD) tasks and improves learning efficiency compared to state-of-the-art (SOTA) methods. The code is available at the project site https://github.com/thu-rllab/MPTS.
Chinese: 本研究提出了模型预测任务采样(MPTS)框架,通过预测任务难度来提升基础模型的适应鲁棒性和学习效率,无需全面评估,特别适用于处理尾部或分布外任务。
English: This study introduces Model Predictive Task Sampling (MPTS), a framework that efficiently predicts task difficulty and enhances adaptation robustness and learning efficiency for foundation models without exhaustive evaluations, particularly benefiting tail and out-of-distribution tasks.
Authors:Haichao Wei, Yunxiang Ren, Zhoutong Fu, Aman Lunia, Yi-Lin Chen, Alice Leung, Ya Xu
Abstract:
Large Language Models (LLMs) demand significant computational resources, making it essential to enhance their capabilities without retraining from scratch. A key challenge in this domain is \textit{catastrophic forgetting} (CF), which hampers performance during Continuous Pre-training (CPT) and Continuous Supervised Fine-Tuning (CSFT). We propose \textbf{Control LLM}, a novel approach that leverages parallel pre-trained and expanded transformer blocks, aligning their hidden states through interpolation strategies. This method effectively preserves performance on existing tasks while seamlessly integrating new knowledge.
Extensive experiments demonstrate the effectiveness of Control LLM in both CPT and CSFT. On Llama3.1-8B-Instruct, it achieves significant improvements in mathematical reasoning ($+14.4\%$ on Math-Hard) and coding performance ($+10\%$ on MBPP-PLUS). On Llama3.1-8B, it enhances multilingual capabilities ($+10.6\%$ on C-Eval, $+6.8\%$ on CMMLU, and $+30.2\%$ on CMMLU-0shot-CoT). It surpasses existing methods and achieves SOTA among open-source models tuned from the same base model, using substantially less data and compute. Crucially, these gains are realized while preserving strong original capabilities, with minimal degradation ($<4.3\%$ on MMLU) compared to $>35\%$ in open-source Math and Coding models. This approach has been successfully deployed in LinkedIn's GenAI-powered job seeker and Ads unit products.
To support further research, we release the training and evaluation code (https://github.com/linkedin/ControlLLM) along with models trained on public datasets (https://huggingface.co/ControlLLM) to the community.
中文: Control LLM通过并行Transformer模块和插值策略的新方法,在持续训练中有效防止灾难性遗忘,在数学推理、编程和多语言任务上实现最优性能,同时保持原有能力且性能下降极小。
English: Control LLM is a novel method that uses parallel transformer blocks and interpolation to prevent catastrophic forgetting during continuous training, achieving state-of-the-art performance in mathematical reasoning, coding, and multilingual tasks while preserving original capabilities with minimal degradation.
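One plausible reading of the parallel-block design is sketched below: a frozen pre-trained block and a trainable expanded copy run in parallel and their hidden states are interpolated. The fixed linear rule with a single alpha is an assumption; the paper explores several interpolation/alignment strategies.

```python
import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    """Interpolate the outputs of a frozen pre-trained block and a trainable expanded block."""
    def __init__(self, pretrained_block: nn.Module, expanded_block: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.frozen = pretrained_block.requires_grad_(False)   # preserves existing skills
        self.expanded = expanded_block                          # absorbs new knowledge
        self.alpha = alpha

    def forward(self, hidden_states):
        return self.alpha * self.frozen(hidden_states) + (1.0 - self.alpha) * self.expanded(hidden_states)
```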
Authors:Weiyu Chen, Baijiong Lin, Xiaoyuan Zhang, Xi Lin, Han Zhao, Qingfu Zhang, James T. Kwok
Abstract:
Many modern deep learning applications require balancing multiple objectives that are often conflicting. Examples include multi-task learning, fairness-aware learning, and the alignment of Large Language Models (LLMs). This leads to multi-objective deep learning, which tries to find optimal trade-offs or Pareto-optimal solutions by adapting mathematical principles from the field of Multi-Objective Optimization (MOO). However, directly applying gradient-based MOO techniques to deep neural networks presents unique challenges, including high computational costs, optimization instability, and the difficulty of effectively incorporating user preferences. This paper provides a comprehensive survey of gradient-based techniques for multi-objective deep learning. We systematically categorize existing algorithms based on their outputs: (i) methods that find a single, well-balanced solution, (ii) methods that generate a finite set of diverse Pareto-optimal solutions, and (iii) methods that learn a continuous Pareto set of solutions. In addition to this taxonomy, the survey covers theoretical analyses, key applications, practical resources, and highlights open challenges and promising directions for future research. A comprehensive list of multi-objective deep learning algorithms is available at https://github.com/Baijiong-Lin/Awesome-Multi-Objective-Deep-Learning.
中文摘要:本文系统综述了基于梯度的多目标深度学习优化方法,按解决方案类型分类并探讨了计算成本高、优化不稳定等关键挑战。
English Summary: This paper surveys gradient-based multi-objective optimization techniques for deep learning, categorizing them by solution types and addressing challenges like computational costs and optimization instability.
Authors:Young Seok Jeon, Hongfei Yang, Huazhu Fu, Mengling Feng
Abstract:
3D models surpass 2D models in CT/MRI segmentation by effectively capturing inter-slice relationships. However, the added depth dimension substantially increases memory consumption. While patch-based training alleviates memory constraints, it significantly slows down inference due to the sliding window (SW) approach. We propose No-More-Sliding-Window (NMSW), a novel end-to-end trainable framework that improves the inference efficiency of a generic 3D segmentation backbone by eliminating the need for SW. NMSW employs a differentiable Top-k module to selectively sample only the most relevant patches, thereby minimizing redundant computations. When patch-level predictions are insufficient, the framework intelligently leverages coarse global predictions to refine results. Evaluated across 3 tasks using 3 segmentation backbones, NMSW achieves competitive accuracy compared to SW inference while significantly reducing computational complexity by 91% (88.0 to 8.00 TMACs). Moreover, it delivers a 9.1x faster inference on the H100 GPU (99.0 to 8.3 sec) and an 11.1x faster inference on the Xeon Gold CPU (2110 to 189 sec). NMSW is model-agnostic, further boosting efficiency when integrated with any existing efficient segmentation backbone. The code is available at https://github.com/Youngseok0001/open_nmsw.
中文摘要:提出的NMSW框架通过选择性采样关键切片并融合全局预测,在保持精度的同时消除了3D医学图像分割中的滑动窗口操作,实现了91%的计算量削减和超过9倍的推理加速。
English Summary: The proposed NMSW framework eliminates sliding window inference in 3D medical image segmentation by selectively sampling relevant patches and leveraging global predictions, achieving 91% computational reduction and over 9x faster inference while maintaining competitive accuracy.
Authors:Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, Fan Zhou
Abstract:
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model's multimodal aligning performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by the existing model cannot adequately decode multimodal features. To address the above issues, we proposed the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distilled the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we designed a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we fed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluated LD-DETR on four public benchmarks and conducted extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the State-Of-The-Art models on QVHighlight, Charades-STA and TACoS datasets. Our code is available at https://github.com/qingchen239/ld-detr.
Chinese: LD-DETR模型通过将相似度矩阵蒸馏为恒等矩阵缓解语义重叠问题,利用卷积层高效提取局部特征,并通过Transformer解码器自反馈充分解码多模态信息,在多个公开数据集上实现了最先进的视频片段检索和高亮检测性能。
English: The LD-DETR model addresses key challenges in video moment retrieval and highlight detection by mitigating overlapping semantic information through similarity matrix distillation, enhancing local feature extraction with convolutional layers, and improving multimodal decoding via Transformer Decoder feedback, achieving state-of-the-art performance on multiple benchmarks.
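The "distill the similarity matrix into the identity matrix" step can be read as a loss that pushes the row-normalized batch similarity toward the identity, so that overlapping semantics between different samples contribute less to alignment. The temperature and the MSE form below are assumptions, not LD-DETR's exact loss.

```python
import torch
import torch.nn.functional as F

def identity_distillation_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Push the (B, B) video-text similarity matrix toward the identity matrix."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = (v @ t.t()) / temperature                      # batch similarity matrix
    target = torch.eye(sim.size(0), device=sim.device)
    return F.mse_loss(sim.softmax(dim=-1), target)
```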
Authors:Yaniv Shulman
Abstract:
Local Polynomial Regression (LPR) is a widely used nonparametric method for modeling complex relationships due to its flexibility and simplicity. It estimates a regression function by fitting low-degree polynomials to localized subsets of the data, weighted by proximity. However, traditional LPR is sensitive to outliers and high-leverage points, which can significantly affect estimation accuracy. This paper revisits the kernel function used to compute regression weights and proposes a novel framework that incorporates both predictor and response variables in the weighting mechanism. The focus of this work is a conditional density kernel that robustly estimates weights by mitigating the influence of outliers through localized density estimation. A related joint density kernel is also discussed in an appendix. The proposed method is implemented in Python and is publicly available at https://github.com/yaniv-shulman/rsklpr, demonstrating competitive performance in synthetic benchmark experiments. Compared to standard LPR, the proposed approach consistently improves robustness and accuracy, especially in heteroscedastic and noisy environments, without requiring multiple iterations. This advancement provides a promising extension to traditional LPR, opening new possibilities for robust regression applications.
中文: 本文提出了局部多项式回归的鲁棒扩展方法,通过引入条件密度核函数进行局部密度估计来降低异常值敏感性,在噪声环境中无需迭代即可提升估计精度。
English: This paper introduces a robust extension to Local Polynomial Regression by incorporating a conditional density kernel that mitigates outlier sensitivity through localized density estimation, improving accuracy in noisy environments without iterative steps.
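For reference, the baseline being made robust is standard local polynomial regression with proximity weights, sketched below for the 1-D local-linear case. The paper replaces these predictor-only weights with a conditional-density kernel over (x, y) to down-weight outliers; that robust kernel is not reproduced here.

```python
import numpy as np

def local_linear_fit(x0: float, x: np.ndarray, y: np.ndarray, bandwidth: float) -> float:
    """Plain 1-D local linear regression at x0 with a Gaussian proximity kernel."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)     # proximity weights
    X = np.column_stack([np.ones_like(x), x - x0])     # local design matrix
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)         # weighted least squares
    return float(beta[0])                              # fitted value at x0
```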
Authors:Andrey Risukhin, Kavel Rao, Ben Caffee, Alan Fan
Abstract:
Autonomous agents' interactions with humans are increasingly focused on adapting to their changing preferences in order to improve assistance in real-world tasks. Effective agents must learn to accurately infer human goals, which are often hidden, to collaborate well. However, existing Multi-Agent Reinforcement Learning (MARL) environments lack the necessary attributes required to rigorously evaluate these agents' learning capabilities. To this end, we introduce ColorGrid, a novel MARL environment with customizable non-stationarity, asymmetry, and reward structure. We investigate the performance of Independent Proximal Policy Optimization (IPPO), a state-of-the-art (SOTA) MARL algorithm, in ColorGrid and find through extensive ablations that, particularly with simultaneous non-stationary and asymmetric goals between a ``leader'' agent representing a human and a ``follower'' assistant agent, ColorGrid is unsolved by IPPO. To support benchmarking future MARL algorithms, we release our environment code, model checkpoints, and trajectory visualizations at https://github.com/andreyrisukhin/ColorGrid.
中文摘要:本文提出ColorGrid这一新型多智能体强化学习环境,旨在评估智能体适应非平稳和不对称人类偏好的能力,实验表明当前最先进的IPPO算法无法解决这些复杂场景。
English Summary: The paper introduces ColorGrid, a novel Multi-Agent Reinforcement Learning environment designed to evaluate agents' ability to adapt to non-stationary and asymmetric human preferences, where current state-of-the-art algorithms like IPPO fail to solve these complex scenarios.
Authors:Taehee Jeong
Abstract:
Retrieval-augmented generation (RAG) is a promising technique that has shown great potential in addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: they can contain outdated information due to their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucinations. RAG aims to mitigate these issues by leveraging a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, one of the challenges of using high-dimensional embeddings is that they require a significant amount of memory to store. This can be a major issue, especially when dealing with large databases of documents. To alleviate this problem, we propose the use of 4-bit quantization to store the embedding vectors. This involves reducing the precision of the vectors from 32-bit floating-point numbers to 4-bit integers, which can significantly reduce the memory requirements. Our approach has several benefits. Firstly, it significantly reduces the memory storage requirements of the high-dimensional vector database, making it more feasible to deploy RAG systems in resource-constrained environments. Secondly, it speeds up the searching process, as the reduced precision of the vectors allows for faster computation. Our code is available at https://github.com/taeheej/4bit-Quantization-in-Vector-Embedding-for-RAG
中文: 本文提出在检索增强生成系统中使用4位量化存储嵌入向量,显著降低内存需求并加速检索过程,同时有效缓解大语言模型的过时信息和幻觉问题。
English: This paper introduces 4-bit quantization for embedding vectors in retrieval-augmented generation systems, significantly reducing memory storage requirements and accelerating search processes while addressing limitations of large language models.
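The memory saving comes from mapping each float32 coordinate to a 4-bit code and packing two codes per byte, an 8x reduction. The uniform min-max scheme below is a generic sketch of the idea, not necessarily the exact scheme in the linked repository; it assumes an even-length vector.

```python
import numpy as np

def quantize_4bit(vec: np.ndarray):
    """Uniform 4-bit quantization of a float32 embedding, two codes packed per byte."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) / 15.0 or 1.0
    codes = np.clip(np.round((vec - lo) / scale), 0, 15).astype(np.uint8)
    packed = (codes[0::2] << 4) | codes[1::2]            # 8x smaller than float32
    return packed, lo, scale

def dequantize_4bit(packed: np.ndarray, lo: float, scale: float) -> np.ndarray:
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2], codes[1::2] = packed >> 4, packed & 0x0F
    return lo + codes.astype(np.float32) * scale
```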
Authors:Daniel Severo, Giuseppe Ottaviano, Matthew Muckley, Karen Ullrich, Matthijs Douze
Abstract:
Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor of 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a 30% reduction of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach is available at https://github.com/facebookresearch/vector_db_id_compression.
中文: 本研究提出针对向量数据库索引的无损压缩技术,通过压缩向量ID和量化编码等辅助数据,在十亿级数据集上实现高达7倍的压缩比和30%的索引规模缩减,且不影响检索性能。
English: The study introduces lossless compression techniques for vector database indexes that reduce storage by compressing auxiliary data like vector IDs and quantized codes, achieving up to a 7-fold compression with no performance loss and a 30% index size reduction on billion-scale datasets.
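A toy way to see why order-irrelevance helps: since ids inside an inverted list can be stored in any order, they can be sorted, delta-encoded, and variable-byte packed. The paper's actual schemes use asymmetric numeral systems and wavelet trees, which compress better than this illustration.

```python
def delta_encode_ids(ids):
    """Sort ids, delta-encode them, and variable-byte pack the gaps (7 payload bits per byte)."""
    if not ids:
        return b""
    ids = sorted(ids)
    gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    out = bytearray()
    for gap in gaps:
        while gap >= 0x80:                     # continuation bit set while more bits remain
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)
```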
Authors:Aitor Belenguer, Jose A. Pascual, Javier Navaridas
Abstract:
Fully decentralized learning algorithms are still in an early stage of development. Creating modular Gossip Learning strategies is not trivial due to the convergence challenges and Byzantine faults intrinsic to systems of a decentralized nature. Our contribution provides a novel means to simulate custom Gossip Learning systems by leveraging the state-of-the-art Flower Framework. Specifically, we introduce GLow, which allows researchers to train devices and assess their scalability and convergence across custom network topologies before making a physical deployment. The Flower Framework is selected because it is a simulation-capable library with a very active Federated Learning research community. However, Flower exclusively includes vanilla Federated Learning strategies and, thus, is not originally designed to perform simulations without a centralized authority. GLow is presented to fill this gap and make simulation of Gossip Learning systems possible. Results achieved by GLow on the MNIST and CIFAR10 datasets show accuracies over 0.98 and 0.75, respectively. More importantly, GLow performs similarly in terms of accuracy and convergence to its analogous Centralized and Federated approaches in all designed experiments.
中文摘要:完全去中心化学习面临收敛和拜占庭故障的挑战,而基于Flower框架开发的GLow工具实现了Gossip学习系统的可扩展模拟,在MNIST和CIFAR10数据集上达到了与中心化和联邦学习方法相媲美的高准确率。
English Summary: Fully decentralized learning faces challenges in convergence and Byzantine faults, but GLow, a novel tool built on the Flower Framework, enables scalable simulation of Gossip Learning systems, achieving high accuracy comparable to centralized and federated approaches on MNIST and CIFAR10 datasets.
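The core primitive a gossip-learning simulation exercises is the peer-to-peer exchange: two neighboring devices average their model parameters without a central server. The PyTorch sketch below is a generic illustration of that exchange, not GLow's actual Flower-based strategy API.

```python
import torch

@torch.no_grad()
def gossip_exchange(model_a, model_b):
    """One gossip round: both peers replace their parameters with the pairwise average."""
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        avg = (p_a + p_b) / 2
        p_a.copy_(avg)
        p_b.copy_(avg)
```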
Authors:Xiaolu Hou, Mingcheng Li, Dingkang Yang, Jiawei Chen, Ziyun Qian, Xiao Zhao, Yue Jiang, Jinjie Wei, Qingyao Xu, Lihua Zhang
Abstract:
With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
中文: 本文提出BloomScene,一种轻量级3D高斯溅射方法,通过跨模态渐进生成、分层深度正则化和结构化压缩机制,从文本或图像输入生成高质量3D场景,有效减少存储需求并提升场景真实感。
English: The abstract introduces BloomScene, a lightweight 3D Gaussian splatting method that generates high-quality 3D scenes from text or image inputs using crossmodal progressive generation, hierarchical depth regularization, and structured compression to reduce storage and enhance realism.
Authors:Karl El Hajal, Enno Hermann, Ajinkya Kulkarni, Mathew Magimai.-Doss
Abstract:
Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by speaking rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at https://idiap.github.io/RnV .
Authors:Kazuma Onishi, Katsuhiko Hayashi
Abstract:
Extreme multi-label learning (XML) is the task of assigning multiple labels from an extremely large label set to each data instance. Many current high-performance XML models involve a large number of hyperparameters, which complicates tuning. Additionally, the models themselves are adapted specifically to XML, which complicates their reimplementation. To remedy this problem, we propose a simple method based on ridge regression for XML. The proposed method not only has a closed-form solution but also involves a single hyperparameter. Since there is no precedent for applying ridge regression to XML, we verify the method's performance on various XML benchmark datasets. Furthermore, we enhance the prediction of low-frequency labels in XML, which hold informative content. This prediction is essential yet challenging because of the limited amount of data. Here, we employ a simple frequency-based weighting. This approach greatly simplifies the process compared with existing techniques. Experimental results reveal that it can achieve levels of performance comparable to, or even exceeding, those of models with numerous hyperparameters. Additionally, we find that the frequency-based weighting significantly improves the predictive performance for low-frequency labels while requiring almost no changes in implementation. The source code for the proposed method is available on GitHub at https://github.com/cars1015/XML-ridge.
中文: 本文提出了一种基于岭回归的极大多标签学习方法,仅使用单一超参数,并通过频率加权提升低频标签的预测效果,其性能达到甚至超越了复杂模型。
English: This paper introduces a simple ridge regression-based method for extreme multi-label learning that uses only one hyperparameter and employs frequency-based weighting to enhance predictions for low-frequency labels, achieving performance comparable to or better than complex models.
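The closed-form core of the method fits in a few lines. The inverse log-frequency boost shown here is a plausible reading of the abstract's "simple frequency-based weighting", not necessarily the paper's exact formula.

```python
import numpy as np

def xml_ridge_fit(X: np.ndarray, Y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression for multi-label data: W = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # (d, L) label weights

def rare_label_boost(scores: np.ndarray, Y_train: np.ndarray) -> np.ndarray:
    """Re-weight label scores by inverse log label frequency to favor rare labels."""
    freq = Y_train.sum(axis=0) + 1.0
    return scores / np.log1p(freq)

# scores = X_test @ xml_ridge_fit(X_train, Y_train); optionally apply
# rare_label_boost(scores, Y_train), then predict the top-k labels per row.
```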
Authors:Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E
Abstract:
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
Chinese: PaSa是一种基于大语言模型的高级论文搜索代理,能通过工具调用和强化学习自主处理复杂学术查询,尽管使用合成数据训练,但在真实场景基准测试中显著优于现有基线模型。
English: PaSa is an advanced paper search agent powered by large language models that autonomously handles complex academic queries through tool invocation and reinforcement learning, significantly outperforming existing baselines on real-world benchmarks despite synthetic training.
Authors:Ali Can Karaca, M. Enes Ozelbas, Saadettin Berber, Orkhan Karimli, Turabi Yildirim, M. Fatih Amasyali
Abstract:
Remote sensing change captioning (RSICC) aims to describe changes between bitemporal images in natural language. Existing methods often fail under challenges like illumination differences, viewpoint changes, and blur effects, leading to inaccuracies, especially in no-change regions. Moreover, images acquired at different spatial resolutions or with registration errors tend to degrade the captions. To address these issues, we introduce SECOND-CC, a novel RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation maps, and diverse real-world scenarios. SECOND-CC contains 6,041 pairs of bitemporal RS images and 30,205 sentences describing the differences between images. Additionally, we propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including Cross-Modal Cross Attention (CMCA) and Multimodal Gated Cross Attention (MGCA). Detailed ablation studies and attention visualizations further demonstrate its effectiveness and ability to address RSICC challenges. Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC methods, including RSICCformer, Chg2Cap, and PSNet, with a +4.6% improvement on BLEU4 score and a +9.6% improvement on CIDEr score. We will make our dataset and codebase publicly available to facilitate future research at https://github.com/ChangeCapsInRS/SecondCC
Chinese: 本研究提出了高分辨率遥感数据集SECOND-CC和融合语义与视觉信息的多模态框架MModalCC,通过先进的注意力机制显著提升了变化描述准确性,在BLEU4和CIDEr指标上分别实现了4.6%和9.6%的性能提升。
English: The study introduces SECOND-CC, a high-resolution remote sensing dataset, and MModalCC, a multimodal framework that integrates semantic and visual data to improve change captioning accuracy, achieving state-of-the-art performance with significant gains in BLEU4 and CIDEr scores.
Authors:Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi
Abstract:
Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization, leading to inefficiency in token allocation. In this study, we introduce One-D-Piece, a discrete image tokenizer designed for variable-length tokenization with a quality-controllable mechanism. To enable variable compression rates, we introduce a simple but effective regularization mechanism named "Tail Token Drop" into discrete one-dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, enabling variable-length tokenization while preserving state-of-the-art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods, including JPEG and WebP, at smaller byte sizes. Furthermore, we assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared to other variable-rate methods. Our approach demonstrates the versatility of variable-length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of Tail Token Drop via a detailed analysis of the tokenizers.
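A Tail-Token-Drop-style regularizer can be sketched as keeping only a random-length prefix of each 1D token sequence during training, which pushes the tokenizer to concentrate critical information at the head. The sampling distribution and ratio below are assumptions, not values from the paper.

```python
import torch

def tail_token_drop(tokens: torch.Tensor, max_drop_ratio: float = 0.75) -> torch.Tensor:
    """Training-time sketch: keep a random-length prefix of a (B, L) token-id batch."""
    B, L = tokens.shape
    min_keep = max(1, int(L * (1.0 - max_drop_ratio)))
    keep = int(torch.randint(min_keep, L + 1, (1,)).item())   # shared cut length for the batch
    return tokens[:, :keep]
```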
Authors:Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Yichi Zhang, Chen Jiang, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Taiwei Zhang, Chensen Lin, Yuan Cheng, Yuan Qi
Abstract:
Intracranial aneurysm (IA) is a common cerebrovascular disease that is usually asymptomatic but may cause severe subarachnoid hemorrhage (SAH) if ruptured. Although clinical practice is usually based on individual factors and morphological features of the aneurysm, its pathophysiology and hemodynamic mechanisms remain controversial. To address the limitations of current research, this study constructed a comprehensive hemodynamic dataset of intracranial aneurysms. The dataset is based on 466 real aneurysm models, and 10,000 synthetic models were generated by resection and deformation operations, including 466 aneurysm-free models and 9,534 deformed aneurysm models. The dataset also provides medical image-like segmentation mask files to support insightful analysis. In addition, the dataset contains hemodynamic data measured at eight steady-state flow rates (0.001 to 0.004 kg/s), including critical parameters such as flow velocity, pressure, and wall shear stress, providing a valuable resource for investigating aneurysm pathogenesis and clinical prediction. This dataset will help advance the understanding of the pathologic features and hemodynamic mechanisms of intracranial aneurysms and support in-depth research in related fields. Dataset hosted at https://github.com/Xigui-Li/Aneumo.
中文: 本研究基于466个真实和1万个合成颅内动脉瘤模型构建了全面的血流动力学数据集,提供流速、压力和壁面剪应力等关键参数,将推动对动脉瘤发病机制和血流动力学机理的研究。
English: This study developed a comprehensive hemodynamic dataset using 466 real and 10,000 synthetic intracranial aneurysm models, providing flow velocity, pressure, and wall shear stress data to advance understanding of aneurysm pathogenesis and mechanisms.
Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Abstract:
Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is viable for reducing model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning reinstates the structural balance in block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with fewer computing and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
中文:MultiPruner通过多维修剪策略,依次压缩残差块、多层感知机通道和注意力头,在无需训练的情况下提升了零样本准确率并实现了更高的模型压缩比。
English: MultiPruner advances pruning of large pre-trained models by employing a multidimensional strategy that compresses residual blocks, MLP channels, and attention heads, achieving superior zero-shot accuracy and higher compression ratios without training.
Authors:Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield
Abstract:
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/
Authors:Jingchen Sun, Shaobo Han, Wataru Kohno, Changyou Chen
Abstract:
Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in various acoustic signal recognition tasks. Fiber-optic-based acoustic recognition is one of the most important downstream tasks and plays a significant role in environmental sensing. Adapting CLAP for fiber-optic acoustic recognition has become an active research area. Because fiber-optic sensors are non-conventional acoustic sensors, fiber-optic acoustic recognition presents a challenging, domain-specific, low-shot deployment environment with significant domain shifts due to unique frequency response and noise characteristics. To address these challenges, we propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set, leveraging both implicit knowledge through fine-tuning and explicit knowledge retrieved from memory for cross-domain generalization. Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset. Our research also provides valuable insights for other downstream acoustic recognition tasks. The code and gunshot-firework dataset are available at https://github.com/Jingchensun/clap-s.
中文:CLAP-S方法通过结合微调与支持集知识,成功将对比语言-音频预训练模型应用于光纤声学识别领域,在专业数据集上展现出卓越性能。
English: The proposed CLAP-S method effectively adapts Contrastive Language-Audio Pretraining models to fiber-optic acoustic recognition by combining fine-tuning with support set knowledge, demonstrating strong performance across specialized datasets.
Authors:Yuexi Du, Jiazhen Zhang, Tal Zeevi, Nicha C. Dvornek, John A. Onofrey
Abstract:
Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN demonstrated improved rotated image classification performance accuracy on all 16 test datasets in both 2D and 3D images, all while increasing efficiency with fewer parameters and reduced memory footprint. The code is available at https://github.com/XYPB/SRE-Conv.
中文摘要:本研究提出了一种对称旋转等变(SRE)卷积核,能在学习旋转不变特征的同时压缩模型规模,经生物医学图像数据集验证,其在提升分类精度的同时显著提高了计算效率。
English Summary: The study introduces a Symmetric Rotation-Equivariant (SRE) Convolution kernel that enhances CNN performance by learning rotation-invariant features while reducing model size, validated through improved accuracy and efficiency on biomedical image datasets.
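One common way to realize a symmetric, rotation-friendly convolution is to share a single learnable weight per radial ring of the kernel, which also shrinks the parameter count; this is a plausible reading of the SRE-Conv idea, not the released code. The sketch below builds such a kernel and checks that a 90-degree rotation of the input commutes with the layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadiallySymmetricConv2d(nn.Module):
    """Conv2d whose kernel shares one weight per radial ring (a sketch of the
    symmetric, rotation-equivariant idea; not the official SRE-Conv code)."""

    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        k = kernel_size
        yy, xx = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        c = (k - 1) / 2
        radius = torch.sqrt((yy - c) ** 2 + (xx - c) ** 2)
        rings = torch.round(radius).long()          # quantize radii into ring indices
        self.register_buffer("rings", rings)
        self.n_rings = int(rings.max().item()) + 1
        # one learnable scalar per (out_ch, in_ch, ring) instead of per kernel pixel
        self.ring_weight = nn.Parameter(torch.randn(out_ch, in_ch, self.n_rings) * 0.1)

    def forward(self, x):
        # expand the ring parameters back into a full k x k kernel
        kernel = self.ring_weight[:, :, self.rings]  # (out, in, k, k)
        return F.conv2d(x, kernel, padding=kernel.shape[-1] // 2)

# sanity check: rotating the input by 90 degrees and rotating the output back
# should give (nearly) the same feature map, since the kernel is symmetric.
layer = RadiallySymmetricConv2d(3, 8)
x = torch.randn(1, 3, 32, 32)
y1 = layer(x)
y2 = torch.rot90(layer(torch.rot90(x, 1, dims=(2, 3))), -1, dims=(2, 3))
print(torch.allclose(y1, y2, atol=1e-5))  # True, up to float tolerance
```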
Authors:Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Abstract:
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and novelty and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.
中文摘要:OmniThink是一种慢思考机器写作框架,通过模拟人类迭代学习过程来克服检索增强生成的局限性,能在保持连贯性和深度的同时提高生成文章的知识密度。
English Summary: OmniThink is a slow-thinking machine writing framework designed to overcome the limitations of retrieval-augmented generation by mimicking human iterative learning, resulting in articles with higher knowledge density while maintaining coherence and depth.
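The expand-and-reflect loop described above can be made concrete with plain control flow around an LLM call. Everything in the sketch (the `llm` and `retrieve` interfaces, the stopping rule, the prompts) is assumed scaffolding, not the OmniThink codebase.

```python
def omnithink_style_draft(topic, llm, retrieve, rounds=3):
    """Hedged sketch of an iterative expand/reflect writing loop.

    Assumed interfaces: llm(prompt) -> str, retrieve(query) -> list[str].
    The real framework organizes retrieved knowledge into a tree/outline;
    here a flat note list stands in for it.
    """
    notes = []
    questions = [topic]
    for _ in range(rounds):
        # expand: gather material for the currently open questions
        for q in questions:
            notes.extend(retrieve(q))
        # reflect: ask the model what is still shallow, redundant, or missing
        reflection = llm(
            "Given these notes, list follow-up questions that would add depth "
            "and remove redundancy:\n" + "\n".join(notes[-20:])
        )
        questions = [line.strip("- ").strip() for line in reflection.splitlines() if line.strip()]
        if not questions:
            break
    outline = llm(f"Write a detailed outline on '{topic}' from these notes:\n" + "\n".join(notes))
    return llm("Write the article following this outline:\n" + outline)
```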
Authors:Hongbo Zhao, Fei Zhu, Bolin Ni, Feng Zhu, Gaofeng Meng, Zhaoxiang Zhang
Abstract:
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on https://github.com/bjzhb666/GS-LoRA.
中文: 针对预训练视觉模型需持续删除特定信息的需求,本文提出GS-LoRA及其增强版GS-LoRA++,通过任务独立的LoRA模块结合组稀疏正则化和原型监督机制,在有效擦除目标知识的同时最大程度保持模型其余性能。
English: To address the need for continuous removal of specific information from pre-trained vision models while preserving overall performance, this paper introduces GS-LoRA and its enhanced version GS-LoRA++, which utilize task-specific LoRA modules with group sparse regularization and prototype supervision to effectively erase targeted knowledge with minimal impact on retained capabilities.
Authors:Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, Tommaso Biancalani
Abstract:
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at https://github.com/masa-ue/AlignInversePro
中文: 本教程深入探讨了扩散模型中基于推理时引导与对齐的方法来优化下游奖励函数,提出了新颖算法并建立了与语言模型及搜索方法的理论关联。
English: This tutorial offers a comprehensive overview of inference-time guidance and alignment techniques for optimizing reward functions in diffusion models, introducing novel algorithms and exploring connections with language models and search-based methods.
Authors:Hanrong Zhang, Yifei Yao, Zixuan Wang, Jiayuan Su, Mengxuan Li, Peng Peng, Hongwei Wang
Abstract:
Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model's decisions toward normal classes. To tackle these issues, we introduce the Supervised Contrastive knowledge distiLlation for class Incremental Fault Diagnosis (SCLIFD) framework, which combines supervised contrastive knowledge distillation for improved representation learning and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and a Random Forest classifier to address class imbalance. Extensive experimentation on simulated and real-world industrial datasets across various imbalance ratios demonstrates the superiority of SCLIFD over existing approaches. Our code can be found at https://github.com/Zhang-Henry/SCLIFD_TII.
中文:SCLIFD框架通过结合监督对比知识蒸馏以增强特征学习并减少遗忘、优先样本回放方法以及随机森林分类器来缓解类别不平衡,有效解决了类别增量故障诊断中的关键难题,并在不平衡数据集上展现出卓越性能。
English: The SCLIFD framework addresses class-incremental fault diagnosis challenges by integrating supervised contrastive knowledge distillation to enhance feature learning and reduce forgetting, a prioritized exemplar selection method for sample replay, and a Random Forest Classifier to mitigate class imbalance, demonstrating superior performance on imbalanced datasets.
Authors:Jan Skvrna, Lukas Neumann
Abstract:
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured.
We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets.
The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve comparable accuracy to using human labels from a single dataset. The source code and model are available at https://github.com/jskvrna/MonoSOWA.
中文: 本文提出了一种仅使用单目RGB相机图像且无需人工标注来训练3D物体检测器的新方法,在性能和效率上均显著优于现有技术。
English: This paper introduces a novel method for training a 3D object detector using only single RGB camera images without human annotations, achieving superior performance and efficiency over previous approaches.
Authors:Tobias Fiedler, Leon Hermann, Florian Müller, Sarel Cohen, Peter Chin, Tobias Friedrich, Eilon Vaadia
Abstract:
The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep Learning Brain Computer Interfaces (BCIs) have recently successfully mapped neuronal activity to text contents in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then execute full fine-tuning with pre-trained weights for Wav2Vec2, training ''from scratch'' without pre-trained weights, as well as freezing a pre-trained Wav2Vec2 and training only the BFE, each for 45 different BFE architectures. Across these experiments, the best run is from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54\%, outperforming the best training-from-scratch run by 20.46 percentage points and the frozen-Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
中文: 本研究证明将预训练的Wav2Vec2模型从音频语音识别迁移至脑机接口能显著提升大脑语音解码性能,通过完全微调实现了18.54%的字符错误率。
English: This study demonstrates that transferring pre-trained Wav2Vec2 models from audio speech recognition to brain-computer interfaces significantly enhances speech decoding performance, achieving a character error rate of 18.54% through full fine-tuning.
Authors:Veronika Spieker, Hannah Eichhorn, Wenqi Huang, Jonathan K. Stelter, Tabita Catalan, Rickmer F. Braren, Daniel Rueckert, Francisco Sahli Costabal, Kerstin Hammernik, Dimitrios C. Karampinos, Claudia Prieto, Julia A. Schnabel
Abstract:
Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_\mathrm{PISCO}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors (R$\geq$54), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO's potential as versatile self-supervised k-space loss function for further applications and architectures. Code is available at: https://github.com/compai-lab/2025-pisco-spieker
中文: 本研究提出了一种自监督k空间损失函数PISCO,通过强制全局邻域一致性且无需额外数据,显著提升了动态MRI中神经隐式k空间表示的性能,在高加速因子下实现了卓越的时空重建质量。
English: The study introduces a self-supervised k-space loss function, PISCO, which enhances neural implicit k-space representations for dynamic MRI by enforcing global neighborhood consistency without extra data, achieving superior reconstruction quality at high accelerations.
Authors:Zichang Ge, Changyu Chen, Arunesh Sinha, Pradeep Varakantham
Abstract:
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at https://github.com/Erasmo1015/vte.
中文: 本文提出了一种将状态-行动轨迹嵌入到潜在空间的新方法,以捕捉底层决策技能,无需奖励标签即可实现跨领域泛化,并在模仿和分类等任务中优于传统方法。
English: This paper introduces a novel method for embedding state-action trajectories into a latent space to capture underlying decision-making skills, enabling reward-free generalization across diverse domains and outperforming traditional approaches in tasks like imitation and classification.
Authors:Kanta Masuki, Yuto Ashida
Abstract:
Diffusion models represent a class of generative models that produce data by denoising a sample corrupted by white noise. Despite the success of diffusion models in computer vision, audio synthesis, and point cloud generation, so far they overlook inherent multiscale structures in data and have a slow generation process due to many iteration steps. In physics, the renormalization group offers a fundamental framework for linking different scales and giving an accurate coarse-grained model. Here we introduce a renormalization group-based diffusion model that leverages multiscale nature of data distributions for realizing a high-quality data generation. In the spirit of renormalization group procedures, we define a flow equation that progressively erases data information from fine-scale details to coarse-grained structures. Through reversing the renormalization group flows, our model is able to generate high-quality samples in a coarse-to-fine manner. We validate the versatility of the model through applications to protein structure prediction and image generation. Our model consistently outperforms conventional diffusion models across standard evaluation metrics, enhancing sample quality and/or accelerating sampling speed by an order of magnitude. The proposed method alleviates the need for data-dependent tuning of hyperparameters in the generative diffusion models, showing promise for systematically increasing sample efficiency based on the concept of the renormalization group.
Chinese: 本文提出了一种基于重整化群的扩散模型,利用数据的多尺度结构提升样本质量并加速生成过程,在蛋白质结构预测和图像生成等应用中优于传统扩散模型。
English: This paper introduces a renormalization group-based diffusion model that leverages multiscale data structures to enhance sample quality and accelerate generation speed, outperforming conventional diffusion models in applications like protein structure prediction and image generation.
Authors:Jianzi Xiang, Cailu Wan, Zhu Cao
Abstract:
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels. Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at https://github.com/embar111/pgpc
中文: 语义分割依赖昂贵的像素级标注,无监督域自适应结合对比学习虽能辅助,但现有方法忽略了类内特征多样性,导致预测错误;我们提出的伪标签引导像素对比(PGPC)框架通过有效利用目标图像信息解决了这一问题,在标准基准测试中显著提升性能且未增加模型复杂度。
English: Semantic segmentation requires costly pixel-level annotations, and while unsupervised domain adaptation with contrastive learning helps, existing methods overlook intra-class feature diversity, leading to prediction errors; our proposed Pseudo-label Guided Pixel Contrast (PGPC) framework addresses this by leveraging target image information effectively, achieving significant improvements on standard benchmarks without added model complexity.
Authors:Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos
Abstract:
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
中文: AI视频生成在视觉真实感上取得显著进展,但通过Physics-IQ基准测试发现,现有模型对物理原理的理解仍非常有限,表明视觉逼真度并不等同于物理认知能力。
English: AI video generation achieves high visual realism but lacks deep physical understanding, as demonstrated by the Physics-IQ benchmark, which shows current models struggle with fundamental physical principles despite some progress.
Authors:Ishan Amin, Sanjeev Raja, Aditi Krishnapriyan
Abstract:
The foundation model (FM) paradigm is transforming Machine Learning Force Fields (MLFFs), leveraging general-purpose representations and scalable training to perform a variety of computational chemistry tasks. Although MLFF FMs have begun to close the accuracy gap relative to first-principles methods, there is still a strong need for faster inference speed. Additionally, while research is increasingly focused on general-purpose models which transfer across chemical space, practitioners typically only study a small subset of systems at a given time. This underscores the need for fast, specialized MLFFs relevant to specific downstream applications, which preserve test-time physical soundness while maintaining train-time scalability. In this work, we introduce a method for transferring general-purpose representations from MLFF foundation models to smaller, faster MLFFs specialized to specific regions of chemical space. We formulate our approach as a knowledge distillation procedure, where the smaller "student" MLFF is trained to match the Hessians of the energy predictions of the "teacher" foundation model. Our specialized MLFFs can be up to 20 $\times$ faster than the original foundation model, while retaining, and in some cases exceeding, its performance and that of undistilled models. We also show that distilling from a teacher model with a direct force parameterization into a student model trained with conservative forces (i.e., computed as derivatives of the potential energy) successfully leverages the representations from the large-scale teacher for improved accuracy, while maintaining energy conservation during test-time molecular dynamics simulations. More broadly, our work suggests a new paradigm for MLFF development, in which foundation models are released along with smaller, specialized simulation "engines" for common chemical subsets.
中文: 基础模型正通过通用表示和可扩展训练改变机器学习力场,但需借助知识蒸馏方法开发更快速、针对特定化学空间的专用模型,以提升推理速度并保持物理准确性。
English: Foundation models are revolutionizing machine learning force fields by enabling scalable training and versatile applications, yet there is a need for faster, specialized models tailored to specific chemical systems through knowledge distillation techniques.
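Matching the teacher's Hessians does not require forming full Hessian matrices: Hessian-vector products along random probe directions can be computed with double backpropagation and compared between teacher and student. The sketch below illustrates that objective under an assumed `energy_fn(positions) -> scalar` interface; it is not the authors' training code, and the probe count and loss weighting are arbitrary.

```python
import torch

def hessian_vector_product(energy_fn, positions, v, create_graph=False):
    """Compute H @ v, where H is the Hessian of a scalar energy with respect
    to atomic positions, via double backpropagation."""
    positions = positions.detach().requires_grad_(True)
    energy = energy_fn(positions)                    # scalar (e.g., summed over a batch)
    (grad,) = torch.autograd.grad(energy, positions, create_graph=True)
    (hvp,) = torch.autograd.grad((grad * v).sum(), positions, create_graph=create_graph)
    return hvp

def hessian_distill_loss(teacher_energy, student_energy, positions, n_probes=4):
    """Match teacher and student Hessian-vector products along random probes."""
    loss = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(positions)
        hvp_teacher = hessian_vector_product(teacher_energy, positions, v).detach()
        hvp_student = hessian_vector_product(student_energy, positions, v, create_graph=True)
        loss = loss + torch.mean((hvp_student - hvp_teacher) ** 2)
    return loss / n_probes

# usage sketch: positions is an (n_atoms, 3) tensor; teacher_energy and
# student_energy map positions to a scalar potential energy.
```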
Authors:Keisuke Kamo, Hideaki Iiduka
Abstract:
Stochastic gradient descent with momentum (SGDM), in which a momentum term is added to SGD, has been well studied in both theory and practice. The theoretical studies show that the settings of the learning rate and momentum weight affect the convergence of SGDM. Meanwhile, the practical studies have shown that the batch-size setting strongly affects the performance of SGDM. In this paper, we focus on mini-batch SGDM with a constant learning rate and constant momentum weight, which is frequently used to train deep neural networks. We show theoretically that using a constant batch size does not always minimize the expectation of the full gradient norm of the empirical loss in training a deep neural network, whereas using an increasing batch size definitely minimizes it; that is, an increasing batch size improves the convergence of mini-batch SGDM. We also provide numerical results supporting our analyses, indicating specifically that mini-batch SGDM with an increasing batch size converges to stationary points faster than with a constant batch size, while also reducing computational cost. Python implementations of the optimizers used in the numerical experiments are available at https://github.com/iiduka-researches/NSHB_increasing_batchsize_acml25/.
中文: 本研究证明,在带动量的随机梯度下降法中,采用递增批次大小比固定批次大小能更快收敛到驻点并降低计算成本。
English: This study demonstrates that using an increasing batch size in mini-batch stochastic gradient descent with momentum (SGDM) improves convergence to stationary points and reduces computational cost compared to a constant batch size.
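In practice the recipe is simply a training loop whose batch size grows on a schedule while the learning rate and momentum weight stay constant. The PyTorch sketch below uses a synthetic regression problem and an assumed doubling schedule purely for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic data and a small model, purely for illustration
X, y = torch.randn(4096, 16), torch.randn(4096, 1)
dataset = TensorDataset(X, y)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # constant lr and momentum weight

batch_size = 8
for epoch in range(8):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for xb, yb in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()
    batch_size *= 2  # increasing batch size across epochs (assumed doubling schedule)
```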
Authors:Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang
Abstract:
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it can be comparable to or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.
中文: IDEA方法通过结合视觉特征和文本描述,无需训练即可提升CLIP在小样本图像分类中的性能,其可训练扩展T-IDEA利用生成的图像-文本对在11个数据集上取得了领先成果。
English: The IDEA method enhances CLIP for few-shot image classification by integrating visual features and textual descriptions, achieving competitive or superior performance without training, while its trainable extension T-IDEA sets new benchmarks on 11 datasets using generated image-text pairs.
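A training-free adapter in this spirit can be sketched as a similarity cache: the query image is scored against the few-shot images, against textual descriptions of those images, and against the usual class prompts, and the scores are fused. The shapes, the alpha/beta fusion, and the aggregation rule below are assumptions, not the released IDEA code; all features are presumed to be precomputed, L2-normalized CLIP embeddings.

```python
import numpy as np

def idea_style_logits(q_img, support_img, support_txt, support_labels,
                      zeroshot_txt, n_classes, alpha=1.0, beta=1.0):
    """Training-free few-shot class scores from precomputed, L2-normalized features.

    q_img:          (d,)   query image feature
    support_img:    (N, d) few-shot image features
    support_txt:    (N, d) features of generated descriptions for those images
    support_labels: (N,)   integer class labels of the support images
    zeroshot_txt:   (C, d) class-prompt text features
    The multimodal cache and the alpha/beta fusion are illustrative assumptions.
    """
    zeroshot = zeroshot_txt @ q_img                          # standard CLIP zero-shot scores
    sim = alpha * (support_img @ q_img) + beta * (support_txt @ q_img)
    onehot = np.eye(n_classes)[support_labels]               # (N, C)
    fewshot = np.exp(sim) @ onehot                           # sharpen and aggregate per class
    return zeroshot + fewshot / fewshot.sum()
```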
Authors:Jaemyung Yu, Jaehyun Choi, Dong-Jae Lee, HyeongGwon Hong, Junmo Kim
Abstract:
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at https://github.com/jaemyung-u/stl.
中文: 提出的自监督变换学习(STL)方法用图像不变的变换表示替代变换标签,在不增加批次复杂度的前提下实现有效的等变学习,并在多个基准测试中超越现有方法。
English: The proposed Self-supervised Transformation Learning (STL) method replaces transformation labels with image-invariant transformation representations, enabling effective equivariant learning without increasing batch complexity and outperforming existing methods across multiple benchmarks.
Authors:Oscar Ramos-Soto, Jorge Ramos-Frutos, Ezequiel Perez-Zarate, Diego Oliva, Sandra E. Balderas-Mata
Abstract:
Feature extraction techniques are crucial in medical image classification; however, classical feature extractors, in addition to traditional machine learning classifiers, often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The MIAFEx output feature quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating their superiority in accuracy and robustness across multiple complex medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at https://github.com/Oscar-RamosS/Medical-Image-Attention-based-Feature-Extractor-MIAFEx
Chinese: 提出的MIAFEx方法通过可学习的优化机制改进Transformer架构中的分类标记,在医学图像特征提取中显著优于传统和现代模型,尤其在数据有限时展现出更优的准确性与鲁棒性。
English: The proposed MIAFEx method enhances Transformer-based feature extraction for medical images by refining classification tokens with a learnable mechanism, outperforming both classical and modern models in accuracy and robustness, especially with limited data.
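The learnable refinement of the classification token can be pictured as a small gating module applied to the Transformer encoder's [CLS] output before the classifier. The exact gating form below (a sigmoid-weighted residual) is an assumption for illustration, not the MIAFEx implementation.

```python
import torch
import torch.nn as nn

class RefinedClsHead(nn.Module):
    """Sketch of a learnable refinement of the [CLS] token (assumed form:
    elementwise gating by learned weights) followed by a linear classifier."""

    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.refine = nn.Parameter(torch.zeros(dim))   # learned refinement weights
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # emphasize or suppress feature dimensions of the [CLS] token
        refined = cls_token + cls_token * torch.sigmoid(self.refine)
        return self.classifier(refined)

# usage sketch: cls = vit_encoder(images)[:, 0]   # (B, dim) classification token
#               logits = RefinedClsHead(dim=768, n_classes=5)(cls)
```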
Authors:Yiran Tao, Jehan Yang, Dan Ding, Zackory Erickson
Abstract:
Teleoperating high degrees-of-freedom (DoF) robotic manipulators via low-DoF controllers like joysticks often requires frequent switching between control modes, where each mode maps controller movements to specific robot actions. Manually performing this frequent switching can make teleoperation cumbersome and inefficient. On the other hand, existing automatic mode-switching solutions, such as heuristic-based or learning-based methods, are often task-specific and lack generalizability. In this paper, we introduce LLM-Driven Automatic Mode Switching (LAMS), a novel approach that leverages Large Language Models (LLMs) to automatically switch control modes based on task context. Unlike existing methods, LAMS requires no prior task demonstrations and incrementally improves by integrating user-generated mode-switching examples. We validate LAMS through an ablation study and a user study with 10 participants on complex, long-horizon tasks, demonstrating that LAMS effectively reduces manual mode switches, is preferred over alternative methods, and improves performance over time. The project website with supplementary materials is at https://lams-assistance.github.io/.
Authors:Matthieu Kirchmeyer, Pedro O. Pinheiro, Saeed Saremi
Abstract:
We introduce a new representation for 3D molecules based on their continuous atomic density fields. Using this representation, we propose a new model based on walk-jump sampling for unconditional 3D molecule generation in the continuous space using neural fields. Our model, FuncMol, encodes molecular fields into latent codes using a conditional neural field, samples noisy codes from a Gaussian-smoothed distribution with Langevin MCMC (walk), denoises these samples in a single step (jump), and finally decodes them into molecular fields. FuncMol performs all-atom generation of 3D molecules without assumptions on the molecular structure and scales well with the size of molecules, unlike most approaches. Our method achieves competitive results on drug-like molecules and easily scales to macro-cyclic peptides, with at least one order of magnitude faster sampling. The code is available at https://github.com/prescient-design/funcmol.
中文: 我们提出FuncMol模型,利用连续原子密度场和行走-跳跃采样方法,实现高效的无条件三维分子生成,在保持竞争力的同时显著提升了采样速度。
English: We propose FuncMol, a novel model that uses continuous atomic density fields and walk-jump sampling for efficient unconditional 3D molecule generation, achieving competitive results and faster sampling speeds.
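Walk-jump sampling alternates Langevin updates on a Gaussian-smoothed latent distribution (the walk) with a single denoising step (the jump). Assuming a trained `denoiser(y)` that predicts the clean code from a noisy one, Tweedie's formula gives the smoothed score as (denoiser(y) - y) / sigma^2; the step sizes and iteration counts below are illustrative only.

```python
import torch

@torch.no_grad()
def walk_jump_sample(denoiser, dim, sigma=1.0, n_steps=200, step=1e-3):
    """Sketch of walk-jump sampling in a latent code space (assumed denoiser interface)."""
    y = sigma * torch.randn(1, dim)                  # start from pure noise
    noise_scale = (2.0 * step) ** 0.5
    for _ in range(n_steps):
        score = (denoiser(y) - y) / sigma**2         # Tweedie: smoothed score from the denoiser
        y = y + step * score + noise_scale * torch.randn_like(y)   # Langevin "walk"
    return denoiser(y)                               # single-step "jump" to a clean latent code
```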
Authors:Anastasios N. Angelopoulos, Michael I. Jordan, Ryan J. Tibshirani
Abstract:
We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by, nor implies, sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.
中文: 梯度均衡这一在线学习新视角指损失函数梯度序列的平均值收敛于零,可通过恒定步长方法实现,并在回归、分类等预测问题中提供可解释的优势,如应对分布偏移的去偏和校准。
English: The concept of gradient equilibrium in online learning is achieved when the average of gradients converges to zero, which can be attained through constant step-size methods like gradient descent and offers interpretable benefits across various prediction tasks, including debiasing under distribution shifts.
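For squared error, the post hoc debiasing scheme reduces to online gradient descent on an additive bias with a constant step size, b <- b - eta * (f(x) + b - y); gradient equilibrium then says the running average of these residual gradients goes to zero. A minimal numpy sketch under these assumptions:

```python
import numpy as np

def debias_online(preds, targets, eta=0.1):
    """Post hoc debiasing of black-box predictions via constant-step-size
    online gradient descent on squared error (a sketch of the idea)."""
    b = 0.0
    debiased, grads = [], []
    for f, y in zip(preds, targets):
        debiased.append(f + b)
        g = (f + b) - y          # gradient of 0.5*(f + b - y)^2 w.r.t. b
        grads.append(g)
        b -= eta * g             # constant step size
    # gradient equilibrium: the average gradient (i.e., average residual)
    # of the debiased predictions tends to zero
    return np.array(debiased), np.mean(grads)

# example with a predictor that is systematically biased under a shift
rng = np.random.default_rng(0)
y = rng.normal(size=5000)
f = y + 1.5                      # predictions biased upward
_, avg_grad = debias_online(f, y)
print(round(float(avg_grad), 3))  # close to 0
```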
Authors:Wennuo Yang, Shiling Wu, Yuzhi Zhou, Cheng Luo, Xilin He, Weicheng Xie, Linlin Shen, Siyang Song
Abstract:
Multivariate Time Series Classification (MTSC) enables the analysis of complex temporal data and thus serves as a cornerstone in various real-world applications, ranging from healthcare to finance. Since the relationships among variables in MTS usually contain crucial cues, a large number of graph-based MTSC approaches have been proposed, as the graph topology and edges can explicitly represent relationships among variables (channels); both various MTS graph representation learning strategies and different Graph Neural Networks (GNNs) have been explored. Despite such progress, there is no comprehensive study that fairly benchmarks and investigates the performance of existing widely-used graph representation learning strategies/GNN classifiers across different MTSC tasks. In this paper, we present the first benchmark which systematically investigates the effectiveness of three widely-used node feature definition strategies, four edge feature learning strategies, and five GNN architectures, resulting in 60 different variants for graph-based MTSC. These variants are developed and evaluated with a standardized data pipeline and training/validation/testing strategy on 26 widely-used MTSC datasets. Our experiments highlight that node features significantly influence MTSC performance, while the visualization of edge features illustrates why adaptive edge learning outperforms other edge feature learning methods. The code of the proposed benchmark is publicly available at \url{https://github.com/CVI-yangwn/Benchmark-GNN-for-Multivariate-Time-Series-Classification}.
中文: 本文首次建立了基于图的多变量时间序列分类基准,通过系统评估26个数据集上的60种节点/边特征策略与图神经网络架构组合,发现节点特征对分类性能影响显著,且自适应边学习方法最具优势。
English: This paper introduces the first comprehensive benchmark for graph-based Multivariate Time Series Classification, systematically evaluating 60 combinations of node/edge feature strategies and GNN architectures across 26 datasets, revealing that node features critically impact performance while adaptive edge learning proves most effective.
Authors:Rui Daniel, M. Rita Verdelho, Catarina Barata, Carlos Santiago
Abstract:
Deep Learning for medical imaging faces challenges in adapting and generalizing to new contexts. Additionally, it often lacks sufficient labeled data for specific tasks requiring significant annotation effort. Continual Learning (CL) tackles adaptability and generalizability by enabling lifelong learning from a data stream while mitigating forgetting of previously learned knowledge. Active Learning (AL) reduces the number of required annotations for effective training. This work explores both approaches (CAL) to develop a novel framework for robust medical image analysis. Based on the automatic recognition of shifts in image characteristics, the Replay-Based Architecture for Context Adaptation (RBACA) employs a CL rehearsal method to continually learn from diverse contexts and an AL component to select the most informative instances for annotation. A novel approach to evaluating CAL methods is established using a metric denominated the IL-Score, which allows for the simultaneous assessment of transfer learning, forgetting, and final model performance. We show that RBACA works in domain- and class-incremental learning scenarios by assessing its IL-Score on the segmentation and diagnosis of cardiac images. The results show that RBACA outperforms a baseline framework without CAL and a state-of-the-art CAL method across various memory sizes and annotation budgets. Our code is available at https://github.com/RuiDaniel/RBACA.
中文摘要:本研究提出RBACA框架,融合持续学习与主动学习,通过最小化标注需求适应新场景,在心脏图像分割与诊断中展现出卓越性能。
English Summary: This study introduces RBACA, a novel framework combining continual and active learning to enhance medical image analysis by adapting to new contexts with minimal annotation, achieving superior performance in cardiac image segmentation and diagnosis.
Authors:Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray
Abstract:
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge, which poses considerable security risks when using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to solve this but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework simultaneously assesses both code functionality and security, using high-quality task specifications and outcome-driven test oracles that provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation on LLM-generated code, overcoming previous benchmarks' shortcomings. Through our evaluations, CWEval reveals a notable portion of functional but insecure code produced by LLMs, and shows a serious inaccuracy of previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at: https://github.com/Co1lin/CWEval.
Chinese: CWEval 是一个结果驱动的评估框架,通过高质量任务规范和结果驱动的测试机制,同时评估代码功能性和安全性,有效克服了以往基准测试的不足,显著提升了LLM生成代码的安全评估水平。
English: CWEval is an outcome-driven framework that enhances secure code generation evaluation by simultaneously assessing both functionality and security, overcoming limitations of previous benchmarks through high-quality specifications and multilingual testing.
Authors:Yin Fang, Xinle Deng, Kangwei Liu, Ningyu Zhang, Jingyang Qian, Penghui Yang, Xiaohui Fan, Huajun Chen
Abstract:
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks-such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction-using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.
中文: InstructCell作为多模态AI助手,通过自然语言实现对单细胞RNA测序数据的直观灵活分析,在细胞注释和药物预测等任务中优于现有模型,同时显著降低了复杂生物数据的技术门槛。
English: InstructCell is a multimodal AI copilot that uses natural language to enable intuitive and flexible analysis of single-cell RNA sequencing data, outperforming existing models in tasks like cell annotation and drug prediction while making complex biological data more accessible.
Authors:Qian Zeng, Jie Song, Han Zheng, Hao Jiang, Mingli Song
Abstract:
Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
中文: D2-DPM提出双重去噪机制,有效抵消扩散模型量化噪声影响,在实现显著压缩和加速的同时提升了生成质量。
English: D2-DPM introduces a dual denoising mechanism to counteract quantization noise effects in diffusion models, achieving enhanced generation quality with significant compression and acceleration.
Authors:Xudong Wang, Qingbo Hao, Xu Cheng, Yingyuan Xiao
Abstract:
Federated learning has emerged as a key paradigm in privacy-preserving computing due to its "data usable but not visible" property, enabling users to collaboratively train models without sharing raw data. Motivated by this, federated recommendation systems offer a promising architecture that balances user privacy with recommendation accuracy through distributed collaborative learning. However, existing federated recommendation methods often neglect the underlying semantic or behavioral relationships between users during parameter aggregation, which limits their recommendation effectiveness. To overcome this limitation, graph-based federated recommendation systems have been proposed to leverage neighborhood information. Yet, conventional graph construction methods usually require access to raw user data or explicit social links, which contradicts the strict privacy requirements of federated learning. In this work, we propose UFGraphFR (User Text-feature-based Graph Federated Recommendation), a novel personalized federated recommendation framework that constructs a user graph based on clients' locally embedded text features. Our core assumption is that users with similar textual feature descriptions exhibit similar preferences. Accordingly, UFGraphFR introduces two key components: (1) a privacy-preserving user relationship graph constructed from the joint embedding layer's weight matrix without leaking raw user attributes; (2) a Transformer-based architecture to model temporal dependencies in user-item interaction sequences. Experimental results on benchmark datasets such as MovieLens and HetRec2011 demonstrate that UFGraphFR achieves recommendation accuracy comparable to both centralized and state-of-the-art federated baselines while preserving user privacy. The code is available at: https://github.com/trueWangSyutung/UFGraphFR.
中文:UFGraphFR提出了一种基于嵌入文本特征构建用户图谱的隐私保护联邦推荐框架,通过Transformer架构建模时序依赖,在保护用户隐私的同时实现了与先进方法相当的推荐精度。
English: UFGraphFR introduces a privacy-preserving federated recommendation framework that constructs user graphs from embedded text features and employs Transformer architecture to enhance recommendation accuracy without compromising user data security.
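The privacy-preserving user graph can be sketched as follows: each client shares only its locally trained joint-embedding weight matrix, and the server links users whose flattened weights are most similar. The flattening and top-k rule below are assumptions used for illustration, not the released UFGraphFR code.

```python
import numpy as np

def build_user_graph(weight_matrices, k=5):
    """Sketch: adjacency over clients from cosine similarity of their locally
    trained joint-embedding weights (no raw user attributes leave the client).

    weight_matrices: list of np.ndarray, one per client, identical shapes.
    Returns an (n, n) 0/1 adjacency with k neighbors per client.
    """
    vecs = np.stack([w.ravel() for w in weight_matrices])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T
    np.fill_diagonal(sim, -np.inf)                 # no self-loops
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        adj[i, np.argsort(sim[i])[-k:]] = 1.0      # keep the k most similar users
    return adj
```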
Authors:Xiao Xu, Qiong Wu, Pingyi Fan, Kezhi Wang
Abstract:
Vehicle-to-Infrastructure (V2I) technology enables information exchange between vehicles and road infrastructure. Specifically, when a vehicle approaches a roadside unit (RSU), it can exchange information with the RSU to obtain accurate data that assists in driving. With the release of the 3rd Generation Partnership Project (3GPP) Release 16, which includes the 5G New Radio (NR) Vehicle-to-Everything (V2X) standards, vehicles typically adopt mode-2 communication using sensing-based semi-persistent scheduling (SPS) for resource allocation. In this approach, vehicles identify candidate resources within a selection window and exclude ineligible resources based on information from a sensing window. However, vehicles often drive at different speeds, resulting in varying amounts of data transmission with RSUs as they pass by, which leads to unfair access. Therefore, it is essential to design an access scheme that accounts for different vehicle speeds to achieve fair access across the network. This paper formulates an optimization problem for vehicular networks and proposes a multi-objective optimization scheme to address it by adjusting the selection window in the SPS mechanism of 5G NR V2I mode-2. Simulation results demonstrate the effectiveness of the proposed scheme.
中文: 车对基础设施通信使车辆与路边单元能够交换数据,但不同车速导致接入不公,因此提出一种多目标优化方案,通过调整5G NR V2I模式2中的选择窗口来确保公平性。
English: Vehicle-to-Infrastructure communication enables data exchange between vehicles and roadside units, but varying speeds cause unfair access, prompting a proposed multi-objective optimization scheme that adjusts the selection window in 5G NR V2I mode-2 to ensure fairness.
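One simple way to account for speed is to scale each vehicle's SPS selection window so that fast vehicles, which dwell near an RSU only briefly, are pushed toward earlier transmissions. The linear mapping and parameter values below are purely illustrative assumptions; they are neither the 3GPP-specified procedure nor the paper's multi-objective optimization.

```python
def selection_window_ms(speed_mps, t2_min=20, t2_max=100, v_min=5.0, v_max=40.0):
    """Illustrative speed-aware choice of the SPS selection-window upper bound T2.

    Faster vehicles spend less time in RSU coverage, so they get a shorter
    window (earlier transmissions); slower vehicles get a longer one. The
    linear mapping and the T2 bounds are assumptions for this sketch.
    """
    v = min(max(speed_mps, v_min), v_max)
    frac = (v - v_min) / (v_max - v_min)
    return round(t2_max - frac * (t2_max - t2_min))

for v in (5, 15, 30, 40):
    print(v, "m/s ->", selection_window_ms(v), "ms")
```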
Authors:Mohamed A. Taha
Abstract:
Long-range sequence modeling is a crucial aspect of natural language processing and time series analysis. However, traditional models like Recurrent Neural Networks (RNNs) and Transformers suffer from computational and memory inefficiencies, especially when dealing with long sequences. This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that leverages a hierarchical logarithmic tree structure to efficiently store and retrieve past information. LMNs dynamically summarize historical context, significantly reducing the memory footprint and computational complexity of attention mechanisms from O(n²) to O(log(n)). The model employs a single-vector, targeted attention mechanism to access stored information, and the memory block construction worker (summarizer) layer operates in two modes: a parallel execution mode during training for efficient processing of hierarchical tree structures and a sequential execution mode during inference, which acts as a memory management system. It also implicitly encodes positional information, eliminating the need for explicit positional encodings. These features make LMNs a robust and scalable solution for processing long-range sequences in resource-constrained environments, offering practical improvements in efficiency and scalability. The code is publicly available under the MIT License on GitHub: https://github.com/AhmedBoin/LogarithmicMemory.
中文摘要:本文提出对数记忆网络(LMN),通过分层树结构和目标注意力机制将计算复杂度从O(n²)降至O(log(n)),无需显式位置编码即可实现长序列的高效处理。
English Summary: This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that reduces computational complexity from O(n²) to O(log(n)) through hierarchical tree structures and targeted attention, enabling efficient long-sequence processing without explicit positional encodings.
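The O(log(n)) memory can be pictured as a binary-counter-style hierarchy: slot i summarizes roughly 2^i past steps, and appending a new step triggers carry-like merges of equal-level slots. The sketch below captures only this bookkeeping; the averaging merge is an assumed placeholder for the paper's summarizer layer, and the targeted attention is omitted.

```python
import numpy as np

class LogMemorySketch:
    """Keep O(log n) summary vectors for a stream of n inputs.
    Slot i summarizes ~2**i consecutive items; two slots of the same level
    are merged and promoted, like a carry in a binary counter."""

    def __init__(self):
        self.slots = []  # list of (level, summary_vector)

    def append(self, x: np.ndarray) -> None:
        level, vec = 0, x
        while self.slots and self.slots[-1][0] == level:
            _, prev = self.slots.pop()
            vec = 0.5 * (prev + vec)   # assumed merge: simple averaging
            level += 1
        self.slots.append((level, vec))

    def read(self) -> np.ndarray:
        return np.stack([v for _, v in self.slots])

mem = LogMemorySketch()
for t in range(1000):
    mem.append(np.full(8, float(t)))
print(len(mem.slots))   # 6 slots summarize 1000 steps (at most ~log2(n) + 1)
```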
Authors:Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt
Abstract:
Language model (LM) post-training relies on two stages of human supervision: task demonstrations for supervised finetuning (SFT), followed by preference comparisons for reinforcement learning from human feedback (RLHF). As LMs become more capable, the tasks they are given become harder to supervise. Will post-training remain effective under unreliable supervision? To test this, we simulate unreliable demonstrations and comparison feedback using small LMs and time-constrained humans. We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT. To address this, we propose iterative label refinement (ILR) as an alternative to RLHF. ILR improves the SFT data by using comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrains the model via SFT on the updated data. SFT+ILR outperforms SFT+DPO on several tasks with unreliable supervision (math, coding, and safe instruction-following). Our findings suggest that as LMs are used for complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback; instead, it is better to direct feedback towards improving the training data rather than continually training the model. Our code and data are available at https://github.com/helloelwin/iterative-label-refinement.
中文: 当人类监督不可靠时,监督微调(SFT)仍保持部分有效性,但直接偏好优化(DPO)无法进一步优化模型,因此提出迭代标签优化(ILR)作为更优替代方案,通过提升训练数据质量而非持续训练模型来改善性能。
English: When human supervision becomes unreliable, supervised fine-tuning (SFT) remains partially effective, but direct preference optimization (DPO) fails to enhance the model further, prompting the introduction of iterative label refinement (ILR) as a superior alternative that improves training data quality instead of continuous model training.
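The refinement loop itself is simple to state: use comparison feedback not to train a reward model but to decide, demonstration by demonstration, whether a model-generated response should replace the human one, then re-run SFT. A schematic sketch with placeholder callables (train_sft, generate, and prefer are stand-ins, not the released implementation):

```python
def iterative_label_refinement(dataset, train_sft, generate, prefer, rounds=3):
    """Sketch of ILR: refine the SFT data with comparison feedback, then
    retrain via SFT only (no RLHF step)."""
    model = train_sft(dataset)                      # initial SFT
    for _ in range(rounds):
        refined = []
        for prompt, demo in dataset:
            candidate = generate(model, prompt)     # model-written alternative
            # Keep whichever response the (possibly noisy) comparison prefers.
            better = candidate if prefer(prompt, candidate, demo) else demo
            refined.append((prompt, better))
        dataset = refined
        model = train_sft(dataset)                  # retrain on updated data
    return model
```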
Authors:Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu
Abstract:
Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. An effective workflow adjustment is crucial in real-world scenarios, as the initial plan must adjust to unforeseen challenges and changing conditions in real time to ensure the efficient execution of complex tasks. In this paper, we define workflows as an activity-on-vertex (AOV) graph, which allows continuous workflow refinement by LLM agents through dynamic subtask allocation adjustment based on historical performance and previous AOVs. To further enhance framework performance, we emphasize modularity in workflow design based on evaluating parallelism and dependency complexity. With this design, our proposed multi-agent framework achieves efficient concurrent execution of subtasks, effective goal achievement, and enhanced error tolerance. Empirical results across various practical tasks demonstrate significant improvements in the efficiency of multi-agent frameworks through dynamic workflow refinement and modularization. The code is available at: https://github.com/tmllab/2025_ICLR_FLOW.
中文: 本文提出一种多智能体框架,通过LLM智能体动态优化工作流和模块化设计,显著提升了现实场景中任务执行的效率与容错能力。
English: This paper introduces a multi-agent framework that enhances automated task execution by dynamically refining workflows using LLM agents and modular design, significantly improving efficiency and error tolerance in real-world scenarios.
Authors:Farnoosh Koleini, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, Abbey Fenwick
Abstract:
Recent advancements in 3D human pose estimation from single-camera images and videos have relied on parametric models, like SMPL. However, these models oversimplify anatomical structures, limiting their accuracy in capturing true joint locations and movements, which reduces their applicability in biomechanics, healthcare, and robotics. Biomechanically accurate pose estimation, on the other hand, typically requires costly marker-based motion capture systems and optimization techniques in specialized labs. To bridge this gap, we propose BioPose, a novel learning-based framework for predicting biomechanically accurate 3D human pose directly from monocular videos. BioPose includes three key components: a Multi-Query Human Mesh Recovery model (MQ-HMR), a Neural Inverse Kinematics (NeurIK) model, and a 2D-informed pose refinement technique. MQ-HMR leverages a multi-query deformable transformer to extract multi-scale fine-grained image features, enabling precise human mesh recovery. NeurIK treats the mesh vertices as virtual markers, applying a spatial-temporal network to regress biomechanically accurate 3D poses under anatomical constraints. To further improve 3D pose estimations, a 2D-informed refinement step optimizes the query tokens during inference by aligning the 3D structure with 2D pose observations. Experiments on benchmark datasets demonstrate that BioPose significantly outperforms state-of-the-art methods. Project website: \url{https://m-usamasaleem.github.io/publication/BioPose/BioPose.html}.
Authors:Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Abstract:
Automated chest radiograph interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods' 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.
Chinese: RadAlign是一种创新框架,通过结合视觉语言模型进行精确疾病分类和大型语言模型生成详细放射学报告,以0.885的平均AUC和0.678的GREEN评分实现卓越性能,同时提升临床可解释性。
English: RadAlign is a novel framework that integrates vision-language models for accurate disease classification and large language models for generating detailed radiology reports, achieving superior performance with an AUC of 0.885 and a GREEN score of 0.678 while enhancing clinical interpretability.
Authors:Brendan Mallery, James M. Murphy, Shuchin Aeron
Abstract:
We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of reference measures given a set of coefficients belonging to the simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of barycenters with respect to the entropy-regularized Wasserstein-2 cost as solutions that correspond to a fixed point of an average of the entropy-regularized displacement maps. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when the measure being analyzed is a barycenter with respect to the entropy-regularized Wasserstein-2 cost. We show that these coefficients, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, and that barycentric coefficients are stable with respect to perturbations in the Wasserstein-2 metric. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.
中文: 本研究利用熵正则化Wasserstein-2成本与Sinkhorn散度进行概率测度的综合与分析,建立了重心特征描述,并在小样本场景下展示了优于神经网络的高效分类应用。
English: This study explores the synthesis and analysis of probability measures using entropy-regularized Wasserstein-2 cost and Sinkhorn divergence, establishing barycenter characterizations and demonstrating efficient classification applications with superior performance in data-scarce scenarios.
Authors:Daniel Steininger, Julia Simon, Andreas Trondl, Markus Murschitz
Abstract:
Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments posing significant safety risks. Progressively automating these tasks has the potential of increasing their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
中文: TimberVision数据集通过提供包含5.1万个树干部件的2000多张标注图像,解决了林业自动化中专业数据匮乏的问题,利用先进的计算机视觉技术实现了对原木和树木的精准检测与追踪。
English: The TimberVision dataset addresses the scarcity of specialized data for automating forestry tasks by providing over 2,000 annotated images with 51,000 trunk components, enabling robust detection and tracking of logs and trees through advanced computer vision techniques.
Authors:Haochuan Zhang, Chunhua Yang, Jie Han, Liyang Qin, Xiaoli Wang
Abstract:
Multi-modal language models have made advanced progress in vision and audio, but still face significant challenges in dealing with complex reasoning tasks in the time series domain. The reasons are twofold. First, labels for multi-modal time series data are coarse and devoid of analysis or reasoning processes. Training with these data cannot improve the model's reasoning capabilities. Second, due to the lack of precise tokenization in processing time series, the representation patterns for temporal and textual information are inconsistent, which hampers the effectiveness of multi-modal alignment. To address these challenges, we propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. Specifically, we construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Additionally, the proposed TempoGPT achieves consistent representation between temporal and textual information by quantizing temporal embeddings, where temporal embeddings are quantized into a series of discrete tokens using a predefined codebook; subsequently, a shared embedding layer processes both temporal and textual tokens. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art performance on the constructed complex time series reasoning tasks. Moreover, we quantitatively demonstrate the effectiveness of quantizing temporal embeddings in enhancing multi-modal alignment and the reasoning capabilities of TLMs. Code and data are available at https://github.com/zhanghaochuan20/TempoGPT.
中文: 多模态语言模型在时间序列复杂推理中面临标注粗糙和表征不一致的挑战,而提出的TempoGPT通过构建白盒系统数据和时间嵌入量化,实现了时序与文本信息对齐,显著提升了推理性能。
English: Multi-modal language models struggle with complex time series reasoning due to coarse labels and inconsistent temporal-textual representations, but the proposed TempoGPT addresses these by constructing specialized data and quantizing temporal embeddings for improved alignment and performance.
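The quantization step that makes temporal and textual tokens share one embedding layer is, at its core, a nearest-neighbour lookup against a predefined codebook. A minimal sketch (the codebook size, dimension, and Euclidean distance are arbitrary assumptions):

```python
import numpy as np

def quantize_embeddings(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each temporal embedding to the index of its nearest codebook entry,
    producing discrete tokens that a shared embedding layer can consume."""
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)   # one discrete token id per timestep

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # assumed: 256 codes of dimension 64
temporal = rng.normal(size=(32, 64))    # 32 timestep embeddings
tokens = quantize_embeddings(temporal, codebook)
print(tokens.shape, tokens[:8])
```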
Authors:Csaba Tóth, Danilo Jr Dela Cruz, Harald Oberhauser
Abstract:
The signature kernel is a positive definite kernel for sequential and temporal data that has become increasingly popular in machine learning applications due to powerful theoretical guarantees, strong empirical performance, and recently introduced various scalable variations. In this chapter, we give a short introduction to $\texttt{KSig}$, a $\texttt{Scikit-Learn}$ compatible Python package that implements various GPU-accelerated algorithms for computing signature kernels, and performing downstream learning tasks. We also introduce a new algorithm based on tensor sketches which gives strong performance compared to existing algorithms. The package is available at https://github.com/tgcsaba/ksig.
中文: 签名核是处理序列数据的高效工具,KSig软件包提供了GPU加速的签名核计算和学习任务实现,其中包含基于张量素描的新算法,性能优异。
English: The signature kernel is a highly effective tool for sequential data analysis, and the KSig package offers GPU-accelerated implementations for computing these kernels and performing learning tasks, including a new tensor sketch-based algorithm with strong performance.
Authors:Hoang-Thang Ta, Duy-Quy Thai, Anh Tran, Grigori Sidorov, Alexander Gelbukh
Abstract:
Kolmogorov-Arnold Networks (KANs) represent an innovation in neural network architectures, offering a compelling alternative to Multi-Layer Perceptrons (MLPs) in models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. By advancing network design, KANs drive groundbreaking research and enable transformative applications across various scientific domains involving neural networks. However, existing KANs often require significantly more parameters in their network layers than MLPs. To address this limitation, this paper introduces PRKANs (Parameter-Reduced Kolmogorov-Arnold Networks), which employ several methods to reduce the parameter count in KAN layers, making them comparable to MLP layers. Experimental results on the MNIST and Fashion-MNIST datasets demonstrate that PRKANs outperform several existing KANs, and their variant with attention mechanisms rivals the performance of MLPs, albeit with slightly longer training times. Furthermore, the study highlights the advantages of Gaussian Radial Basis Functions (GRBFs) and layer normalization in KAN designs. The repository for this work is available at: https://github.com/hoangthangta/All-KAN.
中文: PRKANs通过参数精简技术改进了Kolmogorov-Arnold网络,在基准数据集上实现了与多层感知机相当的性能,同时保留了KANs的创新优势。
English: PRKANs introduce parameter reduction techniques to Kolmogorov-Arnold Networks, achieving performance comparable to MLPs on benchmark datasets while maintaining KANs' innovative advantages.
Authors:Henry Li, Ronen Basri, Yuval Kluger
Abstract:
Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the literature for multiscale modeling: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, we show that the Laplacian pyramid and wavelet transform also produce significant improvements to the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains, we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity. Code can be found at https://github.com/lihenryhfl/pcdm.
中文摘要:级联模型通过采用拉普拉斯金字塔和小波变换等分层保体积映射,解决了多尺度似然评估的难处理性问题,从而实现了最先进的似然建模性能。
English Summary: Cascaded models can achieve state-of-the-art likelihood modeling by using hierarchical volume-preserving transformations like Laplacian pyramids and wavelet transforms, which overcome the intractability of multi-scale likelihood evaluation.
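One of the two hierarchical maps mentioned above, the Laplacian pyramid, is easy to reproduce: each level stores the residual between the signal and an upsampled copy of its coarser version, so the decomposition is exactly invertible and the likelihood can be written jointly over scales. A minimal single-channel sketch using average-pool downsampling and nearest-neighbour upsampling (these operator choices are assumptions, not necessarily the ones used in the paper):

```python
import numpy as np

def down(x):   # 2x average pooling (assumes even height and width)
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):     # nearest-neighbour 2x upsampling
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def laplacian_pyramid(x, levels=3):
    """Decompose x into per-scale residuals plus a coarse base image."""
    residuals = []
    for _ in range(levels):
        coarse = down(x)
        residuals.append(x - up(coarse))   # detail lost by coarsening
        x = coarse
    return residuals, x

def reconstruct(residuals, base):
    x = base
    for r in reversed(residuals):
        x = up(x) + r
    return x

img = np.random.default_rng(0).normal(size=(32, 32))
res, base = laplacian_pyramid(img)
print(np.allclose(reconstruct(res, base), img))   # True: exactly invertible
```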
Authors:Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, Jeannette Bohg
Abstract:
Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website https://portal-cornell.github.io/motion_track_policy/.
Authors:Jimeng Shi, Azam Shirali, Bowen Jin, Sizhe Zhou, Wei Hu, Rahuul Rangaraj, Shaowen Wang, Jiawei Han, Zhaonan Wang, Upmanu Lall, Yanzhao Wu, Leonardo Bobadilla, Giri Narasimhan
Abstract:
Physics-based numerical models have been the bedrock of atmospheric sciences for decades, offering robust solutions but often at the cost of significant computational resources. Deep learning (DL) models have emerged as powerful tools in meteorology, capable of analyzing complex weather and climate data by learning intricate dependencies and providing rapid predictions once trained. While these models demonstrate promising performance in weather prediction, often surpassing traditional physics-based methods, they still face critical challenges. This paper presents a comprehensive survey of recent deep learning and foundation models for weather prediction. We propose a taxonomy to classify existing models based on their training paradigms: deterministic predictive learning, probabilistic generative learning, and pre-training and fine-tuning. For each paradigm, we delve into the underlying model architectures, address major challenges, offer key insights, and propose targeted directions for future research. Furthermore, we explore real-world applications of these methods and provide a curated summary of open-source code repositories and widely used datasets, aiming to bridge research advancements with practical implementations while fostering open and trustworthy scientific practices in adopting cutting-edge artificial intelligence for weather prediction. The related sources are available at https://github.com/JimengShi/DL-Foundation-Models-Weather.
中文摘要:本文综述了用于天气预报的深度学习和基础模型,按训练范式分类并探讨了挑战、应用及开源资源,旨在推动人工智能在气象领域的应用发展。
English Summary: This paper surveys deep learning and foundation models for weather prediction, categorizing them by training paradigms and addressing challenges, applications, and open-source resources to advance AI in meteorology.
Authors:Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Abstract:
Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models - even with off-the-shelf rewards - can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering.
中文: 本文提出Feynman-Kac引导方法,这是一种无需训练即可通过奖励函数在推理时指导扩散模型的框架,在文本到图像生成和文本扩散任务中,其效果优于经过微调的模型,显著提升了样本质量和可控性。
English: This paper introduces Feynman-Kac (FK) steering, an inference-time framework that guides diffusion models using reward functions to enhance sample quality and controllability without requiring training, achieving superior results in text-to-image generation and text diffusion tasks compared to fine-tuned models.
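The resampling step at the heart of FK steering is standard sequential Monte Carlo bookkeeping: particles are weighted by a potential evaluated on their intermediate states and then resampled in proportion to those weights. The sketch below shows only that step, with a made-up potential; the framework's actual potentials, rewards, and samplers vary.

```python
import numpy as np

def resample(particles: np.ndarray, potentials: np.ndarray,
             rng: np.random.Generator) -> np.ndarray:
    """Multinomial resampling: particles with larger potential values are
    more likely to be duplicated, the rest are dropped."""
    weights = potentials / potentials.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

def toy_potential(x: np.ndarray) -> np.ndarray:
    # Made-up potential favouring states close to the origin.
    return np.exp(-np.sum(x ** 2, axis=-1))

rng = np.random.default_rng(0)
particles = rng.normal(size=(8, 4))       # 8 interacting diffusion states
steered = resample(particles, toy_potential(particles), rng)
print(steered.shape)                      # still 8 particles, reweighted
```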
Authors:Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
中文: 本文提出SPAM优化器,通过动量重置和梯度裁剪有效缓解大语言模型训练中的梯度尖峰问题,在多种任务中提升训练稳定性与资源效率,性能优于现有方法。
English: This paper introduces SPAM, a novel optimizer that mitigates gradient spikes in LLM training through momentum reset and adaptive clipping, improving stability and efficiency across various tasks and outperforming existing methods.
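The two ingredients named above, momentum reset and spike-aware clipping, can be sketched around a plain Adam-style update: when the incoming gradient norm is far above its running average, the gradient is clipped and the moment buffers are zeroed. This is only an illustrative toy, not the released optimizer; the spike test and every constant are assumptions.

```python
import numpy as np

class SpikeAwareAdamSketch:
    """Toy Adam-style update with spike-aware clipping and momentum reset."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, spike_factor=10.0):
        self.lr = lr
        self.b1, self.b2 = betas
        self.eps = eps
        self.spike_factor = spike_factor
        self.m = None        # first-moment (momentum) buffer
        self.v = None        # second-moment buffer
        self.t = 0
        self.g_avg = None    # running average of gradient norms

    def step(self, params, grad):
        if self.m is None:
            self.m, self.v = np.zeros_like(grad), np.zeros_like(grad)
            self.g_avg = np.linalg.norm(grad)
        gnorm = np.linalg.norm(grad)
        if gnorm > self.spike_factor * self.g_avg:                   # spike detected
            grad = grad * (self.spike_factor * self.g_avg / gnorm)   # clip it
            self.m[:], self.v[:] = 0.0, 0.0                          # reset momentum
        self.g_avg = 0.99 * self.g_avg + 0.01 * gnorm
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

opt = SpikeAwareAdamSketch()
w, rng = np.zeros(4), np.random.default_rng(0)
for _ in range(5):
    w = opt.step(w, rng.normal(size=4))
print(w)
```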
Authors:Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, Shinji Watanabe
Abstract:
Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations for speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, even with the high dimensionality and redundancy of S3M representations, preprocessing S3M representations for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as orthogonality or interpretability of individual components of ICA.
Chinese: 本文研究了标准化、主成分分析、白化和独立成分分析等线性预处理方法,用于优化自监督语音模型表示的聚类以生成离散语音单元,并验证了它们在提升自动语音识别性能方面的有效性。
English: This paper explores linear preprocessing methods like standardization, PCA, whitening, and ICA to enhance the clustering of self-supervised speech model representations for discrete speech units, demonstrating their effectiveness in improving automatic speech recognition performance.
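All of the linear preprocessing steps studied here are available off the shelf in scikit-learn, so a DSU pipeline is short to write down. A minimal sketch with placeholder features (the array shapes, component count, and cluster count are arbitrary; FastICA or plain standardization can be swapped into the same spot as the PCA-whitening step):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 768))   # placeholder S3M frame representations

# Linear preprocessing before clustering: standardize, then whiten with PCA.
scaled = StandardScaler().fit_transform(frames)
whitened = PCA(n_components=256, whiten=True, random_state=0).fit_transform(scaled)

# k-means cluster ids play the role of discrete speech units (DSUs).
units = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(whitened)
print(units.shape, units[:10])
```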
Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew C Yao
Abstract:
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor Product Attention Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
中文: 本文提出张量积注意力(TPA),这是一种通过张量分解压缩键值缓存的新机制,能在语言任务中减少内存占用,同时保持或提升模型性能。
English: The paper introduces Tensor Product Attention (TPA), a novel mechanism that compresses key-value caches using tensor decompositions to reduce memory usage while maintaining or improving model performance in language tasks.
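The memory saving comes from caching small low-rank factors instead of full keys and values. The single-head sketch below shows the generic idea with a static factorization (K ≈ A_k B_k, V ≈ A_v B_v); TPA's actual factors are contextual tensor products integrated with RoPE, so this is an illustration of the caching trick rather than the mechanism itself.

```python
import numpy as np

def lowrank_attention(q, A_k, B_k, A_v, B_v):
    """Single-head attention where keys and values are kept only as low-rank
    factors, so neither K nor V is ever materialized.
    q: (d,), A_*: (T, r), B_*: (r, d)."""
    scores = (q @ B_k.T) @ A_k.T / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return (w @ A_v) @ B_v

rng = np.random.default_rng(0)
T, d, r = 128, 64, 8
q = rng.normal(size=d)
A_k, B_k = rng.normal(size=(T, r)), rng.normal(size=(r, d))
A_v, B_v = rng.normal(size=(T, r)), rng.normal(size=(r, d))
print(lowrank_attention(q, A_k, B_k, A_v, B_v).shape)   # (64,)
# Per-token cache: 2*r numbers instead of 2*d, plus the shared B factors.
```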
Authors:Jerry Chee, Arturs Backurs, Rainie Heck, Li Zhang, Janardhan Kulkarni, Thomas Rothvoss, Sivakanth Gopi
Abstract:
Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly.
We study the rounding problem from the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}(1/\epsilon)$ samples from the data distribution, we can round all but $O(m)$ model weights such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate).
Our proof, which is algorithmic, inspired a simple and practical rounding algorithm called \emph{DiscQuant}. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64\% accuracy on the GSM8k dataset, whereas GPTQ achieves 54\% and RTN achieves 31\% (the original model achieves 84\%). We make our code available at https://github.com/jerry-chee/DiscQuant.
中文摘要:本文提出DiscQuant方法,基于差异理论通过数据依赖的舍入优化神经网络权重量化,在多个基准测试中显著超越了GPTQ和舍入至最近等现有技术。
English Summary: This paper introduces DiscQuant, a data-dependent rounding method that leverages discrepancy theory to optimize neural network weight quantization, significantly outperforming prior techniques like GPTQ and RTN across multiple benchmarks.
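For reference, the round-to-nearest baseline that DiscQuant (and GPTQ) improve upon is a one-line operation once the quantization grid is fixed; data-dependent rounding instead chooses the up/down direction per weight using calibration samples. A sketch of the RTN step only:

```python
import numpy as np

def round_to_nearest(weights: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Round each weight to the closest value in a fixed quantization grid
    (the data-independent baseline discussed above)."""
    idx = np.abs(weights[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

grid = np.linspace(-1.0, 1.0, 2 ** 3)   # e.g. a uniform 3-bit grid
w = np.random.default_rng(0).normal(scale=0.3, size=8)
print(round_to_nearest(w, grid))
```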
Authors:José Ramón Pareja Monturiol, Alejandro Pozas-Kerstjens, David Pérez-García
Abstract:
We present a tensorization algorithm for constructing tensor train representations of functions, drawing on sketching and cross interpolation ideas. The method only requires black-box access to the target function and a small set of sample points defining the domain of interest. Thus, it is particularly well-suited for machine learning models, where the domain of interest is naturally defined by the training dataset. We show that this approach can be used to enhance the privacy and interpretability of neural network models. Specifically, we apply our decomposition to (i) obfuscate neural networks whose parameters encode patterns tied to the training data distribution, and (ii) estimate topological phases of matter that are easily accessible from the tensor train representation. Additionally, we show that this tensorization can serve as an efficient initialization method for optimizing tensor trains in general settings, and that, for model compression, our algorithm achieves a superior trade-off between memory and time complexity compared to conventional tensorization methods of neural networks.
中文: 该张量化算法通过素描和交叉插值构建张量链表示,仅需黑盒函数访问和小样本集,特别适用于提升机器学习模型的隐私性、可解释性及效率。
English: This tensorization algorithm constructs tensor train representations using sketching and cross interpolation, requiring only black-box function access and a small sample set, making it ideal for enhancing privacy, interpretability, and efficiency in machine learning models.
Authors:Daojun Liang, Haixia Zhang, Dongfeng Yuan
Abstract:
Long-term and Large-scale Wireless Traffic Forecasting (LL-WTF) is pivotal for strategic network management and comprehensive planning on a macro scale. However, LL-WTF poses greater challenges than short-term ones due to the pronounced non-stationarity of extended wireless traffic and the vast number of nodes distributed at the city scale. To cope with this, we propose a Progressive Supervision method based on Label Decomposition (PSLD). Specifically, we first introduce a Random Subgraph Sampling (RSS) algorithm designed to sample a tractable subset from large-scale traffic data, thereby enabling efficient network training. Then, PSLD employs label decomposition to obtain multiple easy-to-learn components, which are learned progressively at shallow layers and combined at deep layers to effectively cope with the non-stationary problem raised by LL-WTF tasks. Finally, we compare the proposed method with various state-of-the-art (SOTA) methods on three large-scale WT datasets. Extensive experimental results demonstrate that the proposed PSLD significantly outperforms existing methods, with an average 2%, 4%, and 11% performance improvement on three WT datasets, respectively. In addition, we built an open source library for WT forecasting (WTFlib) to facilitate related research, which contains numerous SOTA methods and provides a strong benchmark. Experiments can be reproduced through https://github.com/Anoise/WTFlib.
中文: 提出的基于标签分解的渐进监督方法(PSLD)通过随机子图采样和分层学习策略,有效解决了长期大规模无线流量预测中的非平稳性问题,在多个数据集上显著超越了现有最优方法。
English: The proposed Progressive Supervision with Label Decomposition (PSLD) method effectively addresses long-term wireless traffic forecasting challenges by employing random subgraph sampling and progressive learning, achieving significant performance improvements over state-of-the-art methods across multiple datasets.
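In its simplest form, Random Subgraph Sampling just draws a node subset and keeps the induced edges, so that each training step touches only a tractable slice of the city-scale graph; the paper's variant may differ in how nodes are chosen. A minimal sketch:

```python
import numpy as np

def random_subgraph(adj: np.ndarray, n_sample: int,
                    rng: np.random.Generator):
    """Sample n_sample nodes uniformly at random and return their indices
    together with the induced sub-adjacency matrix."""
    nodes = rng.choice(adj.shape[0], size=n_sample, replace=False)
    return nodes, adj[np.ix_(nodes, nodes)]

rng = np.random.default_rng(0)
adj = (rng.random((1000, 1000)) < 0.01).astype(float)   # toy large graph
nodes, sub = random_subgraph(adj, 64, rng)
print(nodes.shape, sub.shape)   # (64,) (64, 64)
```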
Authors:Gent Wu
Abstract:
Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at https://github.com/erow/PoorViTs.
中文: 本文通过评估低秩压缩和多类别令牌等策略,系统优化了适用于小数据集的微型视觉变换器,发现压缩查询带来的性能损失极小,且引入多个CLS令牌能提升全局表示能力与准确率。
English: This paper systematically optimizes Tiny Vision Transformers for small datasets by evaluating strategies like low-rank compression and multi-class tokens, finding minimal performance loss with compressed queries and enhanced accuracy through multiple CLS tokens.
Authors:Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, Arash Vahdat
Abstract:
Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design. Our code is available at https://github.com/NVIDIA-Digital-Bio/genmol.
中文:GenMol是一种通用的分子生成框架,它采用单一离散扩散模型,通过基于片段的构建策略和创新的片段重掩蔽与分子上下文引导技术,在多种药物发现任务中超越现有模型,为分子设计提供了统一且高效的解决方案。
English: GenMol is a versatile molecular generative framework that uses a single discrete diffusion model with fragment-based building blocks and innovative strategies like fragment remasking and molecular context guidance to outperform previous models across diverse drug discovery tasks, offering a unified approach for molecular design.
Authors:Julius Berner, Lorenz Richter, Marcin Sendera, Jarrid Rector-Brooks, Nikolay Malkin
Abstract:
We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.
中文摘要:本研究提出了一种无需目标数据即可训练神经扩散模型从玻尔兹曼分布中采样的方法,证明了熵强化学习与连续时间目标之间的等价性,并通过优化时间离散化实现了更高的采样效率。
English Summary: This research introduces a method for training neural diffusion models to sample from Boltzmann distributions without target data, demonstrating equivalence between entropic reinforcement learning and continuous-time objectives while achieving improved efficiency through optimized time discretization.
Authors:Steffen Dereich, Arnulf Jentzen, Adrian Riekert
Abstract:
Deep learning methods - usually consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays omnipresent in data-driven learning problems as well as in scientific computing tasks such as optimal control (OC) and partial differential equation (PDE) problems. In practically relevant learning tasks, the plain-vanilla standard SGD optimization method is often not employed to train the considered class of DNNs; instead, more sophisticated adaptive and accelerated variants of the standard SGD method, such as the popular Adam optimizer, are used. Inspired by the classical Polyak-Ruppert averaging approach, in this work we apply averaged variants of the Adam optimizer to train DNNs to approximately solve exemplary scientific computing problems in the form of PDEs and OC problems. We test the averaged variants of Adam in a series of learning problems, including physics-informed neural network (PINN), deep backward stochastic differential equation (deep BSDE), and deep Kolmogorov approximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn PDEs), as well as DNN approximations for OC problems and for image classification (ResNet for CIFAR-10). In each of the numerical examples, the employed averaged variants of Adam outperform the standard Adam and the standard SGD optimizers, particularly for the scientific machine learning problems. The Python source codes for the numerical experiments associated with this work can be found on GitHub at https://github.com/deeplearningmethods/averaged-adam.
中文: 本研究证明,在训练深度神经网络解决偏微分方程和最优控制等科学计算问题时,采用平均化变体的Adam优化器比标准Adam和随机梯度下降方法表现更优。
English: This study demonstrates that averaged variants of the Adam optimizer outperform standard Adam and SGD methods in training deep neural networks for scientific computing tasks like PDEs and optimal control problems.
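Polyak-Ruppert-style averaging on top of Adam only requires a running average of the iterates, used in place of the final iterate at evaluation time. A minimal PyTorch sketch with a plain arithmetic mean (one of several possible averaging schedules; the paper studies its own variants):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

avg = [p.detach().clone() for p in model.parameters()]   # running average
n_avg = 1

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    n_avg += 1
    with torch.no_grad():
        for a, p in zip(avg, model.parameters()):
            a += (p - a) / n_avg          # incremental mean update

# Evaluate with the averaged weights instead of the last Adam iterate.
with torch.no_grad():
    for a, p in zip(avg, model.parameters()):
        p.copy_(a)
```

PyTorch's torch.optim.swa_utils.AveragedModel offers similar bookkeeping out of the box, if a ready-made utility is preferred.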
Authors:Kevin Mancini, Islem Rekik
Abstract:
Graph Neural Networks (GNNs) are popular deep learning models designed to process graph-structured data through recursive neighborhood aggregations in the message passing process. When applied to semi-supervised node classification, the message-passing enables GNNs to understand short-range spatial interactions, but also causes them to suffer from over-smoothing and over-squashing. These challenges hinder model expressiveness and prevent the use of deeper models to capture long-range node interactions (LRIs) within the graph. Popular solutions for LRI detection are either too expensive to process large graphs due to high time complexity or fail to generalize across diverse graph structures. To address these limitations, we propose a mechanism called \emph{information flow control}, which leverages a novel connectivity measure, called \emph{information flow score}, to address over-smoothing and over-squashing with linear computational overhead, supported by theoretical evidence. Finally, to prove the efficacy of our methodology, we design DeltaGNN, the first scalable and generalizable approach for detecting long-range and short-range interactions. We benchmark our model across 10 real-world datasets, including graphs with varying sizes, topologies, densities, and homophilic ratios, showing superior performance with limited computational complexity. The implementation of the proposed methods is publicly available at https://github.com/basiralab/DeltaGNN.
Chinese: 图神经网络在捕捉长程交互时面临过度平滑和挤压的挑战,而提出的DeltaGNN通过信息流控制机制,以线性计算成本有效解决了这些问题,并在多样化图结构中展现出优越性能。
English: Graph Neural Networks face challenges like over-smoothing and over-squashing that limit their ability to capture long-range interactions, but the proposed DeltaGNN with information flow control effectively addresses these issues while maintaining computational efficiency across diverse datasets.
Authors:Sauda Adiv Hanum, Ashim Dey, Muhammad Ashad Kabir
Abstract:
The skin, as the largest organ of the human body, is vulnerable to a diverse array of conditions collectively known as skin lesions, which encompass various dermatoses. Diagnosing these lesions presents significant challenges for medical practitioners due to the subtle visual differences that are often imperceptible to the naked eye. While not all skin lesions are life-threatening, certain types can act as early indicators of severe diseases, including skin cancers, underscoring the critical need for timely and accurate diagnostic methods. Deep learning algorithms have demonstrated remarkable potential in facilitating the early detection and prognosis of skin lesions. This study advances the field by curating a comprehensive and diverse dataset comprising 39 categories of skin lesions, synthesized from five publicly available datasets. Using this dataset, the performance of five state-of-the-art deep learning models -- MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer -- is rigorously evaluated. To enhance the accuracy and robustness of these models, attention mechanisms such as the Efficient Channel Attention (ECA) and the Convolutional Block Attention Module (CBAM) are incorporated into their architectures. Comprehensive evaluation across multiple performance metrics reveals that the Vision Transformer model integrated with CBAM outperforms others, achieving an accuracy of 93.46%, precision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%. These results underscore the significant potential of the proposed system in supporting medical professionals with accurate and efficient prognostic tools for diagnosing a broad spectrum of skin lesions. The dataset and code used in this study can be found at https://github.com/akabircs/Skin-Lesions-Classification.
Chinese: 本研究通过将注意力机制与五种先进模型相结合,开发了一个用于诊断39种皮肤病变的深度学习系统,结果表明集成CBAM的Vision Transformer模型表现最佳,准确率达93.46%。
English: This study develops a deep learning system for diagnosing 39 types of skin lesions by integrating attention mechanisms with five advanced models, demonstrating that the Vision Transformer with CBAM achieves superior performance with 93.46% accuracy.
Authors:Zhifan Song, Yuan Zhang, Abd Al Rahman M. Abu Ebayyeh
Abstract:
Detecting small targets in drone imagery is challenging due to low resolution, complex backgrounds, and dynamic scenes. We propose EDNet, a novel edge-target detection framework built on an enhanced YOLOv10 architecture, optimized for real-time applications without post-processing. EDNet incorporates an XSmall detection head and a Cross Concat strategy to improve feature fusion and multi-scale context awareness for detecting tiny targets in diverse environments. Our unique C2f-FCA block employs Faster Context Attention to enhance feature extraction while reducing computational complexity. The WIoU loss function is employed for improved bounding box regression. With seven model sizes ranging from Tiny to XL, EDNet accommodates various deployment environments, enabling local real-time inference and ensuring data privacy. Notably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer parameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16 to 55 FPS, providing a scalable and efficient solution for edge-based object detection in challenging drone imagery. The source code and pre-trained models are available at: https://github.com/zsniko/EDNet.
中文摘要:EDNet是基于YOLOv10优化的实时目标检测框架,通过改进特征融合和多尺度上下文感知技术,有效提升无人机图像中小目标检测性能,在多种模型尺寸下以更少参数实现更高精度。
English Summary: EDNet is an optimized real-time object detection framework based on YOLOv10 that enhances small target detection in drone imagery through improved feature fusion and multi-scale context awareness, achieving higher accuracy with fewer parameters across multiple model sizes.
Authors:Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Abstract:
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, since the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
中文: VideoRAG是一种创新框架,通过动态检索相关视频并利用大型视频语言模型处理多模态内容来增强回答生成,它通过引入帧选择机制和文本提取策略弥补了现有方法的不足,从而提高了事实准确性。
English: VideoRAG is a novel framework that dynamically retrieves relevant videos and leverages their multimodal content through Large Video Language Models to enhance response generation, addressing gaps in existing methods by incorporating frame selection and text extraction for improved accuracy.
Authors:Sunwoo Kim, Minkyu Kim, Dongmin Park
Abstract:
Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at https://github.com/krafton-ai/DAS.
Chinese: 本文提出了一种无需训练、基于序贯蒙特卡洛的测试时方法,能在保持扩散模型多样性和泛化能力的同时有效对齐目标奖励,其性能优于现有微调方法。
English: This paper introduces a training-free, test-time Sequential Monte Carlo method that effectively aligns diffusion models with target rewards while preserving their diversity and generalization, outperforming existing fine-tuning approaches.
Authors:Taywon Min, Haeone Lee, Yongchan Kwon, Kimin Lee
Abstract:
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in refining their strategies to better align with expert feedback. By quantifying the impact of human feedback, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
中文: 本研究引入影响函数来评估人类反馈在RLHF中对奖励模型的影响,能够高效检测偏差并指导标注者提升反馈的准确性和一致性。
English: This study introduces influence functions to assess the impact of human feedback on reward models in RLHF, enabling efficient detection of biases and guidance for labelers to improve feedback accuracy and consistency.
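For context, the classical influence-function estimate of how up-weighting a single training comparison $z$ changes the loss on a validation example $z_{\mathrm{val}}$ is
\[
\mathcal{I}(z, z_{\mathrm{val}}) \;=\; -\,\nabla_\theta \mathcal{L}(z_{\mathrm{val}}, \hat{\theta})^{\top} \, H_{\hat{\theta}}^{-1} \, \nabla_\theta \mathcal{L}(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} \mathcal{L}(z_i, \hat{\theta}),
\]
where $\hat{\theta}$ denotes the trained reward model's parameters and $H_{\hat{\theta}}$ the empirical Hessian of the training loss. The paper's contribution is a compute-efficient approximation of this quantity for LLM-scale reward models and preference datasets, so the estimator actually used there may differ from this textbook form.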
Authors:Yinghao Zhu, Xiaochen Zheng, Ahmed Allam, Michael Krauthammer
Abstract:
We propose TAMER, a Test-time Adaptive MoE-driven framework for Electronic Health Record (EHR) Representation learning. TAMER introduces a framework where a Mixture-of-Experts (MoE) architecture is co-designed with Test-Time Adaptation (TTA) to jointly mitigate the intertwined challenges of patient heterogeneity and distribution shifts in EHR modeling. The MoE focuses on latent patient subgroups through domain-aware expert specialization, while TTA enables real-time adaptation to evolving health status distributions when new patient samples are introduced. Extensive experiments across four real-world EHR datasets demonstrate that TAMER consistently improves predictive performance for both mortality and readmission risk tasks when combined with diverse EHR modeling backbones. TAMER offers a promising approach for dynamic and personalized EHR-based predictions in practical clinical settings.
中文: TAMER提出一种测试时自适应框架,通过专家混合架构解决电子健康记录建模中的患者异质性和分布偏移问题,在多个真实数据集上显著提升了死亡率和再入院风险的预测性能。
English: TAMER is a test-time adaptive framework using a Mixture-of-Experts architecture to address patient heterogeneity and distribution shifts in EHR modeling, demonstrating improved performance for mortality and readmission predictions across multiple datasets.
Authors:Ayush Khot, Xiwei Wang, Avik Roy, Volodymyr Kindratenko, Mark S. Neubauer
Abstract:
Current methods commonly used for uncertainty quantification (UQ) in deep learning (DL) models utilize Bayesian methods which are computationally expensive and time-consuming. In this paper, we provide a detailed study of UQ based on evidential deep learning (EDL) for deep neural network models designed to identify jets in high energy proton-proton collisions at the Large Hadron Collider and explore its utility in anomaly detection. EDL is a DL approach that treats learning as an evidence acquisition process designed to provide confidence (or epistemic uncertainty) about test data. Using publicly available datasets for jet classification benchmarking, we explore hyperparameter optimizations for EDL applied to the challenge of UQ for jet identification. We also investigate how the uncertainty is distributed for each jet class, how this method can be implemented for the detection of anomalies, how the uncertainty compares with Bayesian ensemble methods, and how the uncertainty maps onto latent spaces for the models. Our studies uncover some pitfalls of EDL applied to anomaly detection and a more effective way to quantify uncertainty from EDL as compared with the foundational EDL setup. These studies illustrate a methodological approach to interpreting EDL in jet classification models, providing new insights on how EDL quantifies uncertainty and detects out-of-distribution data which may lead to improved EDL methods for DL models applied to classification tasks.
中文: 本研究探讨了证据深度学习在喷注分类模型中的不确定性量化应用,揭示了其相对于贝叶斯方法的优势,并提出了改进异常检测和不确定性解释的新途径。
English: This study explores evidential deep learning (EDL) for uncertainty quantification in jet classification models, revealing its advantages over Bayesian methods and identifying improved approaches for anomaly detection and uncertainty interpretation.
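As a reference point, the foundational evidential setup that the abstract contrasts against can be written in a few lines: class logits are mapped to non-negative Dirichlet evidence, and the total evidence yields both class probabilities and a scalar epistemic uncertainty. This is the textbook EDL readout, not the paper's improved quantification.

```python
import torch
import torch.nn.functional as F

def edl_uncertainty(logits):
    """Foundational evidential-deep-learning readout: treat non-negative
    outputs as Dirichlet evidence and derive class probabilities plus a
    scalar epistemic uncertainty per example."""
    evidence = F.softplus(logits)            # non-negative evidence per class
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                 # expected class probabilities
    k = logits.shape[-1]
    uncertainty = k / strength.squeeze(-1)   # in (0, 1]; close to 1 = almost no evidence
    return probs, uncertainty

# example: a batch of 4 jets over 5 classes
probs, u = edl_uncertainty(torch.randn(4, 5))
```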
Authors:Zhao Yang, Bing Su, Jiahao Chen, Ji-Rong Wen
Abstract:
Predicting multiple functions labeled with Enzyme Commission (EC) numbers from the enzyme sequence is of great significance but remains a challenge due to its sparse multi-label classification nature, i.e., each enzyme is typically associated with only a few labels out of more than 6000 possible EC numbers. However, existing machine learning algorithms generally learn a fixed global representation for each enzyme to classify all functions, thereby lacking interpretability, and the fine-grained information of some function-specific local residue fragments may be overwhelmed. Here we present an attention-based framework, namely ProtDETR (Protein Detection Transformer), by casting enzyme function prediction as a detection problem. It uses a set of learnable functional queries to adaptively extract different local representations from the sequence of residue-level features for predicting different EC numbers. ProtDETR not only significantly outperforms existing deep learning-based enzyme function prediction methods, but also provides a new interpretable perspective on automatically detecting different local regions for identifying different functions through cross-attentions between queries and residue-level features. Code is available at https://github.com/yangzhao1230/ProtDETR.
中文:ProtDETR是一个基于注意力的框架,将酶功能预测视为检测问题,通过功能查询提取局部表征来预测EC编号,在性能和可解释性上均优于现有方法。
English: ProtDETR is an attention-based framework that treats enzyme function prediction as a detection problem, using functional queries to extract local representations for predicting EC numbers, achieving superior performance and interpretability over existing methods.
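A minimal sketch of the detection-style idea: a fixed set of learnable queries cross-attends over residue-level features, and each query is scored against the EC-number label space. The dimensions and module choices below are invented for illustration and do not reproduce ProtDETR's architecture.

```python
import torch
import torch.nn as nn

class FunctionalQueryHead(nn.Module):
    """Toy detection-style head: learnable queries cross-attend over
    residue-level features; each query is classified over the label space."""
    def __init__(self, num_queries=32, d_model=256, num_labels=6000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, residue_feats):                 # (batch, seq_len, d_model)
        b = residue_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, attn = self.cross_attn(q, residue_feats, residue_feats)
        return self.classifier(attended), attn        # per-query logits, attention map
```

The returned attention map is what gives the interpretability angle: each query's weights over residues indicate which local region it used for its prediction.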
Authors:Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan
Abstract:
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90-100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.
Chinese: Encode-Prefill-Decode(EPD)解耦框架将多模态模型各处理阶段分配至专用资源,相比集成系统显著提升了内存效率、批处理规模和服务性能。
English: The Encode-Prefill-Decode (EPD) Disaggregation framework separates multimodal model stages onto dedicated resources, significantly improving memory efficiency, batch sizes, and service performance compared to integrated systems.
Authors:Runjie Yan, Yinbo Chen, Xiaolong Wang
Abstract:
Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.
Authors:Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin
Abstract:
There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
中文: 本研究通过引入一种原理性的正则化相对损失,反驳了GAN难以训练的普遍观点,该损失无需经验性技巧并支持现代架构,由此产生的极简基线R3GAN在多个数据集上超越了StyleGAN2,并与顶尖模型相媲美。
English: The study refutes the common belief that GANs are hard to train by introducing a principled, regularized relativistic loss that eliminates the need for empirical tricks and enables the use of modern architectures, resulting in a minimalist baseline, R3GAN, which outperforms StyleGAN2 and competes with top models across multiple datasets.
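The flavor of a regularized relativistic pairing loss can be sketched as follows: the discriminator is trained on score differences between paired real and fake samples, with gradient penalties on both real and fake inputs (R1/R2). The penalty weight and exact form here are assumptions, not R3GAN's published loss.

```python
import torch
import torch.nn.functional as F

def rpgan_d_loss(disc, real, fake, gamma=10.0):
    """Relativistic pairing discriminator loss with R1/R2 gradient penalties.
    Illustrative sketch; gamma and the pairing scheme are placeholder choices."""
    real = real.detach().requires_grad_(True)
    fake = fake.detach().requires_grad_(True)
    logits_r, logits_f = disc(real), disc(fake)
    adv = F.softplus(logits_f - logits_r).mean()      # wants real scored above paired fake

    (r1,) = torch.autograd.grad(logits_r.sum(), real, create_graph=True)
    (r2,) = torch.autograd.grad(logits_f.sum(), fake, create_graph=True)
    penalty = r1.pow(2).flatten(1).sum(1).mean() + r2.pow(2).flatten(1).sum(1).mean()
    return adv + 0.5 * gamma * penalty

def rpgan_g_loss(disc, real, fake):
    return F.softplus(disc(real) - disc(fake)).mean()  # generator flips the pairing
```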
Authors:Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Abstract:
Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens on https://github.com/jim-berend/semanticlens and a demo on https://semanticlens.hhi-research-insights.eu.
中文: 本文提出SemanticLens方法,通过将神经网络组件映射到语义空间,实现可扩展的解释功能,有助于调试、验证并增强对AI系统的信任。
English: This paper introduces SemanticLens, a scalable method that explains neural networks by mapping their components to a semantic space, enabling debugging, validation, and enhanced trust in AI systems.
Authors:Fabian Hörst, Moritz Rempe, Helmut Becker, Lukas Heine, Julius Keyl, Jens Kleesiek
Abstract:
Digital Pathology is a cornerstone in the diagnosis and treatment of diseases. A key task in this field is the identification and segmentation of cells in hematoxylin and eosin-stained images. Existing methods for cell segmentation often require extensive annotated datasets for training and are limited to a predefined cell classification scheme. To overcome these limitations, we propose $\text{CellViT}^{\scriptscriptstyle ++}$, a framework for generalized cell segmentation in digital pathology. $\text{CellViT}^{\scriptscriptstyle ++}$ utilizes Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. To adapt to unseen cell types, we rely on a computationally efficient approach. It requires minimal data for training and leads to a drastically reduced carbon footprint. We demonstrate excellent performance on seven different datasets, covering a broad spectrum of cell types, organs, and clinical settings. The framework achieves remarkable zero-shot segmentation and data-efficient cell-type classification. Furthermore, we show that $\text{CellViT}^{\scriptscriptstyle ++}$ can leverage immunofluorescence stainings to generate training datasets without the need for pathologist annotations. The automated dataset generation approach surpasses the performance of networks trained on manually labeled data, demonstrating its effectiveness in creating high-quality training datasets without expert annotations. To advance digital pathology, $\text{CellViT}^{\scriptscriptstyle ++}$ is available as an open-source framework featuring a user-friendly, web-based interface for visualization and annotation. The code is available under https://github.com/TIO-IKIM/CellViT-plus-plus.
中文: CellViT++框架采用基于基础模型的视觉变换器,实现了数字病理学中的通用细胞分割,在多种数据集上表现出色,仅需少量训练数据即可完成零样本分割。
English: The proposed CellViT++ framework utilizes Vision Transformers with foundation models to achieve generalized cell segmentation in digital pathology, demonstrating excellent performance across diverse datasets while requiring minimal training data and enabling zero-shot segmentation.
Authors:Hounsu Kim, Taegyun Kwon, Juhan Nam
Abstract:
Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on discrete diffusion model's refinement capabilities and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy which applies distinct transition states during training and inference stage of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available in https://github.com/hanshounsu/d3rm.
Chinese: 本文提出了一种新颖的离散扩散模型用于钢琴转录,该模型利用邻域注意力层及独特的训练与推理转换策略,在MAESTRO数据集上取得了最优的F1分数。
English: This paper introduces a novel discrete diffusion model for piano transcription that leverages Neighborhood Attention layers and a unique transition strategy during training and inference, achieving state-of-the-art F1 scores on the MAESTRO dataset.
Authors:Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
Abstract:
The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at https://github.com/Rh-Dang/ECBench.
中文: ECBench是一个新颖的基准,旨在通过多样化的第一人称视角视频和30个认知维度,系统评估大型视觉语言模型的具身认知能力,弥补现有数据集的不足,并通过精细人工标注和综合评估体系确保公平性。
English: ECBench is a novel benchmark designed to systematically evaluate the embodied cognitive abilities of large vision-language models using diverse egocentric videos and 30 cognitive dimensions, addressing gaps in current datasets while ensuring fairness through meticulous human annotation and a comprehensive evaluation system.
Authors:HyunGi Kim, Siwon Kim, Jisoo Mok, Sungroh Yoon
Abstract:
Deep Neural Networks have spearheaded remarkable advancements in time series forecasting (TSF), one of the major tasks in time series modeling. Nonetheless, the non-stationarity of time series undermines the reliability of pre-trained source time series forecasters in mission-critical deployment settings. In this study, we introduce a pioneering test-time adaptation framework tailored for TSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source forecasters to continuously shifting test distributions while preserving the core semantic information learned during pre-training. The novel utilization of partially-observed ground truth and gated calibration module enables proactive, robust, and model-agnostic adaptation of source forecasters. Experiments on diverse benchmark datasets and cutting-edge architectures demonstrate the efficacy and generality of TAFAS, especially in long-term forecasting scenarios that suffer from significant distribution shifts. The code is available at https://github.com/kimanki/TAFAS.
中文: 深度神经网络推动了时间序列预测的显著进展,但非平稳性影响了其可靠性;为此提出的TAFAS测试时适应框架,能在保持预训练核心语义的同时,灵活调整预测器以适应持续变化的测试分布。
English: Deep Neural Networks have advanced time series forecasting, but their reliability is challenged by non-stationarity, leading to the development of TAFAS, a test-time adaptation framework that robustly adjusts forecasters to shifting distributions while preserving pre-trained knowledge.
Authors:Guannan Lai, Yihui Feng, Xin Yang, Xiaoyu Deng, Hao Yu, Shuyin Xia, Guoyin Wang, Tianrui Li
Abstract:
Federated Learning (FL) facilitates collaborative model training while prioritizing privacy by avoiding direct data sharing. However, most existing articles attempt to address challenges within the model's internal parameters and corresponding outputs, while neglecting to solve them at the input level. To address this gap, we propose a novel framework called Granular-Ball Federated Learning (GrBFL) for image classification. GrBFL diverges from traditional methods that rely on the finest-grained input data. Instead, it segments images into multiple regions with optimal coarse granularity, which are then reconstructed into a graph structure. We designed a two-dimensional binary search segmentation algorithm based on variance constraints for GrBFL, which effectively removes redundant information while preserving key representative features. Extensive theoretical analysis and experiments demonstrate that GrBFL not only safeguards privacy and enhances efficiency but also maintains robust utility, consistently outperforming other state-of-the-art FL methods. The code is available at https://github.com/AIGNLAI/GrBFL.
中文: 本文提出了一种名为粒度球联邦学习(GrBFL)的新框架,通过粗粒度分割和图像重构为图结构,在保护隐私的同时提高了联邦学习的效率和性能,优于现有先进方法。
English: This paper introduces Granular-Ball Federated Learning (GrBFL), a novel framework that processes images through coarse-grained segmentation and graph reconstruction to enhance privacy, efficiency, and utility in federated learning, outperforming existing methods.
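A toy version of variance-constrained coarse segmentation: recursively bisect a region along its longer axis until the intensity variance drops below a threshold, yielding coarse granules instead of raw pixels. The stopping rule and thresholds are placeholders, not GrBFL's exact two-dimensional binary-search algorithm.

```python
import numpy as np

def variance_split(img, thresh=25.0, min_size=8):
    """Recursively bisect an image region along its longer axis until the
    region's intensity variance falls below `thresh` (or the region becomes
    too small). Returns a list of (top, left, height, width) regions."""
    regions = []

    def recurse(top, left, h, w):
        patch = img[top:top + h, left:left + w]
        if patch.var() <= thresh or min(h, w) <= min_size:
            regions.append((top, left, h, w))
            return
        if h >= w:                     # split along the longer axis
            recurse(top, left, h // 2, w)
            recurse(top + h // 2, left, h - h // 2, w)
        else:
            recurse(top, left, h, w // 2)
            recurse(top, left + w // 2, h, w - w // 2)

    recurse(0, 0, *img.shape[:2])
    return regions
```

The resulting regions could then serve as nodes of the graph representation described in the abstract.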
Authors:Jake H. Lee, Michael Kiper, David R. Thompson, Philip G. Brodrick
Abstract:
Current and upcoming generations of visible-shortwave infrared (VSWIR) imaging spectrometers promise unprecedented capacity to quantify Earth System processes across the globe. However, reliable cloud screening remains a fundamental challenge for these instruments, where traditional spatial and temporal approaches are limited by cloud variability and limited temporal coverage. The Spectroscopic Transformer (SpecTf) addresses these challenges with a spectroscopy-specific deep learning architecture that performs cloud detection using only spectral information (no spatial or temporal data are required). By treating spectral measurements as sequences rather than image channels, SpecTf learns fundamental physical relationships without relying on spatial context. Our experiments demonstrate that SpecTf significantly outperforms the current baseline approach implemented for the EMIT instrument, and performs comparably with other machine learning methods with orders of magnitude fewer learned parameters. Critically, we demonstrate SpecTf's inherent interpretability through its attention mechanism, revealing physically meaningful spectral features the model has learned. Finally, we present SpecTf's potential for cross-instrument generalization by applying it to a different instrument on a different platform without modifications, opening the door to instrument agnostic data driven algorithms for future imaging spectroscopy tasks.
中文: 光谱变换器(SpecTf)提出了一种专用于光谱学的深度学习架构,仅利用光谱信息进行云检测,不仅显著优于现有方法,还具备可解释性和跨仪器通用性,无需依赖空间或时间数据。
English: The Spectroscopic Transformer (SpecTf) introduces a spectroscopy-specific deep learning model that uses only spectral data for cloud detection, outperforming existing methods while offering interpretability and cross-instrument generalization without requiring spatial or temporal information.
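The "bands as tokens" idea can be illustrated with a tiny classifier: each spectral band becomes one token (its value plus a band-index embedding), a transformer encoder mixes the bands, and a pooled representation predicts cloud versus clear. All sizes below are arbitrary, and this is not the SpecTf architecture.

```python
import torch
import torch.nn as nn

class SpectralTokenClassifier(nn.Module):
    """Minimal 'bands as tokens' classifier for a single pixel spectrum."""
    def __init__(self, n_bands=285, d_model=64, n_classes=2):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)
        self.band_embed = nn.Embedding(n_bands, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spectra):                      # (batch, n_bands) reflectances
        tokens = self.value_proj(spectra.unsqueeze(-1))
        tokens = tokens + self.band_embed.weight.unsqueeze(0)
        pooled = self.encoder(tokens).mean(dim=1)    # average over bands
        return self.head(pooled)
```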
Authors:Golriz Hosseinimanesh, Farnoosh Ghadiri, Francois Guibault, Farida Cheriet, Julia Keren
Abstract:
Designing a dental crown is a time-consuming and labor-intensive process. Our goal is to simplify crown design and minimize the tediousness of making manual adjustments while still ensuring the highest level of accuracy and consistency. To this end, we present a new end-to-end deep learning approach, coined Dental Mesh Completion (DMC), to generate a crown mesh conditioned on a point cloud context. The dental context includes the tooth prepared to receive a crown and its surroundings, namely the two adjacent teeth and the three closest teeth in the opposing jaw. We formulate crown generation in terms of completing this point cloud context. A feature extractor first converts the input point cloud into a set of feature vectors that represent local regions in the point cloud. The set of feature vectors is then fed into a transformer to predict a new set of feature vectors for the missing region (crown). Subsequently, a point reconstruction head, followed by a multi-layer perceptron, is used to predict a dense set of points with normals. Finally, a differentiable point-to-mesh layer serves to reconstruct the crown surface mesh. We compare our DMC method to a graph-based convolutional neural network which learns to deform a crown mesh from a generic crown shape to the target geometry. Extensive experiments on our dataset demonstrate the effectiveness of our method, which attains an average of 0.062 Chamfer Distance. The code is available at: https://github.com/Golriz-code/DMC.gi
中文: 本文提出DMC深度学习模型,通过点云数据自动生成牙冠网格,在保证精度的同时显著简化了传统牙冠设计的繁琐流程。
English: This paper introduces Dental Mesh Completion (DMC), an end-to-end deep learning method that automatically generates dental crown meshes from point cloud data to streamline design while maintaining high accuracy.
Authors:Seyed Amir Bidaki, Amir Mohammadkhah, Kiyan Rezaee, Faeze Hassani, Sadegh Eskandari, Maziar Salahi, Mohammad M. Ghassemi
Abstract:
Online Continual Learning (OCL) is a critical area in machine learning, focusing on enabling models to adapt to evolving data streams in real-time while addressing challenges such as catastrophic forgetting and the stability-plasticity trade-off. This study conducts the first comprehensive Systematic Literature Review (SLR) on OCL, analyzing 81 approaches, extracting over 1,000 features (specific tasks addressed by these approaches), and identifying more than 500 components (sub-models within approaches, including algorithms and tools). We also review 83 datasets spanning applications like image classification, object detection, and multimodal vision-language tasks. Our findings highlight key challenges, including reducing computational overhead, developing domain-agnostic solutions, and improving scalability in resource-constrained environments. Furthermore, we identify promising directions for future research, such as leveraging self-supervised learning for multimodal and sequential data, designing adaptive memory mechanisms that integrate sparse retrieval and generative replay, and creating efficient frameworks for real-world applications with noisy or evolving task boundaries. By providing a rigorous and structured synthesis of the current state of OCL, this review offers a valuable resource for advancing this field and addressing its critical challenges and opportunities. The complete SLR methodology steps and extracted data are publicly available through the provided link: https://github.com/kiyan-rezaee/Systematic-Literature-Review-on-Online-Continual-Learning
中文: 本研究首次对在线持续学习领域开展系统性文献综述,分析了81种方法并识别出计算效率等关键挑战,同时提出了自监督学习和自适应记忆机制等未来研究方向。
English: This study presents the first comprehensive systematic literature review on Online Continual Learning, analyzing 81 approaches and identifying key challenges like computational efficiency and future directions including self-supervised learning and adaptive memory mechanisms.
Authors:Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, Tolga Birdal
Abstract:
Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and $\perp$Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods. Code for this paper is available at https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.
中文: 本研究揭示了浮点误差导致的Softmax崩溃是阻碍深度学习模型"顿悟"现象的关键因素,并提出了StableMax激活函数和⊥Grad训练算法,可在无需正则化的情况下实现顿悟。
English: This study identifies Softmax Collapse caused by floating point errors as the key factor preventing grokking in deep learning models and introduces StableMax activation function and ⊥Grad training algorithm to enable grokking without regularization.
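One plausible reading of "preventing NLM" is to project each weight gradient onto the subspace orthogonal to the weight itself, removing the component that only rescales logits. The helper below sketches that projection; the paper's ⊥Grad may differ in detail, and StableMax is not reproduced here.

```python
import torch

@torch.no_grad()
def project_out_nlm(model):
    """After loss.backward(), remove from each parameter's gradient the
    component parallel to the parameter itself, i.e. the direction that
    only rescales the logits. A rough reading of 'prevent NLM'."""
    for p in model.parameters():
        if p.grad is None:
            continue
        w = p.view(-1)
        g = p.grad.view(-1)
        denom = w.dot(w).clamp_min(1e-12)
        g.sub_(w * (g.dot(w) / denom))   # g <- g - proj_w(g), in place
```

Calling `project_out_nlm(model)` between `loss.backward()` and `optimizer.step()` would apply the projection to a standard training loop.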
Authors:Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Ruihang Chu, Jin Zeng, Yujiu Yang
Abstract:
Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found at https://github.com/URSA-MATH.
中文: 本研究提出URSA三阶段框架,通过构建高质量多模态推理数据集、开发自动化过程监督方法及创新强化学习算法,显著提升多模态数学推理能力,在多个基准测试中超越主流模型表现。
English: This study introduces URSA, a three-stage framework that enhances multimodal mathematical reasoning by developing a high-quality dataset, creating automated process supervision, and implementing a novel reinforcement learning method, achieving superior performance over leading models.
Authors:Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda
Abstract:
The adoption of Artificial Intelligence in medical imaging holds great promise, yet it remains hindered by challenges such as data scarcity, privacy concerns, and the need for robust multimodal integration. While recent advances in generative modeling have enabled high-quality synthetic data generation, existing approaches are often limited to unimodal, unidirectional synthesis and therefore lack the ability to jointly synthesize multiple modalities while preserving clinical consistency. To address this challenge, we introduce XGeM, a 6.77-billion-parameter multimodal generative model designed to support flexible, any-to-any synthesis between medical data modalities. XGeM constructs a shared latent space via contrastive learning and introduces a novel Multi-Prompt Training strategy, enabling conditioning on arbitrary subsets of input modalities. This design allows the model to adapt to heterogeneous clinical inputs and generate multiple outputs jointly, preserving both semantic and structural coherence. We extensively validate XGeM: first we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for multi-view Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we show how XGeM can support key medical data challenges such as anonymization, class imbalance, and data scarcity, underscoring its utility as a foundation model for medical data synthesis. Project page is at https://cosbidev.github.io/XGeM/.
中文: XGeM是一个67.7亿参数的多模态生成模型,通过构建共享潜空间实现医疗数据模态间的灵活双向合成,在保持临床一致性的同时有效解决了数据稀缺和隐私保护等关键医学挑战。
English: XGeM is a 6.77-billion-parameter multimodal generative model that enables flexible any-to-any synthesis between medical data modalities while preserving clinical consistency, addressing key challenges like data scarcity and privacy through validated performance and expert assessment.
Authors:Eric Chen, Xi Chen, Arian Maleki, Shirin Jalali
Abstract:
Unrolled networks have become prevalent in various computer vision and imaging tasks. Although they have demonstrated remarkable efficacy in solving specific computer vision and computational imaging tasks, their adaptation to other applications presents considerable challenges. This is primarily due to the multitude of design decisions that practitioners working on new applications must navigate, each potentially affecting the network's overall performance. These decisions include selecting the optimization algorithm, defining the loss function, and determining the number of convolutional layers, among others. Compounding the issue, evaluating each design choice requires time-consuming simulations to train, fine-tune the neural network, and optimize for its performance. As a result, the process of exploring multiple options and identifying the optimal configuration becomes time-consuming and computationally demanding. The main objectives of this paper are (1) to unify some ideas and methodologies used in unrolled networks to reduce the number of design choices a user has to make, and (2) to report a comprehensive ablation study to discuss the impact of each of the choices involved in designing unrolled networks and present practical recommendations based on our findings. We anticipate that this study will help scientists and engineers design unrolled networks for their applications and diagnose problems within their networks efficiently.
Chinese: 本文旨在通过统一方法和提供全面的消融研究及实用建议,简化展开网络的设计过程,从而降低其在新应用中适应时的复杂性和计算需求。
English: This paper aims to simplify the design of unrolled networks by unifying methodologies and providing a comprehensive ablation study with practical recommendations to reduce the complexity and computational demands of adapting them to new applications.
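For orientation, a bare-bones unrolled network for a linear inverse problem y = Ax + noise looks like this: each iteration takes a gradient step on the data-fidelity term and applies a small learned correction, with per-iteration step sizes trained end to end. The depth, widths, and correction module are arbitrary choices of exactly the kind the paper's ablations examine.

```python
import torch
import torch.nn as nn

class UnrolledGD(nn.Module):
    """Tiny unrolled network for y = A x + noise: K gradient steps on the
    data-fidelity term, each followed by a learned residual correction."""
    def __init__(self, A, num_iters=8):
        super().__init__()
        self.register_buffer("A", A)                       # (m, n) forward operator
        self.step = nn.Parameter(torch.full((num_iters,), 0.1))
        self.prox = nn.ModuleList([
            nn.Sequential(nn.Linear(A.shape[1], 64), nn.ReLU(), nn.Linear(64, A.shape[1]))
            for _ in range(num_iters)
        ])

    def forward(self, y):                                  # y: (batch, m)
        x = torch.zeros(y.shape[0], self.A.shape[1], device=y.device)
        for k, prox in enumerate(self.prox):
            grad = (x @ self.A.T - y) @ self.A             # gradient of 0.5 * ||Ax - y||^2
            x = x - self.step[k] * grad
            x = x + prox(x)                                # learned residual correction
        return x
```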
Authors:Qingmei Wang, Yuxin Wu, Yujie Long, Jing Huang, Fengyuan Ran, Bing Su, Hongteng Xu
Abstract:
An event sequence generated by a temporal point process is often associated with a hidden and structured event branching process that captures the triggering relations between its historical and current events. In this study, we design a new plug-and-play module based on the Bregman ADMM (BADMM) algorithm, which infers event branches associated with event sequences in the maximum likelihood estimation framework of temporal point processes (TPPs). Specifically, we formulate the inference of event branches as an optimization problem for the event transition matrix under sparse and low-rank constraints, which is embedded in existing TPP models or their learning paradigms. We can implement this optimization problem based on subspace clustering and sparse group-lasso, respectively, and solve it using the Bregman ADMM algorithm, whose unrolling leads to the proposed BADMM module. When learning a classic TPP (e.g., Hawkes process) by the expectation-maximization algorithm, the BADMM module helps derive structured responsibility matrices in the E-step. Similarly, the BADMM module helps derive low-rank and sparse attention maps for the neural TPPs with self-attention layers. The structured responsibility matrices and attention maps, which work as learned event transition matrices, indicate event branches, e.g., inferring isolated events and those key events triggering many subsequent events. Experiments on both synthetic and real-world data show that plugging our BADMM module into existing TPP models and learning paradigms can improve model performance and provide us with interpretable structured event branches. The code is available at \url{https://github.com/qingmeiwangdaily/BADMM_TPP}.
Chinese: 本研究提出了一种即插即用的BADMM模块,通过稀疏和低秩约束优化事件转移矩阵,推断时间点过程中的结构化事件分支,从而提升模型性能并增强可解释性。
English: This study introduces a plug-and-play BADMM module that infers structured event branching in temporal point processes by optimizing event transition matrices under sparse and low-rank constraints, enhancing both model performance and interpretability.
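A plain proximal sketch of the sparse-plus-low-rank idea: alternate soft-thresholding and singular-value thresholding on an event-transition matrix, then renormalize rows. This is not the Bregman ADMM unrolling used by the paper's module, only the constraint structure it optimizes; the penalty weights are placeholders.

```python
import numpy as np

def sparse_lowrank_project(T, lam_sparse=0.05, lam_rank=0.5, n_iter=20):
    """Alternate sparse and low-rank proximal steps on a transition matrix T,
    keeping entries non-negative and rows approximately normalized."""
    M = T.copy()
    for _ in range(n_iter):
        M = np.sign(M) * np.maximum(np.abs(M) - lam_sparse, 0.0)    # sparsity prox
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        M = U @ np.diag(np.maximum(s - lam_rank, 0.0)) @ Vt         # low-rank prox
        M = np.maximum(M, 0.0)
        M /= M.sum(axis=1, keepdims=True) + 1e-12                   # rows act like responsibilities
    return M
```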
Authors:Xueqiang Ouyang, Jia Wei, Wenjie Huo, Xiaocong Wang, Rui Li, Jianlong Zhou
Abstract:
Temporal embryo images and parental fertility table indicators are both valuable for pregnancy prediction in \textbf{in vitro fertilization embryo transfer} (IVF-ET). However, current machine learning models cannot make full use of the complementary information between the two modalities to improve pregnancy prediction performance. In this paper, we propose a Decoupling Fusion Network called DeFusion to effectively integrate the multi-modal information for IVF-ET pregnancy prediction. Specifically, we propose a decoupling fusion module that decouples the information from the different modalities into related and unrelated information, thereby achieving a more delicate fusion. And we fuse temporal embryo images with a spatial-temporal position encoding, and extract fertility table indicator information with a table transformer. To evaluate the effectiveness of our model, we use a new dataset including 4046 cases collected from Southern Medical University. The experiments show that our model outperforms state-of-the-art methods. Meanwhile, the performance on the eye disease prediction dataset reflects the model's good generalization. Our code is available at https://github.com/Ou-Young-1999/DFNet.
中文: 提出的DeFusion网络通过解耦融合和专门编码,有效整合胚胎时序图像与生育表格数据,在IVF-ET妊娠预测中实现优越性能并展现良好泛化能力。
English: The proposed DeFusion network effectively integrates temporal embryo images and fertility table data through decoupled fusion and specialized encoding, achieving superior IVF-ET pregnancy prediction performance with demonstrated generalization capability.
Authors:Hyogon Ryu, NaHyeon Park, Hyunjung Shim
Abstract:
Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit ($<$ 8 bits) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters. Code is available at https://github.com/ugonfor/DGQ.
中文: 本文提出的分布感知分组量化(DGQ)方法通过自适应处理激活异常值并采用提示特定的对数量化尺度,成功解决了文本到图像扩散模型在低位量化中保持图像质量和图文一致性的难题,无需额外微调即可实现高效压缩。
English: This paper introduces Distribution-aware Group Quantization (DGQ), a novel method that effectively addresses the challenges of low-bit quantization in text-to-image diffusion models by adaptively handling activation outliers and applying prompt-specific logarithmic scales to preserve both image quality and text-image alignment without additional fine-tuning.
Authors:Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal
Abstract:
Handling incomplete and heterogeneous data remains a central challenge in real-world machine learning, where missing values may follow complex mechanisms (MCAR, MAR, MNAR) and features can be of mixed types (numerical and categorical). Existing methods often rely on imputation, which may introduce bias or privacy risks, or fail to jointly address data heterogeneity and structured missingness. We propose the \textbf{H}eterogeneous \textbf{I}ncomplete \textbf{P}robability \textbf{M}ass \textbf{K}ernel (\textbf{HI-PMK}), a novel data-dependent representation learning approach that eliminates the need for imputation. HI-PMK introduces two key innovations: (1) a probability mass-based dissimilarity measure that adapts to local data distributions across heterogeneous features (numerical, ordinal, nominal), and (2) a missingness-aware uncertainty strategy (MaxU) that conservatively handles all three missingness mechanisms by assigning maximal plausible dissimilarity to unobserved entries. Our approach is privacy-preserving, scalable, and readily applicable to downstream tasks such as classification and clustering. Extensive experiments on over 15 benchmark datasets demonstrate that HI-PMK consistently outperforms traditional imputation-based pipelines and kernel methods across a wide range of missing data settings. Code is available at: https://github.com/echoid/Incomplete-Heter-Kernel
中文: 提出的HI-PMK方法通过基于概率质量的差异度量和缺失感知策略,解决了不完整和异构数据的挑战,无需数据填补即可在不同缺失数据场景中超越现有方法。
English: The proposed HI-PMK method addresses incomplete and heterogeneous data challenges through a probability mass-based dissimilarity measure and missingness-aware strategy, eliminating imputation needs while outperforming existing methods across various missing data scenarios.
Authors:Hyungjin Chung, Dohun Lee, Zihui Wu, Byung-Hoon Kim, Katherine L. Bouman, Jong Chul Ye
Abstract:
Compressed sensing MRI seeks to accelerate MRI acquisition processes by sampling fewer k-space measurements and then reconstructing the missing data algorithmically. The success of these approaches often relies on strong priors or learned statistical models. While recent diffusion model-based priors have shown great potential, previous methods typically ignore clinically available metadata (e.g. patient demographics, imaging parameters, slice-specific information). In practice, metadata contains meaningful cues about the anatomy and acquisition protocol, suggesting it could further constrain the reconstruction problem. In this work, we propose ContextMRI, a text-conditioned diffusion model for MRI that integrates granular metadata into the reconstruction process. We train a pixel-space diffusion model directly on minimally processed, complex-valued MRI images. During inference, metadata is converted into a structured text prompt and fed to the model via CLIP text embeddings. By conditioning the prior on metadata, we unlock more accurate reconstructions and show consistent gains across multiple datasets, acceleration factors, and undersampling patterns. Our experiments demonstrate that increasing the fidelity of metadata, ranging from slice location and contrast to patient age, sex, and pathology, systematically boosts reconstruction performance. This work highlights the untapped potential of leveraging clinical context for inverse problems and opens a new direction for metadata-driven MRI reconstruction.
中文: ContextMRI是一种新型扩散模型,通过将临床元数据转化为文本提示来增强MRI重建效果,在多种数据集和加速因子下均实现了更精准的图像恢复。
English: ContextMRI is a novel diffusion model that enhances MRI reconstruction by incorporating clinical metadata as text prompts, achieving superior accuracy across various datasets and acceleration factors.
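The metadata-to-prompt step can be illustrated with the Hugging Face CLIP text encoder. The prompt template and metadata field names below are invented for illustration, and the actual system conditions a pixel-space diffusion model on these embeddings rather than returning them directly.

```python
from transformers import CLIPTokenizer, CLIPTextModel

def metadata_to_embedding(meta, model_name="openai/clip-vit-large-patch14"):
    """Turn a metadata dict into a structured text prompt and encode it with
    a CLIP text encoder. Field names and the template are illustrative."""
    prompt = (
        f"{meta.get('contrast', 'unknown contrast')} MRI, "
        f"slice {meta.get('slice', '?')}, "
        f"{meta.get('age', '?')}-year-old {meta.get('sex', 'patient')}, "
        f"pathology: {meta.get('pathology', 'none reported')}"
    )
    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    text_model = CLIPTextModel.from_pretrained(model_name)
    tokens = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    return text_model(**tokens).last_hidden_state   # (1, seq_len, hidden) conditioning
```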
Authors:Parv Kapoor, Kazuki Mizuta, Eunsuk Kang, Karen Leung
Abstract:
Signal Temporal Logic (STL) offers a concise yet expressive framework for specifying and reasoning about spatio-temporal behaviors of robotic systems. Attractively, STL admits the notion of robustness, the degree to which an input signal satisfies or violates an STL specification, thus providing a nuanced evaluation of system performance. In particular, the differentiability of STL robustness enables direct integration into robotic workflows that rely on gradient-based optimization, such as trajectory optimization and deep learning. However, existing approaches to evaluating and differentiating STL robustness rely on recurrent computations, which become inefficient with longer sequences, limiting their use in time-sensitive applications. In this paper, we present STLCG++, a masking-based approach that parallelizes STL robustness evaluation and backpropagation across timesteps, achieving significant speed-ups compared to a recurrent approach. We also introduce a smoothing technique to enable the differentiation of time interval bounds, thereby expanding STL's applicability in gradient-based optimization tasks involving spatial and temporal variables. Finally, we demonstrate STLCG++'s benefits through three robotics use cases and provide JAX and PyTorch libraries for seamless integration into modern robotics workflows. Project website with demo and code: https://uw-ctrl.github.io/stlcg/.
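To make the masking-over-timesteps idea concrete, the snippet below evaluates the robustness of an "always" operator over a sliding time window with a single unfold-plus-min, instead of a recurrent scan. The padding convention and the interface are assumptions for illustration, not the library's API.

```python
import torch
import torch.nn.functional as F

def always_robustness(rho, lo, hi):
    """Parallel robustness of 'always_[lo,hi] phi' at every start time, given
    the per-timestep robustness rho of phi with shape (..., T). Windows that
    run past the end of the trace are padded with +inf (treated as satisfied)."""
    T = rho.shape[-1]
    width = hi - lo + 1
    padded = F.pad(rho, (0, lo + width - 1), value=float("inf"))
    windows = padded.unfold(-1, width, 1)            # (..., T + lo, width)
    return windows[..., lo:lo + T, :].min(dim=-1).values
```

A smooth (softmin) replacement for the hard `min` would make the whole computation differentiable in the same vectorized form.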
Authors:Sungjae Park, Seungho Lee, Mingi Choi, Jiye Lee, Jeonghwan Kim, Jisoo Kim, Hanbyul Joo
Abstract:
We present a method for teaching dexterous manipulation tasks to robots from human hand motion demonstrations. Unlike existing approaches that solely rely on kinematics information without taking into account the plausibility of robot and object interaction, our method directly infers plausible robot manipulation actions from human motion demonstrations. To address the embodiment gap between the human hand and the robot system, our approach learns a joint motion manifold that maps human hand movements, robot hand actions, and object movements in 3D, enabling us to infer one motion component from others. Our key idea is the generation of pseudo-supervision triplets, which pair human, object, and robot motion trajectories synthetically. Through real-world experiments with robot hand manipulation, we demonstrate that our data-driven retargeting method significantly outperforms conventional retargeting techniques, effectively bridging the embodiment gap between human and robotic hands. Website at https://rureadyo.github.io/MocapRobot/.
Authors:Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
Abstract:
Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
Chinese: MM-Gen是一种可扩展的方法,通过为图像生成高质量的合成文本来提升视觉语言模型在专业任务上的性能,例如使Llava-1.5在空间推理和图表理解方面分别实现了29%和15%的显著提升。
English: MM-Gen is a scalable method that generates high-quality synthetic text for images to enhance vision-language models' performance on specialized tasks, achieving significant improvements such as 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5.
Authors:Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, Rui Yan
Abstract:
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce \textit{DrICL}, a novel optimization method that enhances model performance through \textit{Differentiated} and \textit{Reweighting} objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the \textit{Many-Shot ICL Benchmark} (ICL-50)-a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens-for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset hoping to facilitate further research in many-shot ICL\footnote{https://github.com/xiaoqzhwhu/DrICL}.
中文: 大语言模型在多样本上下文学习中因优化目标不理想和数据噪声导致性能下降,而提出的DrICL方法通过差异化学习和动态加权机制有效解决这些问题,在新构建的基准测试中显著提升了多样本场景下的任务表现。
English: Large language models face performance degradation in many-shot in-context learning due to suboptimal optimization objectives and data noise, which the proposed DrICL method addresses through differentiated learning and dynamic demonstration reweighting, achieving significant improvements across tasks as validated on a newly developed benchmark.
Authors:Yuqi Li, Xingyou Lin, Kai Zhang, Chuanguang Yang, Zhongliang Guo, Jianping Gou, Yanli Li
Abstract:
Federated Learning (FL) provides novel solutions for machine learning (ML)-based lithography hotspot detection (LHD) under distributed privacy-preserving settings. Currently, two research pipelines have been investigated to aggregate local models and achieve global consensus: parameter-based methods and nonparameter-based methods (also known as knowledge distillation, KD). While these two kinds of methods show effectiveness in specific scenarios, we note that they do not fully utilize and transfer the information learned, leaving the potential of FL-based LHD unexplored. Thus, we propose FedKD-hybrid in this study to mitigate the research gap. Specifically, FedKD-hybrid clients agree on several identical layers across all participants and a public dataset for achieving global consensus. During training, the trained local model will be evaluated on the public dataset, and the generated logits will be uploaded along with the identical layer parameters. The aggregated information is consequently used to update local models via the public dataset as a medium. We compare our proposed FedKD-hybrid with several state-of-the-art (SOTA) FL methods under ICCAD-2012 and FAB (real-world collected) datasets with different settings; the experimental results demonstrate the superior performance of the FedKD-hybrid algorithm. Our code is available at https://github.com/itsnotacie/NN-FedKD-hybrid
中文: 本研究提出FedKD-hybrid方法,通过结合参数与知识蒸馏策略,利用共享层和公共数据集提升联邦学习在光刻热点检测中的全局共识能力,实验证明其性能优于现有先进方法。
English: This study introduces FedKD-hybrid, a federated learning method that combines parameter and knowledge distillation approaches to enhance lithography hotspot detection by utilizing shared layers and a public dataset for improved global model consensus, outperforming existing methods in experiments.
Authors:Jiayao Gu, Liting Chen, Yihong Li
Abstract:
Data selection is critical for enhancing the performance of language models, particularly when aligning training datasets with a desired target distribution. This study explores the effects of different data selection methods and feature types on model performance. We evaluate whether selecting data subsets can influence downstream tasks, whether n-gram features improve alignment with target distributions, and whether embedding-based neural features provide complementary benefits. Through comparative experiments using baseline random selection methods and distribution aligned approaches, we provide insights into the interplay between data selection strategies and model training efficacy. All code for this study can be found on \href{https://github.com/jgu13/HIR-Hybrid-Importance-Resampling-for-Language-Models}{github repository}.
Chinese: 本研究通过对比实验探讨不同数据选择方法和特征类型如何影响语言模型性能,分析其与目标分布的匹配程度及对下游任务的作用。
English: This study investigates how various data selection methods and feature types impact language model performance, examining their alignment with target distributions and effects on downstream tasks through comparative experiments.
Authors:Avishai Elmakies, Omri Abend, Yossi Adi
Abstract:
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.
中文: 本文提出了一种无监督语音分割方法,利用语音语言模型处理多种声学语义风格变化,在边界检测和分段纯度方面优于基线方法。
English: This paper presents an unsupervised speech segmentation method that leverages Speech Language Models to handle multiple acoustic-semantic style changes, outperforming baselines in boundary detection and segment purity.
Authors:Liyue Chen, Jiangyi Fang, Tengfei Liu, Fangyuan Gao, Leye Wang
Abstract:
In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP) models leverage contextual features (e.g., weather) to identify unusual crowd mobility patterns and enhance prediction accuracy. However, the best practice for incorporating contextual features remains unclear due to inconsistent usage of contextual features in different papers. Developing a multifaceted dataset with rich types of contextual features and STCFP scenarios is crucial for establishing a principled context modeling paradigm. Existing open crowd flow datasets lack an adequate range of contextual features, which poses an urgent requirement to build a multifaceted dataset to fill these research gaps. To this end, we create STContext, a multifaceted dataset for developing context-aware STCFP models. Specifically, STContext provides nine spatio-temporal datasets across five STCFP scenarios and includes ten contextual features, including weather, air quality index, holidays, points of interest, road networks, etc. Besides, we propose a unified workflow for incorporating contextual features into deep STCFP methods, with steps including feature transformation, dependency modeling, representation fusion, and training strategies. Through extensive experiments, we have obtained several useful guidelines for effective context modeling and insights for future research. The STContext is open-sourced at https://github.com/Liyue-Chen/STContext.
中文: STContext数据集旨在解决现有开放人群流量数据中上下文特征不足的问题,提供了多场景数据和统一的工作流程,以提升预测准确性并为未来研究提供指导。
English: The STContext dataset is introduced to address the lack of diverse contextual features in existing crowd flow prediction models, providing comprehensive data and a unified workflow to improve accuracy and guide future research.
Authors:NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski
Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
中文: 物理AI需要自身和世界的数字孪生,Cosmos平台通过可定制的世界基础模型及开源工具提供支持,以解决社会关键问题。
English: Physical AI requires digital twins of itself and the world, which the Cosmos platform provides through customizable world foundation models and open-source tools to address societal challenges.
Authors:Fatemeh Ghofrani, Pooyan Jamshidi
Abstract:
Self-supervised learning (SSL) has significantly advanced image representation learning, yet efficiency challenges persist, particularly with adversarial training. Many SSL methods require extensive epochs to achieve convergence, a demand further amplified in adversarial settings. To address this inefficiency, we revisit the robust EMP-SSL framework, emphasizing the importance of increasing the number of crops per image to accelerate learning. Unlike traditional contrastive learning, robust EMP-SSL leverages multi-crop sampling, integrates an invariance term and regularization, and reduces training epochs, enhancing time efficiency. Evaluated with both standard linear classifiers and multi-patch embedding aggregation, robust EMP-SSL provides new insights into SSL evaluation strategies.
Our results show that robust crop-based EMP-SSL not only accelerates convergence but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming multi-crop embedding aggregation. Additionally, we extend this approach with free adversarial training in Multi-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the effectiveness of free adversarial training in reducing training time while simultaneously improving clean accuracy and adversarial robustness. These findings underscore the potential of CF-AMC-SSL for practical SSL applications. Our code is publicly available at https://github.com/softsys4ai/CF-AMC-SSL.
中文摘要:鲁棒EMP-SSL框架通过增加图像裁剪数量和减少训练周期来加速自监督学习,而扩展的CF-AMC-SSL方法通过免费对抗训练进一步提升了效率与鲁棒性。
English Summary: The robust EMP-SSL framework accelerates self-supervised learning by increasing image crops and reducing training epochs, while the extended CF-AMC-SSL method further enhances efficiency and robustness through free adversarial training.
Authors:Nandan Kumar Jha, Brandon Reagen
Abstract:
The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI.
By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: {\em entropy collapse} in deeper layers that destabilizes training, and {\em entropic overload} in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity.
We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm
中文摘要:本研究提出了一种信息论框架,通过分析非线性在Transformer模型中的双重作用来解决私有推理中的通信和延迟挑战,并引入熵引导机制在保持模型稳定性和注意力多样性的同时优化架构设计。
English Summary: This study introduces an information-theoretic framework to address the communication and latency challenges in private inference by analyzing nonlinearities' dual role in transformer models, proposing entropy-guided mechanisms to optimize architectures while maintaining model stability and attention diversity.
Authors:Liyang Qin, Xiaoli Wang, Chunhua Yang, Huaiwen Zou, Haochuan Zhang
Abstract:
Among the existing Transformer-based multivariate time series forecasting methods, iTransformer, which treats each variable sequence as a token and only explicitly extracts cross-variable dependencies, and PatchTST, which adopts a channel-independent strategy and only explicitly extracts cross-time dependencies, both significantly outperform most Channel-Dependent Transformers that simultaneously extract cross-time and cross-variable dependencies. This indicates that existing Transformer-based multivariate time series forecasting methods still struggle to effectively fuse these two types of information. We attribute this issue to the dynamic time lags in the causal relationships between different variables. Therefore, we propose a new multivariate time series forecasting Transformer, Sensorformer, which first compresses the global patch information and then simultaneously extracts cross-variable and cross-time dependencies from the compressed representations. Sensorformer can effectively capture the correct inter-variable correlations and causal relationships, even in the presence of dynamic causal lags between variables, while also reducing the computational complexity of pure cross-patch self-attention from $O(D^2 \cdot Patch\_num^2 \cdot d\_model)$ to $O(D^2 \cdot Patch\_num \cdot d\_model)$. Extensive comparative and ablation experiments on 9 mainstream real-world multivariate time series forecasting datasets demonstrate the superiority of Sensorformer. The implementation of Sensorformer, following the style of the Time-series-library, together with scripts for reproducing the main results, is publicly available at https://github.com/BigYellowTiger/Sensorformer.
中文: Sensorformer是一种新型Transformer模型,通过压缩全局片段信息有效捕捉多元时间序列预测中的跨变量和跨时间依赖关系,解决了动态因果滞后问题并降低了计算复杂度。
English: Sensorformer is a novel Transformer model that effectively captures cross-variable and cross-time dependencies in multivariate time series forecasting by compressing global patch information, addressing dynamic causal lags while reducing computational complexity.
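To make the complexity claim above concrete, here is a minimal PyTorch sketch of the compress-then-attend idea: each variable's patch sequence is pooled into a single summary token, and all (variable, patch) tokens attend only to these D summaries, so the attention score matrix scales with D·Patch_num·D rather than (D·Patch_num)². The module name, pooling choice, and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CompressedCrossAttention(nn.Module):
    """Attend from all (variable, patch) tokens to per-variable compressed
    summaries, so the score matrix is (D*P) x D instead of (D*P) x (D*P)."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.compress = nn.Linear(d_model, d_model)   # summarizes a variable's patches
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, D, P, d_model) -- patch embeddings per variable
        b, d_vars, p, d_model = x.shape
        # one compressed token per variable: mean-pool patches, then project
        summary = self.compress(x.mean(dim=2))          # (batch, D, d_model)
        queries = x.reshape(b, d_vars * p, d_model)     # all patch tokens
        out, _ = self.attn(queries, summary, summary)   # scores: (D*P) x D
        return out.reshape(b, d_vars, p, d_model)

if __name__ == "__main__":
    layer = CompressedCrossAttention(d_model=64)
    x = torch.randn(8, 7, 12, 64)   # 7 variables, 12 patches each
    print(layer(x).shape)           # torch.Size([8, 7, 12, 64])
```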
Authors:Haozhen Zhang, Haodong Yue, Xi Xiao, Le Yu, Qing Li, Zhen Ling, Ye Zhang
Abstract:
With the growing significance of network security, the classification of encrypted traffic has emerged as an urgent challenge. Traditional byte-based traffic analysis methods are constrained by the rigid granularity of information and fail to fully exploit the diverse correlations between bytes. To address these limitations, this paper introduces MH-Net, a novel approach for classifying network traffic that leverages multi-view heterogeneous traffic graphs to model the intricate relationships between traffic bytes. The essence of MH-Net lies in aggregating varying numbers of traffic bits into multiple types of traffic units, thereby constructing multi-view traffic graphs with diverse information granularities. By accounting for different types of byte correlations, such as header-payload relationships, MH-Net further endows the traffic graph with heterogeneity, significantly enhancing model performance. Notably, we employ contrastive learning in a multi-task manner to strengthen the robustness of the learned traffic unit representations. Experiments conducted on the ISCX and CIC-IoT datasets for both the packet-level and flow-level traffic classification tasks demonstrate that MH-Net achieves the best overall performance compared to dozens of SOTA methods.
中文: 本文提出MH-Net加密流量分类方法,通过构建多视图异构图捕捉字节间复杂关联并采用对比学习,在基准数据集上实现了最优性能。
English: This paper proposes MH-Net, a novel encrypted traffic classification method that constructs multi-view heterogeneous graphs to capture complex byte relationships and employs contrastive learning, achieving state-of-the-art performance on benchmark datasets.
Authors:Qi Wang, Marco Federici, Herke van Hoof
Abstract:
The neural process (NP) is a family of computationally efficient models for learning distributions over functions. However, it suffers from under-fitting and shows suboptimal performance in practice. Researchers have primarily focused on incorporating diverse structural inductive biases, \textit{e.g.} attention or convolution, in modeling. The topic of inference suboptimality and an analysis of the NP from the optimization objective perspective have hardly been studied in earlier work. To fix this issue, we propose a surrogate objective of the target log-likelihood of the meta dataset within the expectation maximization framework. The resulting model, referred to as the Self-normalized Importance weighted Neural Process (SI-NP), can learn a more accurate functional prior and has an improvement guarantee concerning the target log-likelihood. Experimental results show the competitive performance of SI-NP over other NP objectives and illustrate that structural inductive biases, such as attention modules, can also augment our method to achieve SOTA performance. Our code is available at \url{https://github.com/hhq123gogogo/SI_NPs}.
Chinese: 自归一化重要性加权神经过程(SI-NP)通过期望最大化框架中的替代目标优化,解决了神经过程拟合不足和性能欠佳的问题,提升了函数先验学习能力并取得了优越性能。
English: The Self-normalized Importance weighted Neural Process (SI-NP) is introduced to address the under-fitting and suboptimal performance of neural processes by optimizing a surrogate objective within the expectation maximization framework, enhancing functional prior learning and achieving competitive results.
Authors:Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen
Abstract:
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT or GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for responses to each prompt, which can lead to overfitting on simpler prompts, vulnerability to reward hacking, and biased estimates. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model and instead uses unbiased global advantage normalization to improve training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
中文: REINFORCE++是一种创新的强化学习方法,它通过移除评论家网络并采用全局优势归一化来提高训练稳定性,在使大语言模型与人类偏好对齐方面展现出鲁棒性能和卓越的泛化能力。
English: REINFORCE++ is a novel reinforcement learning method that eliminates the critic network and employs global advantage normalization to enhance training stability, demonstrating robust performance and superior generalization in aligning large language models with human preferences.
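The key difference from per-prompt (group-wise) advantage estimation can be shown in a few lines. Below is a hedged sketch contrasting a GRPO/RLOO-style per-prompt normalization with the batch-wide normalization REINFORCE++ advocates; it is not the OpenRLHF implementation, and the reward values are made up.

```python
import torch

def grpo_style_advantages(rewards: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
    """Per-prompt normalization (GRPO/RLOO-style baseline, shown for contrast)."""
    adv = torch.zeros_like(rewards)
    for pid in prompt_ids.unique():
        mask = prompt_ids == pid
        group = rewards[mask]
        adv[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return adv

def global_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Global normalization over the whole batch: one shared mean/std for every
    prompt, which avoids per-prompt overfitting and keeps the estimate unbiased."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

if __name__ == "__main__":
    rewards = torch.tensor([0.9, 1.0, 0.95, 0.1, 0.8, 0.2])   # hypothetical scores
    prompts = torch.tensor([0, 0, 0, 1, 1, 1])                 # two prompts, three responses each
    print(grpo_style_advantages(rewards, prompts))
    print(global_advantages(rewards))
```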
Authors:Guoxuan Chen, Lianghao Xia, Chao Huang
Abstract:
Graph neural networks (GNNs) have demonstrated superior performance in collaborative recommendation through their ability to conduct high-order representation smoothing, effectively capturing structural information within users' interaction patterns. However, existing GNN paradigms face significant challenges in scalability and robustness when handling large-scale, noisy, and real-world datasets. To address these challenges, we present LightGNN, a lightweight and distillation-based GNN pruning framework designed to substantially reduce model complexity while preserving essential collaboration modeling capabilities. Our LightGNN framework introduces a computationally efficient pruning module that adaptively identifies and removes redundant edges and embedding entries for model compression. The framework is guided by a resource-friendly hierarchical knowledge distillation objective, whose intermediate layer augments the observed graph to maintain performance, particularly in high-rate compression scenarios. Extensive experiments on public datasets demonstrate LightGNN's effectiveness, significantly improving both computational efficiency and recommendation accuracy. Notably, LightGNN achieves an 80% reduction in edge count and 90% reduction in embedding entries while maintaining performance comparable to more complex state-of-the-art baselines. The implementation of our LightGNN framework is available at the github repository: https://github.com/HKUDS/LightGNN.
中文摘要:LightGNN是一种轻量级图神经网络框架,通过知识蒸馏自适应剪枝冗余边和嵌入,在保持性能的同时显著提升推荐系统的计算效率与模型压缩率。
English Summary: LightGNN is a lightweight graph neural network framework that enhances collaborative filtering by adaptively pruning redundant edges and embeddings through knowledge distillation, achieving significant model compression with maintained performance.
Authors:Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
Abstract:
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address these issues, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few-shot learning's 1.2%. Moreover, it can achieve an additional 7.5% gain when combined with tree search. Surprisingly, it also enables state-of-the-art LLMs to solve challenging math problems using simpler examples: it improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
中文摘要:BoostStep通过步骤对齐的上下文学习机制和有效示范策略,显著提升大语言模型在数学推理中的准确性,在多项测试中取得突破性性能提升。
English Summary: BoostStep enhances large language models' mathematical reasoning by aligning in-context learning examples with specific reasoning steps and using relevant exemplars, significantly improving accuracy across benchmarks.
Authors:Ali Al-Lawati, Jason Lucas, Prasenjit Mitra
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in various NLP tasks, including semantic parsing, which translates natural language into formal code representations. However, the reverse process, translating code into natural language, termed semantic captioning, has received less attention. This task is becoming increasingly important as LLMs are integrated into platforms for code generation, security analysis, and educational purposes. In this paper, we focus on the captioning of SQL queries (SQL2Text) to address the critical need for understanding and explaining SQL queries in an era where LLM-generated code poses potential security risks. We repurpose Text2SQL datasets for SQL2Text by introducing an iterative ICL prompt using GPT-4o to generate multiple additional utterances, which enhances the robustness of the datasets for the reverse task. We conduct our experiments using in-context learning (ICL) based on different sample selection methods, emphasizing smaller, more computationally efficient LLMs. Our findings demonstrate that leveraging the inherent graph properties of SQL for ICL sample selection significantly outperforms random selection by up to 39% on BLEU score and provides better results than alternative methods. The dataset and code are available at https://github.com/aliwister/ast-icl.
中文: 本研究针对将SQL查询转化为自然语言的语义描述任务,通过GPT-4o增强数据集并证明基于图结构的示例选择方法能显著提升小规模语言模型的性能表现,效果优于随机选择达39%。
English: This study addresses the semantic captioning task of translating SQL queries into natural language (SQL2Text) by repurposing datasets with GPT-4o-generated utterances and demonstrating that graph-based sample selection for in-context learning significantly outperforms random methods, especially in smaller LLMs.
Authors:Valery Istomin, Oleg Pereziabov, Ilya Afanasyev
Abstract:
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
中文: 本研究提出了一种结合深度学习轮廓检测与计算机视觉技术的文档复原方法,通过非线性畸变校正显著提升了OCR可读性和几何复原指标,性能优于当前主流方案。
English: This study introduces an efficient document restoration method combining deep learning for outline detection with computer vision techniques to correct distortions, demonstrating superior performance in OCR readability and geometry metrics compared to existing solutions.
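As a rough illustration of the classical-CV stage described above, the sketch below fits cubic polynomials to (assumed already detected) top and bottom document outlines, builds a dense 2D sampling grid between them, and remaps the image with OpenCV. The synthetic outlines, grid resolution, and the simple linear blend between the two curves are assumptions for demonstration, not the authors' pipeline.

```python
import cv2
import numpy as np

def dewarp_with_cubic_grid(image, top_pts, bottom_pts, out_w=800, out_h=600):
    """Fit cubic polynomials to the detected top/bottom document outlines and
    remap the image so both curves become straight horizontal lines.
    (Illustrative CV step only; outline detection is assumed done by a DL model.)"""
    top_poly = np.polyfit(top_pts[:, 0], top_pts[:, 1], deg=3)
    bot_poly = np.polyfit(bottom_pts[:, 0], bottom_pts[:, 1], deg=3)

    xs = np.linspace(top_pts[:, 0].min(), top_pts[:, 0].max(), out_w)
    y_top = np.polyval(top_poly, xs)          # source y of the top edge per column
    y_bot = np.polyval(bot_poly, xs)          # source y of the bottom edge per column

    # Destination pixel (i, j) samples the source at x = xs[j] and a y that
    # interpolates linearly between the two fitted curves.
    t = np.linspace(0.0, 1.0, out_h)[:, None]                       # (out_h, 1)
    map_y = (y_top[None, :] * (1 - t) + y_bot[None, :] * t).astype(np.float32)
    map_x = np.tile(xs.astype(np.float32), (out_h, 1))
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_CUBIC)

if __name__ == "__main__":
    img = np.full((600, 800, 3), 255, np.uint8)
    cv2.putText(img, "curved page", (200, 300), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 0), 3)
    x = np.linspace(0, 799, 20)
    top = np.stack([x, 50 + 30 * np.sin(x / 200)], axis=1)          # synthetic curved outlines
    bottom = np.stack([x, 550 + 30 * np.sin(x / 200)], axis=1)
    flat = dewarp_with_cubic_grid(img, top, bottom)
    print(flat.shape)
```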
Authors:Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Viren Bajaj, Zeya Ahmad
Abstract:
Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework.
Chinese: LangFair 是一个开源 Python 工具包,旨在帮助大语言模型从业者通过生成评估数据集和计算相关指标来评估偏见与公平性风险,并提供可操作的指标选择框架。
English: LangFair is an open-source Python package designed to help LLM practitioners assess bias and fairness risks by enabling easy generation of evaluation datasets and calculation of relevant metrics, supported by a decision framework for metric selection.
Authors:Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
Abstract:
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new challenging benchmark for single-object and multi-object I2V generation and demonstrate our method's superiority on it. Project page is available at https://guyyariv.github.io/TTM/.
Authors:Duygu Sezen Islakoglu, Jan-Christoph Kalo
Abstract:
Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has raised increasing research attention. However, comprehensive testing of Allen's interval relations (e.g., before, after, during) -- a fundamental framework for temporal relationships -- remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs' temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models' low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.
中文: 大型语言模型在时间推理方面存在不足,因此开发了ChronoSense基准测试,揭示其表现不稳定且依赖记忆,凸显了提升时间理解能力的必要性。
English: Large Language Models struggle with temporal reasoning, prompting the creation of ChronoSense, a benchmark that reveals their inconsistent performance and reliance on memorization, underscoring the need for enhanced temporal understanding.
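For readers unfamiliar with Allen's interval algebra, the snippet below enumerates the thirteen relations for a pair of intervals; benchmark items of the kind described above can be generated by sampling interval pairs and asking a model to name the relation. The relation labels follow Allen's standard taxonomy and are not necessarily ChronoSense's exact task format.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen interval relation of interval A with respect to B."""
    if a_end < b_start:   return "before"
    if a_start > b_end:   return "after"
    if a_end == b_start:  return "meets"
    if a_start == b_end:  return "met-by"
    if a_start == b_start and a_end == b_end: return "equal"
    if a_start == b_start: return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:     return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end: return "during"
    if a_start < b_start and b_end < a_end: return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"

if __name__ == "__main__":
    # abstract event pairs of the kind a temporal benchmark might sample
    print(allen_relation(1990, 1995, 1995, 2001))  # meets
    print(allen_relation(2003, 2010, 2005, 2008))  # contains
```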
Authors:Can Gao, Xiaofeng Tan, Jie Zhou, Weiping Ding, Witold Pedrycz
Abstract:
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data and has been extensively studied and used in a variety of practical tasks. However, most unsupervised outlier detection methods are carefully designed to detect specified outliers, while real-world data may be entangled with different types of outliers. In this study, we propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers. Specifically, a novel fuzzy rough sets-based method that integrates relative fuzzy granule density is first introduced to improve the capability of detecting local outliers. Then, a multi-scale view generation method based on granular-ball computing is proposed to collaboratively identify group outliers at different levels of granularity. Moreover, reliable outliers and inliers determined by the three-way decision are used to train a weighted support vector machine to further improve the performance of outlier detection. The proposed method innovatively transforms unsupervised outlier detection into a semi-supervised classification problem and for the first time explores fuzzy rough sets-based outlier detection from the perspective of multi-scale granular balls, allowing for high adaptability to different types of outliers. Extensive experiments carried out on both artificial and UCI datasets demonstrate that the proposed outlier detection method significantly outperforms the state-of-the-art methods, improving the results by at least 8.48% in terms of the Area Under the ROC Curve (AUROC) index. The source code is released at \url{https://github.com/Xiaofeng-Tan/MGBOD}.
中文: 本研究提出了一种基于模糊粗糙集的多尺度异常检测方法,将无监督检测转化为半监督分类问题,在AUROC指标上显著优于现有方法至少8.48%。
English: This study introduces a fuzzy rough sets-based multi-scale outlier detection method that transforms unsupervised detection into semi-supervised classification, significantly outperforming existing techniques by at least 8.48% in AUROC metrics.
Authors:Stephan Goerttler, Yucheng Wang, Emadeldeen Eldele, Min Wu, Fei He
Abstract:
Recent advancements in machine learning-based signal analysis, coupled with open data initiatives, have fuelled efforts in automatic sleep stage classification. Despite the proliferation of classification models, few have prioritised reducing model complexity, which is a crucial factor for practical applications. In this work, we introduce Multi-Scale and Attention Convolutional Neural Network (MSA-CNN), a lightweight architecture featuring as few as ~10,000 parameters. MSA-CNN leverages a novel multi-scale module employing complementary pooling to eliminate redundant filter parameters and dense convolutions. Model complexity is further reduced by separating temporal and spatial feature extraction and using cost-effective global spatial convolutions. This separation of tasks not only reduces model complexity but also mirrors the approach used by human experts in sleep stage scoring. We evaluated both small and large configurations of MSA-CNN against nine state-of-the-art baseline models across three public datasets, treating univariate and multivariate models separately. Our evaluation, based on repeated cross-validation and re-evaluation of all baseline models, demonstrated that the large MSA-CNN outperformed all baseline models on all three datasets in terms of accuracy and Cohen's kappa, despite its significantly reduced parameter count. Lastly, we explored various model variants and conducted an in-depth analysis of the key modules and techniques, providing deeper insights into the underlying mechanisms. The code for our models, baselines, and evaluation procedures is available at https://github.com/sgoerttler/MSA-CNN.
中文摘要:本研究提出的MSA-CNN轻量级神经网络仅含约1万个参数,在三个数据集上的睡眠分期分类任务中,以显著降低的模型复杂度超越了九个先进基线模型。
English Summary: This study introduces MSA-CNN, a lightweight neural network with only about 10,000 parameters that outperforms nine state-of-the-art models in sleep stage classification across three datasets despite its reduced complexity.
Authors:Shi Bin Hoo, Samuel Müller, David Salinas, Frank Hutter
Abstract:
Foundation models have become increasingly popular for forecasting due to their ability to provide predictions without requiring a lot of training data. In this work, we demonstrate how TabPFN-v2, a general tabular foundation model, can be effectively applied to time series forecasting. We introduce TabPFN-TS, a simple method that combines TabPFN-v2 with lightweight feature engineering to enable both point and probabilistic forecasting. Despite its simplicity and compact size (11M parameters), TabPFN-TS achieves top rank on the public GIFT-Eval leaderboard in both forecasting tasks. Through ablation studies, we investigate factors contributing to this surprising effectiveness, especially considering TabPFN-v2 was pretrained solely on synthetic tabular data with no exposure to time series. Our results highlight the potential of tabular foundation models like TabPFN-v2 as a valuable new approach for time series forecasting. Our implementation is available at https://github.com/PriorLabs/tabpfn-time-series.
基础模型如TabPFN-v2为时间序列预测提供了新方法,尽管仅基于合成表格数据预训练,但通过轻量级特征工程即在基准测试中取得领先性能。
Foundation models like TabPFN-v2 offer a novel approach to time series forecasting, achieving top performance on benchmarks through minimal feature engineering despite being pretrained only on synthetic tabular data.
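The "forecasting as tabular regression with lightweight feature engineering" recipe can be sketched as follows. A scikit-learn regressor stands in for TabPFN-v2 purely to keep the example self-contained; the actual method plugs the tabular foundation model into this role, and the feature set shown here is an assumption rather than the paper's exact design.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def make_time_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Lightweight timestamp features of the kind the abstract alludes to."""
    return pd.DataFrame({
        "dayofweek": index.dayofweek,
        "dayofyear": index.dayofyear,
        "sin_doy": np.sin(2 * np.pi * index.dayofyear / 365.25),
        "cos_doy": np.cos(2 * np.pi * index.dayofyear / 365.25),
        "t": np.arange(len(index)),           # running index captures trend
    }, index=index)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    idx = pd.date_range("2022-01-01", periods=400, freq="D")
    y = 10 + 0.02 * np.arange(400) + 2 * np.sin(2 * np.pi * idx.dayofweek / 7) + rng.normal(0, 0.3, 400)
    X = make_time_features(idx)
    split = 350
    model = GradientBoostingRegressor()       # stand-in; TabPFN-v2 plays this role in the paper
    model.fit(X.iloc[:split], y[:split])
    print(model.predict(X.iloc[split:])[:5])  # point forecasts for the held-out horizon
```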
Authors:Sahar Salimpour, Jorge Peña-Queralta, Diego Paez-Granados, Jukka Heikkonen, Tomi Westerlund
Abstract:
Unprecedented agility and dexterous manipulation have been demonstrated with controllers based on deep reinforcement learning (RL), with a significant impact on legged and humanoid robots. Modern tooling and simulation platforms, such as NVIDIA Isaac Sim, have been enabling such advances. This article focuses on demonstrating the applications of Isaac in local planning and obstacle avoidance as one of the most fundamental ways in which a mobile robot interacts with its environment. Although there is extensive research on proprioception-based RL policies, the article highlights less standardized and reproducible approaches to exteroception. At the same time, the article aims to provide a base framework for end-to-end local navigation policies and how a custom robot can be trained in such a simulation environment. We benchmark end-to-end policies against Nav2, the state-of-the-art navigation stack in the Robot Operating System (ROS). We also cover the sim-to-real transfer process by demonstrating zero-shot transferability of policies trained in the Isaac simulator to real-world robots. This is further evidenced by tests with different simulated robots, which show the generalization of the learned policy. Finally, the benchmarks demonstrate comparable performance to Nav2, opening the door to quick deployment of state-of-the-art end-to-end local planners for custom robot platforms, and, importantly, furthering the possibilities by expanding the state and action spaces or task definitions for more complex missions. Overall, with this article we introduce the most important steps, and aspects to consider, in deploying RL policies for local path planning and obstacle avoidance with Isaac Sim training, Gazebo testing, and ROS 2 for real-time inference in real robots. The code is available at https://github.com/sahars93/RL-Navigation.
中文: 本文展示了基于NVIDIA Isaac Sim的深度强化学习如何实现移动机器人的局部路径规划与避障,其性能媲美Nav2导航系统,并成功完成了从仿真到现实环境的策略迁移。
English: This article demonstrates how deep reinforcement learning in NVIDIA Isaac Sim enables effective local planning and obstacle avoidance for mobile robots, achieving comparable performance to Nav2 and showcasing successful sim-to-real transfer.
Authors:Niloufar Eghbali, Hassan Bagher-Ebadian, Tuka Alhanai, Mohammad M. Ghassemi
Abstract:
Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long-range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre-training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture enhancing Transformer-based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet demonstrates significant improvements over state-of-the-art models, achieving a 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG-CSUnet's flexible design allows integration with various base models, offering a promising approach for incorporating radiomics-inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.
中文: GLoG-CSUnet架构通过整合自适应Gabor和拉普拉斯高斯滤波器来增强视觉Transformer在医学图像分割中的表现,能有效捕捉细微纹理特征,在基准数据集上以极小的计算开销实现了性能突破。
English: The GLoG-CSUnet architecture enhances Vision Transformers for medical image segmentation by integrating adaptive Gabor and Laplacian of Gaussian filters to capture fine texture details, achieving superior performance on benchmark datasets with minimal computational overhead.
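As a rough sketch of the learnable texture filters mentioned above, the module below builds a small bank of Gabor kernels whose orientation, width, and wavelength are trainable parameters and applies them as a convolution; the LoG branch and the integration into the Swin-style backbone are omitted, and the parameterization is illustrative rather than the authors' exact design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGabor2d(nn.Module):
    """A small bank of Gabor filters with learnable orientation, scale, and
    wavelength, applied as a convolution over a single-channel feature map."""

    def __init__(self, n_filters: int = 4, kernel_size: int = 7):
        super().__init__()
        self.kernel_size = kernel_size
        self.theta = nn.Parameter(torch.linspace(0, math.pi, n_filters))   # orientation
        self.sigma = nn.Parameter(torch.full((n_filters,), 2.0))           # Gaussian width
        self.lam = nn.Parameter(torch.full((n_filters,), 4.0))             # wavelength

    def build_kernels(self) -> torch.Tensor:
        half = self.kernel_size // 2
        ys, xs = torch.meshgrid(
            torch.arange(-half, half + 1, dtype=torch.float32),
            torch.arange(-half, half + 1, dtype=torch.float32),
            indexing="ij",
        )
        theta = self.theta[:, None, None]
        x_rot = xs * torch.cos(theta) + ys * torch.sin(theta)
        y_rot = -xs * torch.sin(theta) + ys * torch.cos(theta)
        sigma = self.sigma[:, None, None]
        lam = self.lam[:, None, None]
        gabor = torch.exp(-(x_rot ** 2 + y_rot ** 2) / (2 * sigma ** 2)) * torch.cos(2 * math.pi * x_rot / lam)
        return gabor.unsqueeze(1)  # (n_filters, 1, k, k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, H, W)
        return F.conv2d(x, self.build_kernels(), padding=self.kernel_size // 2)

if __name__ == "__main__":
    layer = LearnableGabor2d()
    img = torch.randn(2, 1, 64, 64)
    print(layer(img).shape)   # torch.Size([2, 4, 64, 64])
```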
Authors:Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, Dan Alistarh
Abstract:
Quantized training of Large Language Models (LLMs) remains an open challenge, as maintaining accuracy while performing all matrix multiplications in low precision has proven difficult. This is particularly the case when fine-tuning pre-trained models, which can have large weight and activation outlier values that make lower-precision optimization difficult. To address this, we present HALO, a novel quantization-aware training approach for Transformers that enables accurate and efficient low-precision training by combining 1) strategic placement of Hadamard rotations in both forward and backward passes, which mitigate outliers, 2) high-performance kernel support, and 3) FSDP integration for low-precision communication. Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision. Applied to LLAMA-family models, HALO achieves near-full-precision-equivalent results during fine-tuning on various tasks, while delivering up to 1.41x end-to-end speedup for full fine-tuning on RTX 4090 GPUs. HALO efficiently supports both standard and parameter-efficient fine-tuning (PEFT). Our results demonstrate the first practical approach to fully quantized LLM fine-tuning that maintains accuracy in 8-bit precision, while delivering performance benefits. Code is available at \url{https://github.com/IST-DASLab/HALO}.
中文: HALO提出了一种新颖的Transformer量化感知训练方法,通过巧妙运用哈达玛旋转和高性能内核,实现了LLMs在低精度下的精确微调,在保持接近全精度结果的同时获得最高1.41倍的加速效果。
English: HALO introduces a novel quantization-aware training method for Transformers that enables accurate low-precision fine-tuning of LLMs by strategically using Hadamard rotations and high-performance kernels, achieving near-full-precision results with up to 1.41x speedup.
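The intuition behind the Hadamard rotations can be demonstrated numerically: rotating activations by an orthonormal Hadamard matrix H and counter-rotating the weights leaves the product unchanged but spreads outlier energy across channels, so symmetric 8-bit quantization loses much less accuracy. The sketch below shows only this effect; HALO's fused kernels, backward-pass placement, and FSDP integration are not represented.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n (n a power of two) orthonormal Hadamard matrix."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)

def fake_quant(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization."""
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(x / scale) * scale

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 256
    X = torch.randn(32, d)
    X[:, 7] *= 50.0                      # inject an activation outlier channel
    W = torch.randn(d, d) / d ** 0.5
    H = hadamard(d)

    ref = X @ W
    naive = fake_quant(X) @ fake_quant(W)
    # Rotate activations and counter-rotate weights: (X H)(H^T W) == X W exactly,
    # but the rotated tensors have no extreme outliers, so 8-bit quantization hurts less.
    rotated = fake_quant(X @ H) @ fake_quant(H.T @ W)
    print("naive relative error  :", ((naive - ref).norm() / ref.norm()).item())
    print("rotated relative error:", ((rotated - ref).norm() / ref.norm()).item())
```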
Authors:Jiaping Wang, Simiao Zhang, Qiao-Chu He, Yifan Chen
Abstract:
The machine learning and data science community has made significant but dispersed progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with \emph{exponentially decaying causal linear attention}. In this paper, we present LeetDecoding, which is the first Python package that provides a large set of computation routines for this fundamental operator. The launch of LeetDecoding was motivated by the current lack of (1) clear understanding of the complexity regarding this operator, (2) a comprehensive collection of existing computation methods (usually spread in seemingly unrelated fields), and (3) CUDA implementations for fast inference on GPU. LeetDecoding's design is easy to integrate with existing linear-attention LLMs, and allows researchers to benchmark and evaluate new computation methods for exponentially decaying causal linear attention. The usage of LeetDecoding does not require any knowledge of GPU programming and the underlying complexity analysis, intentionally making LeetDecoding accessible to LLM practitioners. The source code of LeetDecoding is provided at \href{https://github.com/Computational-Machine-Intelligence/LeetDecoding}{this GitHub repository}, and users can simply install LeetDecoding by the command \texttt{pip install leet-decoding}.
中文: LeetDecoding是首个提供指数衰减因果线性注意力计算功能的Python工具包,旨在解决该算子理解不足、方法分散和CUDA实现缺失的问题,支持无缝集成与性能评估,且无需GPU编程知识即可使用。
English: LeetDecoding is the first Python package offering comprehensive computation routines for exponentially decaying causal linear attention in transformers, addressing the lack of understanding, method collection, and CUDA implementations while enabling easy integration and benchmarking without requiring GPU programming expertise.
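The operator itself is compact enough to state in code. Below is a hedged reference sketch of exponentially decaying causal linear attention in two equivalent forms, a quadratic definition and the linear-time recurrence that packages like LeetDecoding accelerate; normalization, feature maps, and batching are omitted, and this is not one of the package's own routines.

```python
import torch

def decaying_linear_attention_naive(q, k, v, lam: float):
    """Quadratic reference: o_t = sum_{s<=t} lam^(t-s) (q_t . k_s) v_s."""
    n = q.shape[0]
    out = torch.zeros_like(v)
    for t in range(n):
        for s in range(t + 1):
            out[t] += lam ** (t - s) * (q[t] @ k[s]) * v[s]
    return out

def decaying_linear_attention_recurrent(q, k, v, lam: float):
    """O(n) recurrence: S_t = lam * S_{t-1} + k_t v_t^T, o_t = S_t^T q_t."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = torch.zeros(d_k, d_v)
    out = torch.zeros_like(v)
    for t in range(q.shape[0]):
        S = lam * S + torch.outer(k[t], v[t])
        out[t] = S.T @ q[t]
    return out

if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 4)
    a = decaying_linear_attention_naive(q, k, v, lam=0.9)
    b = decaying_linear_attention_recurrent(q, k, v, lam=0.9)
    print(torch.allclose(a, b, atol=1e-5))   # True: both compute the same operator
```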
Authors:Lin Wang, Qing Li
Abstract:
Graph condensation reduces the size of large graphs while preserving performance, addressing the scalability challenges of Graph Neural Networks caused by computational inefficiencies on large datasets. Existing methods often rely on bi-level optimization, requiring extensive GNN training and limiting their scalability. To address these issues, this paper proposes Graph Condensation via Gaussian Process (GCGP), a novel and computationally efficient approach to graph condensation. GCGP utilizes a Gaussian Process (GP), with the condensed graph serving as observations, to estimate the posterior distribution of predictions. This approach eliminates the need for the iterative and resource-intensive training typically required by GNNs. To enhance the capability of the GCGP in capturing dependencies between function values, we derive a specialized covariance function that incorporates structural information. This covariance function broadens the receptive field of input nodes by local neighborhood aggregation, thereby facilitating the representation of intricate dependencies within the nodes. To address the challenge of optimizing binary structural information in condensed graphs, Concrete random variables are utilized to approximate the binary adjacency matrix in a continuous counterpart. This relaxation process allows the adjacency matrix to be represented in a differentiable form, enabling the application of gradient-based optimization techniques to discrete graph structures. Experimental results show that the proposed GCGP method efficiently condenses large-scale graph data while preserving predictive performance, addressing the scalability and efficiency challenges. The implementation of our method is publicly available at https://github.com/WANGLin0126/GCGP.
中文摘要:本文提出基于高斯过程的图压缩方法(GCGP),通过高斯过程替代传统图神经网络训练,利用定制协方差函数和梯度优化技术,在保持预测性能的同时高效压缩大规模图数据,解决了可扩展性难题。
English Summary: This paper introduces Graph Condensation via Gaussian Process (GCGP), a novel method that efficiently reduces graph size using Gaussian Processes to bypass intensive GNN training, incorporating specialized covariance functions and gradient optimization to maintain performance while addressing scalability.
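One piece of the method that translates directly into code is the Concrete (binary Gumbel-softmax) relaxation of the condensed graph's adjacency matrix, which makes the discrete structure differentiable. The sketch below shows only that relaxation; the Gaussian process posterior and the structural covariance function are omitted, and the temperature and symmetrization choices are assumptions.

```python
import torch
import torch.nn as nn

class RelaxedAdjacency(nn.Module):
    """Binary-Concrete relaxation of a learnable adjacency matrix: edges are
    sampled as differentiable values in (0, 1), so the condensed graph's
    structure can be optimized by gradient descent."""

    def __init__(self, n_nodes: int, temperature: float = 0.5):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_nodes, n_nodes))
        self.temperature = temperature

    def forward(self) -> torch.Tensor:
        u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
        noise = torch.log(u) - torch.log1p(-u)                 # Logistic(0, 1) sample
        soft_adj = torch.sigmoid((self.logits + noise) / self.temperature)
        soft_adj = (soft_adj + soft_adj.T) / 2                 # keep the graph undirected
        return soft_adj * (1 - torch.eye(soft_adj.shape[0]))   # drop self-loops

if __name__ == "__main__":
    adj = RelaxedAdjacency(n_nodes=5)
    a = adj()                       # differentiable "adjacency" in (0, 1)
    a.sum().backward()              # placeholder objective
    print(a.shape, adj.logits.grad is not None)   # torch.Size([5, 5]) True
```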
Authors:Tara Radvand, Mojtaba Abdolmaleki, Mohamed Mostagir, Ambuj Tewari
Abstract:
Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs $A$ (non-sanctioned) and $B$ (in-house), and (ii) identify whether text was generated by a known LLM or generated by any unknown model, e.g., a human or some other language generation process. We prove that the type I and type II errors of our test decrease exponentially with the length of the text. For that, we show that if $B$ generates the text, then except with an exponentially small probability in string length, the log-perplexity of the string under $A$ converges to the average cross-entropy of $B$ and $A$. We then present experiments using LLMs with white-box access to support our theoretical results and empirically examine the robustness of our results to black-box settings and adversarial attacks. In the black-box setting, our method achieves an average TPR of 82.5\% at a fixed FPR of 5\%. Under adversarial perturbations, our minimum TPR is 48.6\% at the same FPR threshold. Both results outperform all non-commercial baselines. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.
Chinese: 本文提出了一种通过将文本建模为随机过程并使用零样本统计测试来验证其是否由特定大型语言模型生成的方法,该方法在理论和实验中均表现出色,错误率随文本长度呈指数级下降,并在白盒和黑盒环境下均优于现有基线。
English: This paper introduces a method to verify whether text is generated by a specific large language model by modeling it as a stochastic process and using zero-shot statistical tests, with proven exponential error reduction and strong performance in both white-box and black-box settings.
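At inference time the test reduces to comparing a string's log-perplexity under the candidate model against a threshold sitting between that model's typical self-perplexity and the higher cross-entropy produced by other sources. The sketch below illustrates just this decision rule; the per-token log-probabilities and the threshold are hypothetical numbers, and the paper's statistical guarantees and calibration procedure are not reproduced.

```python
from typing import List

def log_perplexity(token_logprobs: List[float]) -> float:
    """Average negative log-likelihood per token of a string under some model."""
    return -sum(token_logprobs) / len(token_logprobs)

def attribute_text(logprobs_under_A: List[float], threshold: float) -> str:
    """Schematic decision rule: if the text's log-perplexity under model A stays
    close to A's typical self-perplexity, attribute it to A; otherwise to some
    other source B whose cross-entropy against A is higher."""
    return "A" if log_perplexity(logprobs_under_A) < threshold else "B / other"

if __name__ == "__main__":
    # hypothetical per-token log-probs scored by model A for two strings
    text_from_A = [-1.2, -0.8, -1.5, -0.9, -1.1]
    text_from_B = [-3.4, -2.9, -4.1, -3.8, -3.2]
    threshold = 2.0   # illustrative value between H(A) and the A-vs-B cross-entropy
    print(attribute_text(text_from_A, threshold))   # A
    print(attribute_text(text_from_B, threshold))   # B / other
```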
Authors:Ambroise Odonnat, Wassim Bouaziz, Vivien Cabannes
Abstract:
Gradient descent is the method of choice for training large artificial intelligence systems. As these systems become larger, a better understanding of the mechanisms behind gradient training would allow us to alleviate compute costs and help steer these systems away from harmful behaviors. To that end, we suggest utilizing the circuit perspective brought forward by mechanistic interpretability. After laying out our intuition, we illustrate how it enables us to design a curriculum for efficient learning in a controlled setting. The code is available at \url{https://github.com/facebookresearch/pal}.
中文: 梯度下降是训练大型人工智能系统的关键方法,通过机制可解释性理解其机制可以降低计算成本并避免有害行为,正如在受控环境中设计的课程所展示的那样。
English: Gradient descent is essential for training large AI systems, and understanding its mechanisms through mechanistic interpretability can reduce computational costs and prevent harmful behaviors, as demonstrated by a designed curriculum in a controlled setting.
Authors:Zongwei Li, Lianghao Xia, Hua Hua, Shijie Zhang, Shuangyang Wang, Chao Huang
Abstract:
Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: https://github.com/HKUDS/DiffGraph.
中文摘要:本文提出的异质图扩散模型通过跨视图去噪策略和潜在扩散机制,有效解决了异质图中噪声干扰和语义转换的难题,在多项图学习任务中实现了最优性能。
English Summary: The proposed Heterogeneous Graph Diffusion Model (DiffGraph) addresses noise and semantic transition challenges in heterogeneous graphs through cross-view denoising and latent diffusion mechanisms, achieving state-of-the-art performance in graph learning tasks.
Authors:Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, Guangyao Shi
Abstract:
Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification [93]. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs. Specifically, we provide a systematic overview of VLMs in the following aspects: [1] model information of the major VLMs developed up to 2025; [2] the transition of VLM architectures and the newest VLM alignment methods; [3] summary and categorization of the popular benchmarks and evaluation metrics of VLMs; [4] the challenges and issues faced by current VLMs such as hallucination, alignment, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Vision-Language-Models-Overview.
中文: 多模态视觉语言模型融合计算机视觉与自然语言处理,使机器能通过视觉和文本模态感知与推理,本综述系统梳理了其发展历程、架构、评估基准及幻觉与安全等挑战。
English: Multimodal Vision Language Models (VLMs) integrate computer vision and natural language processing to enable machines to perceive and reason through visual and textual data, with this survey systematically reviewing their development, architectures, benchmarks, and challenges like hallucination and safety.
Authors:Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li
Abstract:
With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term ``safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at \url{https://github.com/Ziwei-Zheng/SAHs}.
中文: 大型视觉语言模型比纯文本模型更易受安全风险威胁,但其内部的"安全头部"能在首个令牌生成时有效识别恶意提示,从而以最小计算成本实现高效防护。
English: Large vision-language models are more vulnerable to safety risks than text-only models, but their internal "safety heads" can detect malicious prompts during the first token generation, enabling effective protection with minimal computational cost.
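The detector described above is essentially a logistic regression over the concatenated first-token activations of a few attention heads. The sketch below mimics that recipe on synthetic activations; in practice the features would come from forward hooks on an LVLM, and the (layer, head) indices used here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def head_features(head_acts: np.ndarray, safety_heads: list) -> np.ndarray:
    """Concatenate the first-token activations of the selected (layer, head) pairs.
    head_acts: (n_prompts, n_layers, n_heads, head_dim)."""
    return np.concatenate([head_acts[:, l, h, :] for l, h in safety_heads], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, layers, heads, dim = 200, 8, 8, 16
    acts = rng.normal(size=(n, layers, heads, dim))   # stand-in for hooked activations
    labels = rng.integers(0, 2, size=n)               # 1 = malicious prompt
    acts[labels == 1, 3, 5, :] += 1.5                 # pretend head (3, 5) reacts to malicious prompts

    X = head_features(acts, safety_heads=[(3, 5), (6, 2)])   # hypothetical safety heads
    clf = LogisticRegression(max_iter=1000).fit(X[:150], labels[:150])
    print("held-out accuracy:", clf.score(X[150:], labels[150:]))
```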
Authors:Hwa Hui Tew, Fan Ding, Gaoxuan Li, Junn Yong Loo, Chee-Ming Ting, Ze Yang Ding, Chee Pin Tan
Abstract:
Higher-order sensor networks are more accurate in characterizing the nonlinear dynamics of sensory time-series data in modern industrial settings by allowing multi-node connections beyond simple pairwise graph edges. In light of this, we propose a deep spatio-temporal hypergraph convolutional neural network for soft sensing (ST-HCSS). In particular, our proposed framework is able to construct and leverage a higher-order graph (hypergraph) to model the complex multi-interactions between sensor nodes in the absence of prior structural knowledge. To capture rich spatio-temporal relationships underlying sensor data, our proposed ST-HCSS incorporates stacked gated temporal and hypergraph convolution layers to effectively aggregate and update hypergraph information across time and nodes. Our results validate the superiority of ST-HCSS compared to existing state-of-the-art soft sensors, and demonstrate that the learned hypergraph feature representations align well with the sensor data correlations. The code is available at https://github.com/htew0001/ST-HCSS.git
中文: 提出的深度时空超图卷积神经网络(ST-HCSS)通过超图结构有效建模传感器间的复杂交互并捕捉丰富的时空关系,验证了其优于现有软传感器的性能表现。
English: The proposed deep spatio-temporal hypergraph convolutional neural network (ST-HCSS) effectively models complex sensor interactions through hypergraph structures and captures rich spatio-temporal relationships, demonstrating superior performance over existing soft sensors.
Authors:Keng Hou Leong, Yuxuan Xiu, Wai Kin Chan
Abstract:
The representations of conditional entropy and conditional mutual information are significant in explaining the unique effects among variables. While previous studies based on conditional contrastive sampling have effectively removed information regarding discrete sensitive variables, they have not yet extended their scope to continuous cases. This paper introduces Information Subtraction, a framework designed to generate representations that preserve desired information while eliminating the undesired. We implement a generative-based architecture that outputs these representations by simultaneously maximizing an information term and minimizing another. With its flexibility in disentangling information, we can iteratively apply Information Subtraction to represent arbitrary information components between continuous variables, thereby explaining the various relationships that exist between them. Our results highlight the representations' ability to provide semantic features of conditional entropy. By subtracting sensitive and domain-specific information, our framework demonstrates effective performance in fair learning and domain generalization. The code for this paper is available at https://github.com/jh-liang/Information-Subtraction
中文摘要:本文提出的信息减法框架通过最大化目标信息和最小化无关信息来生成表征,能够有效分离连续变量间的信息成分,在公平学习和领域泛化中表现出色。
English Summary: This paper introduces Information Subtraction, a flexible framework that generates representations by maximizing desired information and minimizing unwanted information, enabling effective performance in fair learning and domain generalization by disentangling information components between continuous variables.
Authors:Jiahao Qin, Feng Liu
Abstract:
Electrocardiogram (ECG) analysis plays a crucial role in diagnosing cardiovascular diseases, but accurate interpretation of these complex signals remains challenging. This paper introduces a novel multimodal framework (GAF-FusionNet) for ECG classification that integrates time-series analysis with image-based representation using Gramian Angular Fields (GAF). Our approach employs a dual-layer cross-channel split attention module to adaptively fuse temporal and spatial features, enabling nuanced integration of complementary information. We evaluate GAF-FusionNet on three diverse ECG datasets: ECG200, ECG5000, and the MIT-BIH Arrhythmia Database. Results demonstrate significant improvements over state-of-the-art methods, with our model achieving 94.5%, 96.9%, and 99.6% accuracy on the respective datasets. Our code will soon be available at https://github.com/Cross-Innovation-Lab/GAF-FusionNet.git.
中文摘要:本文提出GAF-FusionNet多模态心电图分类框架,通过格拉米角场融合时序分析与图像表征,在三个基准数据集上实现了最先进的分类准确率。
English Summary: This paper presents GAF-FusionNet, a multimodal ECG classification framework that combines time-series analysis with image-based representations using Gramian Angular Fields, achieving state-of-the-art accuracy across three benchmark datasets.
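The image branch starts from the standard Gramian Angular Field encoding, which the snippet below reproduces for a single 1-D segment; the dual-layer cross-channel split attention fusion is not shown, and the toy sine wave merely stands in for an ECG segment.

```python
import numpy as np

def gramian_angular_field(series: np.ndarray, kind: str = "summation") -> np.ndarray:
    """Encode a 1-D series as a Gramian Angular Field image.
    Rescale to [-1, 1], take phi = arccos(x), then GASF = cos(phi_i + phi_j)
    or GADF = sin(phi_i - phi_j)."""
    x = np.asarray(series, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1        # rescale to [-1, 1]
    x = np.clip(x, -1.0, 1.0)
    phi = np.arccos(x)
    if kind == "summation":
        return np.cos(phi[:, None] + phi[None, :])
    return np.sin(phi[:, None] - phi[None, :])

if __name__ == "__main__":
    beat = np.sin(np.linspace(0, 4 * np.pi, 96))           # stand-in for one ECG segment
    gasf = gramian_angular_field(beat)
    print(gasf.shape)          # (96, 96) image fed to the CNN branch
```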
Authors:Shvetank Prakash, Andrew Cheng, Jason Yik, Arya Tschand, Radhika Ghosal, Ikechukwu Uchendu, Jessica Quaye, Jeffrey Ma, Shreyas Grampurohit, Sofia Giannuzzi, Arnav Balyan, Fin Amin, Aadya Pipersenia, Yash Choudhary, Ankita Nayak, Amir Yazdanbakhsh, Vijay Janapa Reddi
Abstract:
We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/.
Authors:Fengrui Zhang, Yujia Yin, Hongzong Li, Yifan Chen, Tianyi Qu
Abstract:
Despite significant advancements in causal research on graphs and its application to cracking label imbalance, the role of edge features in detecting the causal effects within graphs has been largely overlooked, leaving existing methods with untapped potential for further performance gains. In this paper, we enhance the causal attention mechanism through effectively leveraging edge information to disentangle the causal subgraph from the original graph, as well as further utilizing edge features to reshape graph representations. Capturing more comprehensive causal signals, our design leads to improved performance on graph classification tasks with label imbalance issues. We evaluate our approach on the real-world datasets PTC, Tox21, and ogbg-molhiv, observing improvements over baselines. Overall, we highlight the importance of edge features in graph causal detection and provide a promising direction for addressing label imbalance challenges in graph-level tasks. The model implementation details and code are available at https://github.com/fengrui-z/ECAL
中文摘要:本研究通过将边特征融入因果注意力机制,增强了图因果检测能力,有效解决了标签不平衡问题,并在真实数据集上提升了分类性能。
English Summary: This study enhances graph causal detection by integrating edge features into the causal attention mechanism, effectively addressing label imbalance and improving classification performance on real-world datasets.
Authors:Tien Dang, Viet Thanh Duy Nguyen, Minh Tuan Le, Truong-Son Hy
Abstract:
Biomedical Knowledge Graphs (BKGs) integrate diverse datasets to elucidate complex relationships within the biomedical field. Effective link prediction on these graphs can uncover valuable connections, such as potential novel drug-disease relations. We introduce a novel multimodal approach that unifies embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to enhance intra-entity relationships while employing a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for effective link prediction. To address limitations in existing BKGs, we present PrimeKG++, an enriched knowledge graph incorporating multimodal data, including biological sequences and textual descriptions for each entity type. By combining semantic and relational information in a unified representation, our approach demonstrates strong generalizability, enabling accurate link predictions even for unseen nodes. Experimental results on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate the effectiveness and robustness of our method across diverse biomedical datasets. Our source code, pre-trained models, and data are publicly available at https://github.com/HySonLab/BioMedKG
我们提出的多模态方法将语言模型嵌入与图对比学习和知识图谱嵌入相结合,显著提升了生物医学链接预测效果,在PrimeKG++和DrugBank等丰富知识图谱数据集上表现出卓越性能。
Our novel multimodal approach integrates language model embeddings with graph contrastive learning and knowledge graph embeddings to enhance biomedical link prediction, demonstrating strong performance on enriched knowledge graphs like PrimeKG++ and DrugBank datasets.
Authors:Ved G. Shah, Alex Gagliano, Konstantin Malanchev, Gautham Narayan, The LSST Dark Energy Science Collaboration
Abstract:
We present ORACLE, the first hierarchical deep-learning model for real-time, context-aware classification of transient and variable astrophysical phenomena. ORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), and has been trained using a custom hierarchical cross-entropy loss function to provide high-confidence classifications along an observationally-driven taxonomy with as little as a single photometric observation. Contextual information for each object, including host galaxy photometric redshift, offset, ellipticity and brightness, is concatenated to the light curve embedding and used to make a final prediction. Training on $\sim$0.5M events from the Extended LSST Astronomical Time-Series Classification Challenge, we achieve a top-level (Transient vs Variable) macro-averaged precision of 0.96 using only 1 day of photometric observations after the first detection in addition to contextual information, for each event; this increases to $>$0.99 once 64 days of the light curve has been obtained, and 0.83 at 1024 days after first detection for 19-way classification (including supernova sub-types, active galactic nuclei, variable stars, microlensing events, and kilonovae). We also compare ORACLE with other state-of-the-art classifiers and report comparable performance for the 19-way classification task, in addition to delivering accurate top-level classifications much earlier. The code and model weights used in this work are publicly available at our associated GitHub repository (https://github.com/uiucsn/ELAsTiCC-Classification).
中文: ORACLE是首个利用门控循环单元和上下文数据的分层深度学习模型,能够实时分类天体物理现象,仅需少量观测即可实现高精度识别,并在早期检测中优于其他分类器。
English: ORACLE is the first hierarchical deep-learning model using GRUs and contextual data to classify astrophysical phenomena in real time, achieving high precision with minimal observations and outperforming other classifiers in early-stage detection.
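Illustrative sketch (not the released ORACLE code): the core pattern described in the abstract, encoding a photometric light curve with a GRU, concatenating host-galaxy context features to the embedding, and attaching separate heads for the top-level and 19-way predictions. All layer sizes, feature counts, and names below are assumptions.

```python
# Minimal sketch of a GRU light-curve encoder with concatenated contextual
# features and two classification heads; sizes and names are illustrative.
import torch
import torch.nn as nn

class TinyHierarchicalClassifier(nn.Module):
    def __init__(self, n_features=7, n_context=4, hidden=64, n_top=2, n_fine=19):
        super().__init__()
        self.gru = nn.GRU(input_size=n_features, hidden_size=hidden, batch_first=True)
        self.top_head = nn.Linear(hidden + n_context, n_top)    # e.g. Transient vs Variable
        self.fine_head = nn.Linear(hidden + n_context, n_fine)  # e.g. 19-way leaf classes

    def forward(self, light_curve, context):
        # light_curve: (B, T, n_features) per-epoch photometry; context: (B, n_context)
        _, h = self.gru(light_curve)
        z = torch.cat([h[-1], context], dim=-1)
        return self.top_head(z), self.fine_head(z)

model = TinyHierarchicalClassifier()
top_logits, fine_logits = model(torch.randn(8, 32, 7), torch.randn(8, 4))
print(top_logits.shape, fine_logits.shape)  # torch.Size([8, 2]) torch.Size([8, 19])
```

A hierarchical cross-entropy loss would then combine the two heads so that confident coarse predictions are available after only the first observations.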
Authors:George Yuanji Wang, Srisharan Murugesan, Aditya Prince Rohatgi
Abstract:
Identifying druggable genes is essential for developing effective pharmaceuticals. With the availability of extensive, high-quality data, computational methods have become a significant asset. The Protein Interaction Network (PIN) is valuable but challenging to utilize due to its high dimensionality and sparsity. Previous methods relied on indirect integration, leading to resolution loss. This study proposes GAN-TAT, a framework utilizing an advanced graph embedding technology, ImGAGN, to directly integrate the PIN for druggable gene inference. Tested on three Pharos datasets, GAN-TAT achieved the highest AUC-ROC score of 0.951 on Tclin. Further evaluation shows that GAN-TAT's predictions are supported by clinical evidence, highlighting its potential practical applications in pharmacogenomics. This research represents a methodological attempt at the direct utilization of PINs, opening up potential new solutions for developing drug targets. The source code of GAN-TAT is available at https://github.com/george-yuanji-wang/GAN-TAT.
中文: 本研究提出GAN-TAT框架,通过先进图嵌入技术直接整合蛋白质相互作用网络来精准识别可成药基因,在测试中表现优异,并展现出在药物基因组学中的实际应用潜力。
English: This study introduces GAN-TAT, a framework that directly integrates Protein Interaction Networks using advanced graph embedding to accurately identify druggable genes, achieving top performance in tests and demonstrating practical potential in pharmacogenomics.
Authors:Jingfeng Yao, Bin Yang, Xinggang Wang
Abstract:
Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs--representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: https://github.com/hustvl/LightningDiT.
中文: 该研究揭示了潜在扩散模型中视觉分词器质量与计算效率之间的优化困境,提出VA-VAE方法将潜在空间与预训练视觉模型对齐,通过集成LightningDiT系统实现了最先进的图像生成性能,并显著加快了收敛速度。
English: The study identifies an optimization dilemma in latent diffusion models where improving visual tokenizer quality conflicts with computational efficiency, and proposes VA-VAE to align latent spaces with pre-trained vision models, achieving state-of-the-art image generation with significantly faster convergence through the integrated LightningDiT system.
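Illustrative sketch of the general idea of aligning tokenizer latents with frozen vision-foundation-model features; the cosine objective, the projection layer, and all dimensions are assumptions about the broad recipe, not VA-VAE's exact loss.

```python
# Minimal sketch: project VAE latents to the foundation feature dimension and
# penalize cosine disagreement with frozen features at matching spatial positions.
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(32, 768)          # latent channels -> assumed foundation feature dim

def alignment_loss(z: torch.Tensor, vfm_feat: torch.Tensor) -> torch.Tensor:
    # z: (B, 32, H, W) tokenizer latent; vfm_feat: (B, 768, H, W) frozen features
    z_proj = proj(z.permute(0, 2, 3, 1))               # (B, H, W, 768)
    target = vfm_feat.permute(0, 2, 3, 1)
    return 1 - F.cosine_similarity(z_proj, target, dim=-1).mean()

loss = alignment_loss(torch.randn(2, 32, 16, 16), torch.randn(2, 768, 16, 16))
print(loss)
```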
Authors:Yoshitomo Matsubara, Matteo Mendula, Marco Levorato
Abstract:
Split computing ($\neq$ split learning) is a promising approach for deploying deep learning models in resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. The application of existing methods to multitask problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperformed or rivaled strong lightweight baseline models in terms of predictive performance on the ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end-to-end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.
中文: Ladon是首个用于分割计算的多任务监督压缩模型,它在多个数据集上提升预测性能的同时,大幅降低了移动设备的延迟和能耗。
English: Ladon is the first multi-task supervised compression model for split computing that enhances predictive performance on multiple datasets while significantly reducing latency and energy consumption on mobile devices.
Authors:Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
Abstract:
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model's prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
Chinese Summary: 嵌套注意力是一种创新机制,通过将丰富的图像表征注入交叉注意力层,在保持身份特征与提示对齐之间实现更优平衡,同时支持多主体图像合成。
English Summary: Nested Attention is a novel mechanism that enhances text-to-image personalization by injecting rich image representations into cross-attention layers, achieving superior balance between identity preservation and prompt alignment while enabling multi-subject composition.
Authors:Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng
Abstract:
Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and its performance can be further enhanced when combined with real program tools. We release our code at https://github.com/songxiaoshuai/progco.
中文: 大语言模型的自校正通过ProgCo得以增强,它利用自生成的验证伪程序来验证和优化响应,从而在复杂推理任务中提升性能。
English: Self-correction in large language models is enhanced by ProgCo, which uses self-generated verification pseudo-programs to verify and refine responses, improving performance in complex reasoning tasks.
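Illustrative sketch of a program-driven verify-then-refine loop in the spirit of ProgVe/ProgRe; `llm` is a hypothetical stand-in for a model call, and the prompts are placeholders rather than the paper's templates.

```python
# Minimal sketch of program-driven self-correction: generate an answer, ask the
# model to write and "run" a verification pseudo-program, and refine on FAIL.
def llm(prompt: str) -> str:
    # Stand-in for a real LLM client; replace with an actual API call.
    return "PASS: the answer satisfies all checks." if "verification" in prompt else "x = 42"

def progco_style_correction(task: str, max_rounds: int = 3) -> str:
    response = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        # ProgVe-style step: verification via a self-generated pseudo-program.
        verdict = llm(
            "Write a short verification pseudo-program for the task, then run it "
            f"step by step on the answer.\nTask: {task}\nAnswer: {response}\n"
            "End with PASS or FAIL and the feedback."
        )
        if "PASS" in verdict.upper():
            return response
        # ProgRe-style step: refine the answer while questioning the feedback itself.
        response = llm(
            f"Task: {task}\nPrevious answer: {response}\nVerifier feedback: {verdict}\n"
            "Revise the answer, and note whether the feedback itself could be wrong."
        )
    return response

print(progco_style_correction("Solve for x: 2x + 5 = 89"))
```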
Authors:Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao
Abstract:
Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., "salto backward tucked with 1 turn"). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model's predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.
Chinese: 本文提出SeFAR框架,通过双级时间元素和自适应调节的半监督学习方法,在细粒度动作识别任务中实现了最优性能,并有效提升了多模态基础模型的领域语义理解能力。
English: This paper introduces SeFAR, a semi-supervised learning framework designed for Fine-grained Action Recognition that incorporates dual-level temporal elements and adaptive regulation to achieve state-of-the-art performance on specialized datasets.
Authors:Zhiyao Wang, Xu Chen, Chengming Xu, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Chengjie Wang, Yuqi Liu, Yiyi Zhou, Rongrong Ji
Abstract:
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video BFR, inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed as stable video face restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage the shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce the facial prior learning and the self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration. Code and video demo are available at https://github.com/wangzhiyaoo/SVFR.git.
中文: 本文提出了一种新颖的通用视频人脸修复框架,通过整合修复与着色等多任务,利用稳定视频扩散先验和统一潜在正则化方法,显著提升了视频修复的时间一致性与质量。
English: This paper introduces a novel generalized video face restoration framework that integrates multiple tasks like inpainting and colorization using Stable Video Diffusion priors, enhancing temporal consistency and restoration quality through unified latent regularization and auxiliary strategies.
Authors:Amil Bhagat, Milind Jain, A. V. Subramanyam
Abstract:
Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at https://github.com/amilbhagat/Conditional-Consistency-Models.
中文: 条件一致性模型通过引入任务特定的条件输入,实现了跨多个领域的高质量单步图像翻译,并在10个不同数据集上验证了其有效性。
English: Conditional Consistency Models (CCMs) introduce task-specific conditional inputs to enable high-quality, single-step image translation across multiple domains, demonstrating effectiveness on 10 diverse datasets.
Authors:Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen
Abstract:
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
Chinese: MuQ模型采用梅尔残差向量量化进行自监督音乐表示学习,仅用少量预训练数据即在多项任务中表现卓越,并能通过数据扩展持续提升性能。
English: The MuQ model introduces a self-supervised music representation learning approach using Mel Residual Vector Quantization, achieving superior performance across multiple tasks with minimal pre-training data and scaling effectively to larger datasets.
Authors:Shuo Yu, Shan Jin, Ming Li, Tabinda Sarwar, Feng Xia
Abstract:
Understanding communication and information processing among brain regions of interest (ROIs) is highly dependent on long-range connectivity, which plays a crucial role in facilitating diverse functional neural integration across the entire brain. However, previous studies generally focused on the short-range dependencies within brain networks while neglecting the long-range dependencies, limiting an integrated understanding of brain-wide communication. To address this limitation, we propose Adaptive Long-range aware TransformER (ALTER), a brain graph transformer to capture long-range dependencies between brain ROIs utilizing biased random walks. Specifically, we present a novel long-range aware strategy to explicitly capture long-range dependencies between brain ROIs. By guiding the walker towards the next hop with a higher correlation value, our strategy simulates real-world brain-wide communication. Furthermore, by employing the transformer framework, ALTER adaptively integrates both short- and long-range dependencies between brain ROIs, enabling an integrated understanding of multi-level communication across the entire brain. Extensive experiments on ABIDE and ADNI datasets demonstrate that ALTER consistently outperforms generalized state-of-the-art graph learning methods (including SAN, Graphormer, GraphTrans, and LRGNN) and other graph learning based brain network analysis methods (including FBNETGEN, BrainNetGNN, BrainGNN, and BrainNETTF) in neurological disease diagnosis. Cases of long-range dependencies are also presented to further illustrate the effectiveness of ALTER. The implementation is available at https://github.com/yushuowiki/ALTER.
中文: 本研究提出ALTER,一种通过偏置随机游走捕捉大脑区域间长程依赖性的脑图转换器,在神经疾病诊断中优于现有方法。
English: This study introduces ALTER, a brain graph transformer that captures long-range dependencies between brain regions using biased random walks, outperforming existing methods in neurological disease diagnosis.
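Illustrative sketch of a correlation-biased random walk over ROIs, the mechanism the abstract describes for surfacing long-range dependencies; the random correlation matrix, number of ROIs, and walk length are toy assumptions.

```python
# Minimal sketch: at each step the walker prefers ROIs with higher correlation
# to its current position, so distant but strongly correlated regions are reachable.
import numpy as np

rng = np.random.default_rng(0)
n_roi = 8
corr = np.abs(rng.standard_normal((n_roi, n_roi)))
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 0.0)

def biased_walk(start: int, length: int) -> list:
    path = [start]
    for _ in range(length):
        weights = corr[path[-1]]
        probs = weights / weights.sum()   # higher correlation => more likely next hop
        path.append(int(rng.choice(n_roi, p=probs)))
    return path

print(biased_walk(start=0, length=6))
```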
Authors:Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu
Abstract:
Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction. Code is available at https://github.com/tufts-ml/G2PT.
Chinese: 本研究提出了图生成预训练变换器(G2PT),该自回归模型采用新颖的序列化图表示方法,在分子设计和性质预测等下游任务中展现出卓越的生成性能和强大的适应能力。
English: This work introduces the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that uses a novel sequence-based graph representation to achieve superior generative performance and strong adaptability in downstream tasks like molecular design and property prediction.
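Illustrative sketch of serializing a graph into a node-set/edge-set token sequence suitable for next-token prediction; the special tokens and ordering below are assumptions, not G2PT's actual vocabulary.

```python
# Minimal sketch of the "graph as flat token sequence" idea: list the nodes,
# then the edges, and train an autoregressive model to predict the sequence.
def graph_to_tokens(num_nodes: int, edges: list) -> list:
    tokens = ["<bos>", "<nodes>"]
    tokens += [f"n{i}" for i in range(num_nodes)]
    tokens.append("<edges>")
    for u, v in sorted(edges):
        tokens += [f"n{u}", f"n{v}", "<sep>"]
    tokens.append("<eos>")
    return tokens

# A 4-cycle; a transformer would be trained on this sequence with next-token prediction.
print(graph_to_tokens(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```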
Authors:Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir
Abstract:
Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
Chinese: 本研究提出了一种新颖的上下文多输入特征融合方法MultiGen,用于孟加拉语宗教新闻标题生成,通过整合情感、类别和方面特征,在BLEU和ROUGE-L得分上优于仅基于内容的方法。
English: This study introduces a novel contextual multi-input feature fusion approach, MultiGen, for Bengali religious news headline generation, which outperforms content-only methods by integrating sentiment, category, and aspect features, achieving superior BLEU and ROUGE-L scores.
Authors:Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
Abstract:
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction relative to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
Chinese: FlashInfer 是一个用于大语言模型服务的可定制高效注意力引擎,通过块稀疏 KV 缓存存储、即时编译和负载均衡调度优化内存访问与性能,在各种推理场景中显著降低了延迟。
English: FlashInfer is a customizable and efficient attention engine for LLM serving that optimizes memory access and performance through block-sparse KV-cache storage, JIT compilation, and load-balanced scheduling, achieving significant latency reductions across various inference scenarios.
Authors:Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
Abstract:
Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality \textbf{multimodal textbook} corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
中文摘要:本文提出了一种从教学视频中提取的高质量多模态教材语料库,为视觉语言模型提供了更丰富的知识基础和更好的图文对齐,显著提升了其在知识密集型和推理任务中的表现。
English Summary: This paper introduces a high-quality multimodal textbook corpus derived from instructional videos, which provides richer foundational knowledge and better image-text alignment for Vision-Language Models, significantly enhancing their performance in knowledge-intensive and reasoning tasks.
Authors:David Wu, Sanjiban Choudhury
Abstract:
Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey \emph{domain-agnostic} concepts that can be effectively captured by a reward model. We propose a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between source and target distributions, and a source loss that optimizes preferences on the source domain. We show that this is a general approach, which we evaluate and analyze across 4 distinct settings: (1) Cross-lingual transfer (accuracy: $0.621 \rightarrow 0.661$), (2) Clean-to-noisy (accuracy: $0.671 \rightarrow 0.703$), (3) Few-shot-to-full transfer (accuracy: $0.845 \rightarrow 0.920$), and (4) Simple-to-complex tasks transfer (correlation: $0.508 \rightarrow 0.556$). Our code, models and data are available at \url{https://github.com/portal-cornell/dial}.
Chinese: 本研究提出了一种训练领域不变奖励模型的框架,通过利用从较简单领域收集的反馈来在缺乏人类偏好数据的目标领域中调整大语言模型,并在多种设置下实现了性能提升。
English: This study introduces a framework for training domain-invariant reward models by using feedback from simpler domains to align LLMs in target domains where human preference data is unavailable, achieving improved performance across diverse settings.
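Illustrative sketch of a dual objective that combines a Bradley-Terry preference loss on the source domain with a divergence term pulling source and target reward features together; the linear-kernel MMD and the weighting coefficient are assumptions about the general recipe, not the paper's exact losses.

```python
# Minimal sketch of a source preference loss plus a domain-divergence penalty
# over reward-model features from source and target domains.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style: chosen responses should score higher than rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def mmd_linear(src_feat: torch.Tensor, tgt_feat: torch.Tensor) -> torch.Tensor:
    # Linear-kernel MMD: distance between domain feature means.
    return (src_feat.mean(0) - tgt_feat.mean(0)).pow(2).sum()

def dual_loss(reward_head, src_feat_chosen, src_feat_rejected, tgt_feat, lam=0.1):
    source = preference_loss(reward_head(src_feat_chosen).squeeze(-1),
                             reward_head(src_feat_rejected).squeeze(-1))
    domain = mmd_linear(torch.cat([src_feat_chosen, src_feat_rejected]), tgt_feat)
    return source + lam * domain

reward_head = torch.nn.Linear(16, 1)
loss = dual_loss(reward_head,
                 torch.randn(32, 16), torch.randn(32, 16), torch.randn(32, 16))
loss.backward()
```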
Authors:Mingjia Li, Shuang Li, Tongrui Su, Longhui Yuan, Jian Liang, Wei Li
Abstract:
Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at https://github.com/BIT-DA/DUSA.
Chinese Summary: 本研究提出DUSA方法,利用基于分数的生成模型中的语义先验,仅通过单个扩散步骤即可实现判别模型的高效测试时适应。
English Summary: This research introduces DUSA, a method that leverages the semantic priors in score-based generative models to enable efficient test-time adaptation of discriminative models using just a single diffusion timestep.
Authors:Nicholas Magal, Minh Tran, Riku Arakawa, Suzanne Nie
Abstract:
This paper aims to document an effective way to improve multimodal co-learning by using aggressive modality dropout. We find that by using aggressive modality dropout we are able to reverse negative co-learning (NCL) to positive co-learning (PCL). Aggressive modality dropout can be used to "prep" a multimodal model for unimodal deployment, and it dramatically increases model performance during negative co-learning, where in some experiments we saw a 20% gain in accuracy. We also benchmark our modality dropout technique against PCL to show that it improves co-learning during PCL as well, although the effect is not as substantial as during NCL. Github: https://github.com/nmagal/modality_drop_for_colearning
中文摘要:本文通过采用激进的模态丢弃方法,成功将负协同学习转化为正协同学习,使模型准确率最高提升20%,并为单模态部署做好了准备。
English Summary: This paper demonstrates that aggressive modality dropout effectively reverses negative co-learning into positive co-learning, significantly boosting model accuracy by up to 20% and preparing models for unimodal deployment.
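Illustrative sketch of aggressive modality dropout applied before fusion; the 0.8 drop rate and concatenation-based fusion are assumptions chosen for clarity, not the paper's exact configuration.

```python
# Minimal sketch: with high probability, zero out one entire modality during
# training so the fused model learns not to over-rely on any single input.
import torch

def modality_dropout(audio: torch.Tensor, video: torch.Tensor,
                     p_drop: float = 0.8, training: bool = True) -> torch.Tensor:
    if training and torch.rand(()) < p_drop:
        if torch.rand(()) < 0.5:
            audio = torch.zeros_like(audio)
        else:
            video = torch.zeros_like(video)
    return torch.cat([audio, video], dim=-1)

fused = modality_dropout(torch.randn(4, 128), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 384])
```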
Authors:Van Quang Nguyen, Quoc Chuong Nguyen, Thu Huong Dang, Truong-Son Hy
Abstract:
The Hierarchical Directed Capacitated Arc Routing Problem (HDCARP) is an extension of the Capacitated Arc Routing Problem (CARP), where the arcs of a graph are divided into classes based on their priority. The traversal of these classes is determined by either precedence constraints or a hierarchical objective, resulting in two distinct HDCARP variants. To the best of our knowledge, only one matheuristic has been proposed for these variants, but it performs relatively slowly, particularly for large-scale instances (Ha et al., 2024). In this paper, we propose a fast heuristic to efficiently address the computational challenges of HDCARP. Furthermore, we incorporate Reinforcement Learning (RL) into our heuristic to effectively guide the selection of local search operators, resulting in a hybrid algorithm. We name this hybrid algorithm the Hybrid Reinforcement Learning and Heuristic Algorithm for Directed Arc Routing (HRDA). The hybrid algorithm adapts dynamically to changes in the problem, using real-time feedback to improve routing strategies and solution quality. Extensive computational experiments on artificial instances demonstrate that this hybrid approach significantly improves the speed of the heuristic without deteriorating the solution quality. Our source code is publicly available at: https://github.com/HySonLab/ArcRoute
Chinese: 本文提出HRDA混合算法,将强化学习与启发式方法相结合,有效解决分层有向容量弧路径问题,在保持解质量的同时显著提升了计算速度。
English: This paper introduces HRDA, a hybrid algorithm combining reinforcement learning with heuristics to efficiently solve the Hierarchical Directed Capacitated Arc Routing Problem, significantly improving computational speed while maintaining solution quality.
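Illustrative sketch of using a simple reinforcement-learning rule (an epsilon-greedy bandit) to pick local-search operators by their observed cost reduction; the operator names, reward definition, and placeholder local search are assumptions, not the paper's exact design.

```python
# Minimal sketch: reward each operator by the cost reduction it produced and
# choose operators epsilon-greedily, so useful moves are selected more often.
import random

operators = ["relocate", "swap", "2-opt"]
q = {op: 0.0 for op in operators}
counts = {op: 0 for op in operators}

def improve(solution_cost: float, op: str) -> float:
    # Placeholder local search: a random improvement, biased per operator.
    return solution_cost - random.random() * {"relocate": 1.0, "swap": 0.5, "2-opt": 1.5}[op]

cost, eps = 100.0, 0.2
for _ in range(200):
    op = random.choice(operators) if random.random() < eps else max(q, key=q.get)
    new_cost = improve(cost, op)
    reward = cost - new_cost              # cost reduction as the reward signal
    counts[op] += 1
    q[op] += (reward - q[op]) / counts[op]
    cost = new_cost

print({k: round(v, 3) for k, v in q.items()})
```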
Authors:Mengran Li, Chaojun Ding, Junzhou Chen, Wenbin Xing, Cong Ye, Ronghui Zhang, Songlin Zhuang, Jia Hu, Tony Z. Qiu, Huijun Gao
Abstract:
Missing attribute issues are prevalent in graph learning, leading to biased outcomes in Graph Neural Networks (GNNs). Existing methods that rely on feature propagation are prone to the cold start problem, particularly when dealing with attribute resetting and low-degree nodes, which hinder effective propagation and convergence. To address these challenges, we propose AttriReBoost (ARB), a novel method that incorporates a propagation-based method to mitigate cold start problems in attribute-missing graphs. ARB enhances global feature propagation by redefining initial boundary conditions and strategically integrating virtual edges, thereby improving node connectivity and ensuring more stable and efficient convergence. This method facilitates gradient-free attribute reconstruction with lower computational overhead. The proposed method is theoretically grounded, with its convergence rigorously established. Extensive experiments on several real-world benchmark datasets demonstrate the effectiveness of ARB, achieving an average accuracy improvement of 5.11% over state-of-the-art methods. Additionally, ARB exhibits remarkable computational efficiency, processing a large-scale graph with 2.49 million nodes in just 16 seconds on a single GPU. Our code is available at https://github.com/limengran98/ARB.
中文: 提出的AttriReBoost方法通过重新定义边界条件和策略性添加虚拟边来增强全局特征传播,有效解决了属性缺失图中的冷启动问题,在显著提升精度的同时保持了优异的计算效率。
English: The proposed AttriReBoost method effectively addresses cold start problems in attribute-missing graphs by enhancing global feature propagation through redefined boundary conditions and virtual edges, achieving significant accuracy improvements and computational efficiency.
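Illustrative sketch of gradient-free attribute propagation with reset boundary conditions and a virtual node connected to every real node, echoing the connectivity idea behind ARB's virtual edges; the toy graph, features, and iteration count are assumptions.

```python
# Minimal sketch: iterate normalized propagation, resetting the known attributes
# each step; a virtual node keeps low-degree components connected.
import numpy as np

n = 6
A = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (2, 3), (4, 5)]:   # nodes 4-5 form a weakly connected pair
    A[u, v] = A[v, u] = 1.0

# Add a virtual node connected to all real nodes (virtual edges).
A_virt = np.zeros((n + 1, n + 1))
A_virt[:n, :n] = A
A_virt[n, :n] = A_virt[:n, n] = 1.0

X = np.zeros((n + 1, 2))
known = np.array([True, False, True, False, True, False, False])
X[0], X[2], X[4] = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

D_inv = np.diag(1.0 / A_virt.sum(1))
for _ in range(20):
    X = D_inv @ A_virt @ X
    X[known] = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # reset boundary conditions

print(np.round(X[:n], 3))   # reconstructed attributes for the real nodes
```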
Authors:Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
Abstract:
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con\textbf{T}extualized equivari\textbf{A}nt \textbf{P}osition \textbf{E}ncoding (\textbf{TAPE}), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
中文: 提出的TAPE框架通过在各层融入序列内容,引入动态、上下文感知的位置编码,增强了Transformer的长上下文建模和推理能力,同时利用等变性确保稳定性,从而提升整体性能。
English: The proposed TAPE framework introduces dynamic, context-aware positional encodings that enhance transformer performance by incorporating sequence content across layers, improving long-context modeling and reasoning abilities while ensuring stability through equivariance properties.
Authors:Chethan Bhateja, Joseph O'Brien, Afnaan Hashmi, Eva Prakash
Abstract:
In machine learning, metric elicitation refers to the selection of performance metrics that best reflect an individual's implicit preferences for a given application. Currently, metric elicitation methods only consider metrics that depend on the accuracy values encoded within a given model's confusion matrix. However, focusing solely on confusion matrices does not account for other model feasibility considerations such as varied monetary costs or latencies. In our work, we build upon the multiclass metric elicitation framework of Hiranandani et al., extrapolating their proposed Diagonal Linear Performance Metric Elicitation (DLPME) algorithm to account for additional bounded costs and rewards. Our experimental results with synthetic data demonstrate our approach's ability to quickly converge to the true metric.
中文摘要:本研究在多分类度量诱导框架基础上,引入超出混淆矩阵准确率的额外有界成本和奖励因素,通过合成数据实验证明所提方法能有效收敛至真实性能度量。
English Summary: This study extends the multiclass metric elicitation framework by incorporating additional bounded costs and rewards beyond confusion matrix accuracy, demonstrating through synthetic data experiments that the proposed approach efficiently converges to the true performance metric.
Authors:Md Rakibul Hasan, Yue Yao, Md Zakir Hossain, Aneesh Krishna, Imre Rudas, Shafin Rahman, Tom Gedeon
Abstract:
Large language models (LLMs) have revolutionised many fields, with LLM-as-a-service (LLMSaaS) offering accessible, general-purpose solutions without costly task-specific training. In contrast to the widely studied prompt engineering for directly solving tasks (in vivo), this paper explores LLMs' potential for in-vitro applications: using LLM-generated labels to improve supervised training of mainstream models. We examine two strategies - (1) noisy label correction and (2) training data augmentation - in empathy computing, an emerging task to predict psychology-based questionnaire outcomes from inputs like textual narratives. Crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. We show that replacing or supplementing these crowdsourced labels with LLM-generated labels, developed using psychology-based scale-aware prompts, achieves statistically significant accuracy improvements. Notably, the RoBERTa pre-trained language model (PLM) trained with noise-reduced labels yields a state-of-the-art Pearson correlation coefficient of 0.648 on the public NewsEmp benchmarks. This paper further analyses evaluation metric selection and demographic biases to help guide the future development of more equitable empathy computing models. Code and LLM-generated labels are available at https://github.com/hasan-rakibul/LLMPathy.
中文: 本研究证明,利用大语言模型生成的标签进行噪声标签校正和数据增强,显著提升了共情计算中监督模型的准确性,在基准数据集上达到了最先进的性能水平。
English: This study demonstrates that using large language model-generated labels for noisy label correction and data augmentation significantly improves the accuracy of supervised models in empathy computing, achieving state-of-the-art performance on benchmark datasets.
Authors:Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li
Abstract:
Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then discovered that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, e.g., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and unlocks SSMs to benefit further from deeper architectures. All source codes are released at https://github.com/VITA-Group/SSM-Bottleneck.
中文摘要:结构化状态空间模型(SSM)存在固有的近期偏好和深度引发的过度平滑问题,但通过极化状态转移矩阵的新方法可有效缓解这些限制,同时提升长距离关联回忆能力。
English Summary: Structured State Space Models (SSMs) suffer from inherent recency bias and depth-induced over-smoothing, but a novel polarization technique for state transition matrices effectively mitigates both limitations while enhancing long-range associative recall.
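Illustrative sketch of the polarization idea on a diagonal linear SSM, h_t = a * h_{t-1} + b * x_t, where one state-transition channel is pinned to zero (purely local, counters over-smoothing) and one to one (a running sum that preserves distant tokens, counters recency bias); the recurrence, sizes, and readout are toy assumptions.

```python
# Minimal sketch: a diagonal SSM scan with two "polarized" channels, a[0] = 0 and a[1] = 1.
import torch

d_state, T = 8, 16
a = torch.sigmoid(torch.randn(d_state))   # decays in (0, 1)
a[0], a[1] = 0.0, 1.0                     # polarized channels: "zero" and "one"
b = torch.randn(d_state)

x = torch.randn(T)
h = torch.zeros(d_state)
states = []
for t in range(T):
    h = a * h + b * x[t]
    states.append(h.clone())

states = torch.stack(states)              # (T, d_state)
print(states[-1, :2])                     # channel 0 sees only x[T-1]; channel 1 sums all x
```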
Authors:Kim Sung-Bin, Kim Jun-Seong, Junseok Ko, Yewon Kim, Tae-Hyun Oh
Abstract:
We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at https://soundbrush.github.io/.
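中文: SoundBrush以声音为画笔,将音频信息引入潜在扩散模型来编辑视觉场景,能在保留原始内容的同时插入发声物体或改变整体景观,并可结合新视角合成扩展到三维场景编辑。
English: SoundBrush extends the Latent Diffusion Model with audio conditioning to edit visual scenes guided by sound, inserting sounding objects or reshaping the scenery while preserving the original content, and can be combined with novel view synthesis for 3D scene editing.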
Authors:Yuchuan Tian, Jing Han, Chengcheng Wang, Yuchen Liang, Chao Xu, Hanting Chen
Abstract:
Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 convolution, to construct a scaled-up, purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but it still falls short of our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: https://github.com/YuchuanTian/DiC
中文摘要:提出的扩散卷积网络(DiC)通过采用优化的卷积结构、稀疏连接和改进的条件机制,在保持速度优势的同时,其性能显著超越了现有的扩散变换器模型。
English Summary: The proposed Diffusion CNN (DiC) replaces complex transformer architectures with optimized convolutional networks featuring sparse connections and enhanced conditioning, achieving superior performance and faster inference than existing diffusion models.
Authors:Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, Boris Ivanovic, Yue Wang, Marco Pavone
Abstract:
We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations--parameterized by 3D Gaussians and their velocities--in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.
中文: STORM是一种创新的时空重建模型,通过Transformer架构从稀疏观测中单次前向推断动态户外场景,在重建质量和效率上均优于现有方法。
English: STORM is a novel spatio-temporal reconstruction model that uses a Transformer architecture to reconstruct dynamic outdoor scenes from sparse observations in a single forward pass, achieving superior performance over existing methods in both reconstruction quality and efficiency.
Authors:Madeleine Darbyshire, Elizabeth Sklar, Simon Parsons
Abstract:
Precision agriculture leverages data and machine learning so that farmers can monitor their crops and target interventions precisely. This enables the precision application of herbicide only to weeds, or the precision application of fertilizer only to undernourished crops, rather than to the entire field. The approach promises to maximize yields while minimizing resource use and harm to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an identifier of plant growth) and locates weeds within an image. In particular, our approach aims to improve the segmentation of smaller instances like the leaves and weeds by incorporating focal loss and boundary loss. Not only does this result in competitive performance, achieving a PQ+ of 81.89 on the standard training set, but we also demonstrate we can improve leaf-counting accuracy with our method. The code is available at https://github.com/madeleinedarbyshire/HierarchicalMask2Former.
Chinese: 精准农业利用机器学习实现针对性作物监测与干预,通过结合焦点损失和边界损失的层次全景分割方法,提升了叶片计数和杂草检测的准确性,取得了优异性能。
English: Precision agriculture employs machine learning for targeted crop monitoring and intervention, using a hierarchical panoptic segmentation method with focal and boundary loss to enhance leaf counting and weed detection, achieving high performance and improved accuracy.
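Illustrative sketch of a standard focal loss of the kind used to emphasize small, hard instances such as leaves and weeds; the gamma and alpha values are common defaults rather than the paper's settings, and the boundary loss is omitted for brevity.

```python
# Minimal sketch: focal loss down-weights easy pixels so small instances
# contribute more to the gradient than abundant, easily classified background.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                        # probability of the true class
    return (alpha * (1 - pt) ** gamma * ce).mean()

logits = torch.randn(4, 3, 32, 32)             # (batch, classes, H, W)
targets = torch.randint(0, 3, (4, 32, 32))
print(focal_loss(logits, targets))
```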
Authors:Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Abstract:
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion model presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most of encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
中文摘要:提出的PCM-Net通过逐块跨模态特征混合机制,在零样本图像描述任务中自适应融合文本特征以修正合成图像的语义偏差,在基准数据集上实现了最优性能。
English Summary: The proposed PCM-Net introduces a patch-wise cross-modal feature mix-up mechanism to address semantic misalignment in zero-shot image captioning by adaptively refining synthetic image features with textual concepts, achieving state-of-the-art performance on benchmark datasets.
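Illustrative sketch of a CLIP-weighted cross-entropy in which each synthetic image-caption pair's token loss is scaled by its CLIP image-text similarity, so higher-quality pairs dominate training; the exact weighting form is an assumption.

```python
# Minimal sketch: per-sample caption loss scaled by a normalized CLIP similarity weight.
import torch
import torch.nn.functional as F

def clip_weighted_ce(logits: torch.Tensor, targets: torch.Tensor,
                     clip_sim: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, V) token logits, targets: (B, T), clip_sim: (B,) similarities in [0, 1]
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    weights = clip_sim / clip_sim.mean().clamp_min(1e-8)                     # mean-normalized weights
    return (weights.unsqueeze(1) * ce).mean()

loss = clip_weighted_ce(torch.randn(4, 12, 1000),
                        torch.randint(0, 1000, (4, 12)),
                        torch.rand(4))
print(loss)
```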
Authors:Fangchen Yu, Ruilizhen Hu, Yidong Lin, Yuqi Ma, Zhenghao Huang, Wenye Li
Abstract:
The Kolmogorov-Arnold Network (KAN) has recently gained attention as an alternative to traditional multi-layer perceptrons (MLPs), offering improved accuracy and interpretability by employing learnable activation functions on edges. In this paper, we introduce the Kolmogorov-Arnold Auto-Encoder (KAE), which integrates KAN with autoencoders (AEs) to enhance representation learning for retrieval, classification, and denoising tasks. Leveraging the flexible polynomial functions in KAN layers, KAE captures complex data patterns and non-linear relationships. Experiments on benchmark datasets demonstrate that KAE improves latent representation quality, reduces reconstruction errors, and achieves superior performance in downstream tasks such as retrieval, classification, and denoising, compared to standard autoencoders and other KAN variants. These results suggest KAE's potential as a useful tool for representation learning. Our code is available at \url{https://github.com/SciYu/KAE/}.
Chinese: Kolmogorov-Arnold自编码器(KAE)将KAN的可学习激活函数与自编码器结合,提升了表示学习能力,在检索、分类和去噪任务中表现优于传统方法。
English: The Kolmogorov-Arnold Auto-Encoder (KAE) combines KAN's learnable activation functions with autoencoders to enhance representation learning, achieving superior performance in retrieval, classification, and denoising tasks compared to standard methods.
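Illustrative sketch of a KAN-style layer with a learnable polynomial on every edge, stacked into a tiny autoencoder; the degree-3 polynomials and layer sizes are assumptions, not KAE's exact architecture.

```python
# Minimal sketch: each edge (i -> j) applies a learnable polynomial to its input
# and the outputs are summed, replacing the fixed activations of an MLP layer.
import torch
import torch.nn as nn

class PolyEdgeLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, degree: int = 3):
        super().__init__()
        # One coefficient per (output, input, power): y_j = sum_i sum_k c[j, i, k] * x_i**k
        self.coeff = nn.Parameter(0.01 * torch.randn(d_out, d_in, degree + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        powers = torch.stack([x ** k for k in range(self.coeff.shape[-1])], dim=-1)  # (B, d_in, K)
        return torch.einsum("bik,oik->bo", powers, self.coeff)

encoder = nn.Sequential(PolyEdgeLayer(784, 64), PolyEdgeLayer(64, 16))
decoder = nn.Sequential(PolyEdgeLayer(16, 64), PolyEdgeLayer(64, 784))
x = torch.rand(8, 784)
recon = decoder(encoder(x))
loss = nn.functional.mse_loss(recon, x)
loss.backward()
print(loss.item())
```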
Authors:Wenhao Dong, Yueyang Li, Weiming Zeng, Lei Chen, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
Abstract:
Many existing methods that use functional magnetic resonance imaging (fMRI) to classify brain disorders, such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD), often overlook the integration of spatial and temporal dependencies of the blood oxygen level-dependent (BOLD) signals, which may lead to inaccurate or imprecise classification results. To solve this problem, we propose a Spatio-Temporal Aggregation Reorganization Transformer (STARFormer) that effectively captures both spatial and temporal features of BOLD signals by incorporating three key modules. The region of interest (ROI) spatial structure analysis module uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity, highlighting critical spatial relationships relevant to the brain disorder. The temporal feature reorganization module systematically segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention. The spatio-temporal feature fusion module employs a parallel transformer architecture with dedicated temporal and spatial branches to extract integrated features. The proposed STARFormer has been rigorously evaluated on two publicly available datasets for the classification of ASD and ADHD. The experimental results confirm that the STARFormer achieves state-of-the-art performance across multiple evaluation metrics, providing a more accurate and reliable tool for the diagnosis of brain disorders and biomedical research. The codes are available at: https://github.com/NZWANG/STARFormer.
中文:提出的STARFormer模型通过专门模块有效整合fMRI BOLD信号的空间与时间特征,在自闭症和多动症分类任务中取得最优性能,同时解决了以往方法在捕捉这些依赖关系方面的不足。
English: The proposed STARFormer model effectively integrates spatial and temporal features of fMRI BOLD signals through specialized modules, achieving state-of-the-art performance in classifying autism and ADHD while addressing previous methods' limitations in capturing these dependencies.
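Illustrative sketch of reordering ROIs by eigenvector centrality of a connectivity matrix, the reorganization step the abstract describes; the toy connectivity matrix here is random rather than derived from fMRI.

```python
# Minimal sketch: compute the leading eigenvector of a symmetric connectivity
# matrix and reorder ROIs so the most central regions come first.
import numpy as np

rng = np.random.default_rng(1)
n_roi = 10
conn = np.abs(rng.standard_normal((n_roi, n_roi)))
conn = (conn + conn.T) / 2

eigvals, eigvecs = np.linalg.eigh(conn)
centrality = np.abs(eigvecs[:, -1])          # eigenvector for the largest eigenvalue
order = np.argsort(-centrality)              # most central ROIs first

reordered_conn = conn[np.ix_(order, order)]
print("ROI order by eigenvector centrality:", order)
```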
Authors:Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they only cover limited RAG scenarios. (2) They suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
中文摘要:RAG-Instruct是一种基于任意语料库、通过五种RAG范式和指令模拟技术合成多样化高质量指令数据的新方法,能显著提升大语言模型在多种任务中的检索增强生成性能。
English Summary: RAG-Instruct is a novel method that synthesizes diverse, high-quality instruction data from any corpus using five RAG paradigms and instruction simulation, effectively enhancing LLMs' retrieval-augmented generation capabilities across various tasks.
Authors:Runnan Chen, Xiangyu Sun, Zhaoqing Wang, Youquan Liu, Jiepeng Wang, Lingdong Kong, Jiankang Deng, Mingming Gong, Liang Pan, Wenping Wang, Tongliang Liu
Abstract:
Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting open-vocabulary querying to their training scenes and thus lacking generalizability to novel scenes. In this work, we propose \textbf{OVGaussian}, a generalizable \textbf{O}pen-\textbf{V}ocabulary 3D semantic segmentation framework based on the 3D \textbf{Gaussian} representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed \textbf{SegGaussian}, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).
中文: OVGaussian 是一种创新的开放式词汇三维语义分割框架,通过大规模标注数据集和跨模态学习,实现了跨场景和跨领域的强大泛化能力。
English: OVGaussian is a novel open-vocabulary 3D semantic segmentation framework that leverages a large-scale annotated dataset and cross-modal learning to achieve robust generalization across diverse scenes and domains.
Authors:Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, Jiliang Tang
Abstract:
Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in numerous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph-RAG/GraphRAG/.
Chinese: GraphRAG通过利用图结构数据增强检索生成能力,针对不同领域的独特挑战提出了系统性框架和定制化技术解决方案。
English: GraphRAG enhances retrieval-augmented generation by leveraging graph-structured data, addressing unique challenges across domains through a systematic framework and tailored techniques.
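The survey's holistic framework can be read as a pipeline over five components: query processor, retriever, organizer, generator, and data source. The following Python sketch wires them together with protocol interfaces; the class and method names are illustrative, not an API defined by the survey.

```python
from dataclasses import dataclass
from typing import Any, Protocol

# Interfaces mirroring the survey's five GraphRAG components; concrete
# implementations are application-specific and not prescribed by the survey.

class QueryProcessor(Protocol):
    def process(self, query: str) -> Any: ...        # e.g. entity linking, decomposition

class GraphRetriever(Protocol):
    def retrieve(self, processed_query: Any, graph: Any) -> list[Any]: ...

class Organizer(Protocol):
    def organize(self, subgraphs: list[Any]) -> str: ...  # e.g. linearize subgraphs to text

class Generator(Protocol):
    def generate(self, query: str, context: str) -> str: ...

@dataclass
class GraphRAGPipeline:
    data_source: Any            # the external graph
    query_processor: QueryProcessor
    retriever: GraphRetriever
    organizer: Organizer
    generator: Generator

    def answer(self, query: str) -> str:
        processed = self.query_processor.process(query)
        subgraphs = self.retriever.retrieve(processed, self.data_source)
        context = self.organizer.organize(subgraphs)
        return self.generator.generate(query, context)
```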
Authors:Shi-Feng Peng, Guolei Sun, Yong Li, Hongsong Wang, Guo-Sen Xie
Abstract:
The primary challenge of cross-domain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, visual prompts representing the same semantic object may have inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parametric adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results. Code: https://github.com/CVL-hub/GPRN.
中文: 提出的SAM感知图提示推理网络(GPRN)利用大规模视觉模型SAM的泛化能力,通过将SAM生成的掩码转化为语义提示并借助图推理确保全局一致性,有效解决了跨域少样本分割中的领域差异问题,在多个数据集上取得了最优性能。
English: The proposed SAM-aware graph prompt reasoning network (GPRN) leverages the generalizability of the large-scale visual model SAM to address domain disparity in cross-domain few-shot segmentation by converting SAM masks into semantic prompts and ensuring global consistency through graph reasoning, achieving state-of-the-art results on multiple datasets.
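A single aggregation step of the graph prompt reasoning idea might look like the sketch below: build a similarity graph over SAM-derived prompt embeddings and let each prompt aggregate information from similar ones. This is a simplified one-step version for intuition only; the temperature and residual update are assumptions, not the GPR module's exact design.

```python
import torch
import torch.nn.functional as F

def graph_prompt_reasoning(prompts: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """One aggregation round over SAM-derived visual prompt embeddings.

    prompts: (P, D) embeddings, one per SAM sub-region mask.
    """
    # Pairwise cosine similarity defines the edges of the prompt graph.
    sim = F.cosine_similarity(prompts.unsqueeze(1), prompts.unsqueeze(0), dim=-1)  # (P, P)
    adj = torch.softmax(sim / temperature, dim=-1)   # row-normalized edge weights
    aggregated = adj @ prompts                       # aggregate from similar prompts
    return prompts + aggregated                      # residual update keeps each prompt's identity
```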
Authors:Rajat Talak, Charis Georgiou, Jingnan Shi, Luca Carlone
Abstract:
Robust training of machine learning models in the presence of outliers has garnered attention across various domains. The use of robust losses is a popular approach and is known to mitigate the impact of outliers. We bring to light two literatures that have diverged in their ways of designing robust losses: one using M-estimation, which is popular in robotics and computer vision, and another using a risk-minimization framework, which is popular in deep learning. We first show that a simple modification of the Black-Rangarajan duality provides a unifying view. The modified duality brings out a definition of a robust loss kernel $\rho$ that is satisfied by robust losses in both the literatures. Secondly, using the modified duality, we propose an Adaptive Alternation Algorithm (AAA) for training machine learning models with outliers. The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each iteration. The algorithm is augmented with a novel parameter update rule by interpreting the weights as inlier probabilities, and obviates the need for complex parameter tuning. Thirdly, we investigate convergence of the adaptive alternation algorithm to outlier-free optima. Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels $\rho$ increases the region of convergence. We experimentally show the efficacy of our algorithm on regression, classification, and neural scene reconstruction problems. We release our implementation code: https://github.com/MIT-SPARK/ORT.
中文摘要:本文通过改进的对偶框架提出了鲁棒损失设计的统一视角,并开发了一种自适应交替算法,通过将权重作为内点概率迭代更新来增强含异常值时的模型训练,在多种任务中提升了收敛性和性能。
English Summary: This paper introduces a unified view of robust loss design through a modified duality framework and proposes an Adaptive Alternation Algorithm that enhances model training with outliers by iteratively updating weights as inlier probabilities, improving convergence and performance across various tasks.
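The alternation scheme lends itself to a compact sketch: take gradient steps on a weighted non-robust loss, then recompute the weights from the residuals. The Geman-McClure-style weight used below is one common robust kernel chosen purely for illustration; the paper's actual rule interprets the weights as inlier probabilities and includes a parameter update that is omitted here.

```python
import torch

def adaptive_alternation_fit(model, x, y, num_outer=10, num_inner=100,
                             lr=1e-2, c=1.0):
    """Sketch of weighted alternation for robust regression (illustrative only)."""
    weights = torch.ones(len(x))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_outer):
        # (a) train on the weighted non-robust (squared) loss
        for _ in range(num_inner):
            opt.zero_grad()
            residuals = (model(x).squeeze(-1) - y) ** 2
            loss = (weights * residuals).mean()
            loss.backward()
            opt.step()
        # (b) update weights from residuals; Geman-McClure-style kernel as an example
        with torch.no_grad():
            r2 = (model(x).squeeze(-1) - y) ** 2
            weights = c**2 / (c**2 + r2)
    return model
```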
Authors:Duo Zhou, Christopher Brix, Grani A Hanasusanto, Huan Zhang
Abstract:
Recently, cutting-plane methods such as GCP-CROWN have been explored to enhance neural network verifiers and made significant advances. However, GCP-CROWN currently relies on generic cutting planes (cuts) generated from external mixed integer programming (MIP) solvers. Due to the poor scalability of MIP solvers, large neural networks cannot benefit from these cutting planes. In this paper, we exploit the structure of the neural network verification problem to generate efficient and scalable cutting planes specific for this problem setting. We propose a novel approach, Branch-and-bound Inferred Cuts with COnstraint Strengthening (BICCOS), which leverages the logical relationships of neurons within verified subproblems in the branch-and-bound search tree, and we introduce cuts that preclude these relationships in other subproblems. We develop a mechanism that assigns influence scores to neurons in each path to allow the strengthening of these cuts. Furthermore, we design a multi-tree search technique to identify more cuts, effectively narrowing the search space and accelerating the BaB algorithm. Our results demonstrate that BICCOS can generate hundreds of useful cuts during the branch-and-bound process and consistently increase the number of verifiable instances compared to other state-of-the-art neural network verifiers on a wide range of benchmarks, including large networks that previous cutting plane methods could not scale to. BICCOS is part of the $\alpha,\beta$-CROWN verifier, the VNN-COMP 2024 winner. The code is available at http://github.com/Lemutisme/BICCOS.
中文:BICCOS通过利用神经网络验证问题的结构生成专用割平面,显著提升了大规模网络的可验证性,超越了以往方法的局限。
English: BICCOS introduces specialized cutting planes derived from neural network verification's structure, enhancing scalability and verification rates for large networks beyond previous methods' limits.
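The core idea of inferring cuts from verified subproblems can be illustrated with a "no-good"-style constraint over the ReLU split decisions that defined the subproblem; the helper below is only a schematic of this inference and omits BICCOS's constraint strengthening via influence scores and its multi-tree search.

```python
def inferred_cut(verified_splits: list[tuple[int, bool]]) -> tuple[dict[int, int], int]:
    """Build a 'no-good'-style cut from a verified branch-and-bound subproblem.

    verified_splits: (neuron_id, is_active) split decisions defining a subproblem
    that was verified. The cut, over binary activation indicators z_i, says at
    least one decision must differ in other subproblems:
        sum_{active i} (1 - z_i) + sum_{inactive i} z_i >= 1.
    Returned as (coeffs, rhs) meaning sum_i coeffs[i] * z_i >= rhs, with the
    constant terms folded into rhs. Schematic only, not the BICCOS implementation.
    """
    coeffs: dict[int, int] = {}
    rhs = 1
    for neuron_id, is_active in verified_splits:
        if is_active:
            coeffs[neuron_id] = -1   # (1 - z_i) term: move the constant 1 to the rhs
            rhs -= 1
        else:
            coeffs[neuron_id] = 1    # z_i term
    return coeffs, rhs
```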
Authors:Zhengqi Xu, Han Zheng, Jie Song, Li Sun, Mingli Song
Abstract:
Model merging has attracted significant attention as a powerful paradigm for model reuse, facilitating the integration of task-specific models into a singular, versatile framework endowed with multifarious capabilities. Previous studies, predominantly utilizing methods such as Weight Average (WA), have shown that model merging can effectively leverage pretrained models without the need for laborious retraining. However, the inherent heterogeneity among models poses a substantial constraint on its applicability, particularly when confronted with discrepancies in model architectures. To overcome this challenge, we propose an innovative model merging framework designed for heterogeneous models, encompassing both depth and width heterogeneity. To address depth heterogeneity, we introduce a layer alignment strategy that harmonizes model layers by segmenting deeper models, treating consecutive layers with similar representations as a cohesive segment, thus enabling the seamless merging of models with differing layer depths. For width heterogeneity, we propose a novel elastic neuron zipping algorithm that projects the weights from models of varying widths onto a common dimensional space, eliminating the need for identical widths. Extensive experiments validate the efficacy of these proposed methods, demonstrating that the merging of structurally heterogeneous models can achieve performance levels comparable to those of homogeneous merging, across both vision and NLP tasks. Our code is publicly available at https://github.com/zju-vipa/training_free_heterogeneous_model_merging.
Chinese: 本研究提出了一种创新的模型融合框架,通过层对齐策略和弹性神经元压缩算法解决深度和宽度异质性问题,实现了结构异构模型的有效整合且性能无损。
English: This study introduces a novel model merging framework that addresses depth and width heterogeneity through layer alignment and an elastic neuron zipping algorithm, enabling effective integration of structurally diverse models without performance loss.
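For the depth-heterogeneity part, a simple heuristic in the spirit of the layer alignment strategy is to cut the deeper model at the boundaries where consecutive layers' representations are least similar, yielding contiguous segments to merge against the shallower model's layers. The similarity metric and cutting rule below are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def segment_layers(layer_feats: list[torch.Tensor], num_segments: int) -> list[list[int]]:
    """Group a deep model's layers into contiguous segments for depth alignment.

    layer_feats: per-layer representations on probe data, each of shape (N, D_l).
    """
    # Similarity between each pair of consecutive layers (truncate to common width).
    sims = []
    for a, b in zip(layer_feats[:-1], layer_feats[1:]):
        d = min(a.shape[1], b.shape[1])
        sims.append(torch.nn.functional.cosine_similarity(
            a[:, :d], b[:, :d], dim=1).mean().item())
    # Cut at the (num_segments - 1) weakest boundaries.
    cuts = sorted(sorted(range(len(sims)), key=lambda i: sims[i])[:num_segments - 1])
    segments, start = [], 0
    for c in cuts:
        segments.append(list(range(start, c + 1)))
        start = c + 1
    segments.append(list(range(start, len(layer_feats))))
    return segments
```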
Authors:Witold Wydmański, Ulvi Movsum-zada, Jacek Tabor, Marek Śmieja
Abstract:
Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet -- a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings acceptable by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (less than 1k samples) demonstrate VisTabNet's superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning. We share our example implementation as a GitHub repository available at https://github.com/wwydmanski/VisTabNet.
中文: VisTabNet提出了一种跨模态迁移学习方法,通过将表格数据转换为适合预训练视觉Transformer处理的嵌入表示,在小样本表格数据集上超越了传统方法,拓展了迁移学习的应用边界。
English: VisTabNet introduces a cross-modal transfer learning approach that adapts pre-trained Vision Transformers to process tabular data, outperforming traditional methods on small datasets and demonstrating the potential of repurposing image models for tabular tasks.
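The mechanism of projecting tabular rows into token embeddings for a frozen pre-trained ViT encoder can be sketched as follows; `vit_encoder`, the token count, and the single linear projection are generic stand-ins for VisTabNet's actual patch-embedding mapping rather than the released implementation.

```python
import torch
import torch.nn as nn

class TabularToViT(nn.Module):
    """Sketch of routing tabular rows through a frozen pre-trained ViT encoder.

    `vit_encoder` is assumed to be any pre-trained Transformer encoder that maps
    (B, num_tokens, embed_dim) sequences to sequences of the same shape.
    """
    def __init__(self, vit_encoder: nn.Module, num_features: int,
                 num_tokens: int = 16, embed_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.proj = nn.Linear(num_features, num_tokens * embed_dim)
        self.encoder = vit_encoder
        for p in self.encoder.parameters():      # reuse pre-trained weights as-is
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)
        self.num_tokens, self.embed_dim = num_tokens, embed_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, num_features) tabular rows -> (B, num_tokens, embed_dim) pseudo-patches
        tokens = self.proj(x).view(-1, self.num_tokens, self.embed_dim)
        encoded = self.encoder(tokens)           # (B, num_tokens, embed_dim)
        return self.head(encoded.mean(dim=1))    # pool tokens, then classify
```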